Free energy is a mathematical function that can be employed in statistical inference. This post offers a few different ways to understand its definition.

### Free energy guides statistical inference

You perform statistical inference when you observe some data \(\class{mj_red}{x}\), and update your degrees of belief \(\class{mj_yellow}{q(\class{mj_blue}{w})}\) in unobservable states of the world \(\class{mj_blue}{w}\).

In order for this kind of inference to be useful, the unobservable states must be statistically related to the observable data. In other words there must be a statistical relationship \(\class{mj_grey}{p(\class{mj_blue}{w},\class{mj_red}{x})}\) between \(\class{mj_blue}{w}\) and \(\class{mj_red}{x}\).

Free energy, which we will label \(F\), is a function of \(\class{mj_grey}{p},\class{mj_yellow}{q}\) and \(\class{mj_red}{x}.\) One way free energy can guide inference is via the following rule: upon observing \(\class{mj_red}{x}\), and assuming \(\class{mj_grey}{p}\), your degree of belief \(\class{mj_yellow}{q}\) should be chosen so that \(F\) is as small as possible.

For an accessible treatment of free energy in statistical inference see my post *The simplest possible model of the free energy principle*.
If you haven’t read it yet, *and if* you are unfamiliar with statistical inference, you might want to give it a go before reading on.

In the rest of this post, we will be trying to understand how to interpret \(F\). Our target is an intuitive understanding of why minimising it is a desirable goal of statistical inference. The following example will help illustrate the discussion, but go to the post linked above to see the example worked through to its conclusion.

### Example: where’s my cat?

You have a cat that spends its time in either the \(\class{mj_blue}{\text{kitchen}}\) or the \(\class{mj_blue}{\text{bedroom}}\). When it’s in the kitchen, it often \(\class{mj_red}{\text{meows}}\) for food; when it’s in the bedroom, it often \(\class{mj_red}{\text{purrs}}\) loudly.

Suppose you tally the proportion of the times your cat is in each place and making each noise. The results might look something like this:

$$ \begin{equation*} \begin{array}{cc} & & \class{mj_red}{\text{Cat noise}}\\ & & \begin{array}{cc} \class{mj_red}{\text{meow}} & \ \class{mj_red}{\text{purr}} \end{array}\\ \class{mj_blue}{\text{Cat location}}& \begin{array}{c} \class{mj_blue}{\text{kitchen}}\\ \class{mj_blue}{\text{bedroom}}\end{array} & \left(\begin{array}{c|c} 40\% & 20\%\\ \hline 10\% & 30\% \end{array}\right) \end{array} \end{equation*} $$

Now suppose you are in the living room and you hear a \(\class{mj_red}{\text{meow}}\). You can’t tell whether the sound came from the \(\class{mj_blue}{\text{kitchen}}\) or \(\class{mj_blue}{\text{bedroom}}\), but you do know the statistics given in the table above. Statistical inference is the process of answering the question: what is the probability of the cat being in one location or the other, given that we heard it \(\class{mj_red}{\text{meowing}}\)?

### Inputs to the free energy function

Free energy takes three inputs, and gives a real number as output. For clarity, here are those inputs again, with reference to the cat example:

- \(\class{mj_red}{x}\): the observable data on the basis of which you will perform inference. In the example above, we said that we heard the cat \(\class{mj_red}{\text{meowing}}\), so \(\class{mj_red}{x=\text{meowing}}.\)
- \(\class{mj_yellow}{q(\class{mj_blue}{w})}\): your degree of belief in unobservable parts of the world, after having performed inference. This is a *probability vector* (mathematician-speak for “a list of numbers that add up to \(1\)”). For example, \(\class{mj_yellow}{q(\class{mj_blue}{w})} = \left(\frac{6}{10},\frac{4}{10}\right)\) means you are \(60\%\) sure the cat is in the \(\class{mj_blue}{\text{kitchen}}\) and \(40\%\) sure the cat is in the \(\class{mj_blue}{\text{bedroom}}.\) (We’re assuming the cat can *only* be in one of those two places.)
- \(\class{mj_grey}{p(\class{mj_blue}{w},\class{mj_red}{x})}\): the statistical connection assumed to hold between observed data \(\class{mj_red}{x}\) and unobservable parts of the world \(\class{mj_blue}{w}\). This is defined by the table above. To make it mathematically correct, we convert the percentages to fractions, so \(\class{mj_grey}{p(\class{mj_blue}{\text{kitchen}},\class{mj_red}{\text{meow}})}=\frac{4}{10}\) and so on.
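To make these inputs concrete, here is one way they might be written down in Python for the cat example. This is only a sketch: the names `p_joint`, `x` and `q` are assumptions chosen for illustration, and plain dictionaries are used rather than any particular library.

```python
# Joint distribution p(w, x) from the table, with percentages as fractions.
# Keys are (location, noise) pairs; the variable names are illustrative only.
p_joint = {
    ("kitchen", "meow"): 0.4,
    ("kitchen", "purr"): 0.2,
    ("bedroom", "meow"): 0.1,
    ("bedroom", "purr"): 0.3,
}

# The observed data x: we heard a meow.
x = "meow"

# A candidate degree of belief q(w): 60% kitchen, 40% bedroom.
q = {"kitchen": 0.6, "bedroom": 0.4}

# Both p and q are probability distributions, so each must sum to 1
# (up to floating-point rounding).
assert abs(sum(p_joint.values()) - 1.0) < 1e-12
assert abs(sum(q.values()) - 1.0) < 1e-12
```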

When we say the statistical connection \(\class{mj_grey}{p}\) is “assumed” to hold, we mean both that you are operating under the assumption that \(\class{mj_grey}{p}\) is the correct statistical connection, *and* that this assumption is correct.
In other words: \(\class{mj_grey}{p}\) is true and you make inferences as though it were true.
(More complex forms of inference weaken this assumption, and describe how to update \(\class{mj_grey}{p}\) on the basis of multiple observations \(\class{mj_red}{x_t}, \class{mj_red}{x_{t+1}},…\). In this post we will ignore those more difficult ideas.)

All right. Free energy has three inputs. But what does it do with them?

### Free energy: Compact form

There are three useful ways to write free energy, as far as I know. Here is the first, which I call “compact form”, because it’s shorter than the other two:

$$ F= \class{mj_green}{\sum_{\class{mj_blue}{w}} \left( \class{mj_yellow}{q(\class{mj_blue}{w})} \class{mj_lavender}{\log \left( \frac{\class{mj_yellow}{q(\class{mj_blue}{w})}} {\class{mj_grey}{p(\class{mj_blue}{w},\class{mj_red}{x})}} \right)} \right)} $$
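As a rough illustration, the compact form can be evaluated directly for the cat example. The sketch below assumes natural logarithms (the formula does not fix a base) and uses a hypothetical function name, `free_energy_compact`.

```python
import math

# Joint distribution p(w, x) from the cat table.
p_joint = {
    ("kitchen", "meow"): 0.4,
    ("kitchen", "purr"): 0.2,
    ("bedroom", "meow"): 0.1,
    ("bedroom", "purr"): 0.3,
}


def free_energy_compact(q, p_joint, x):
    """Sum over w of q(w) * log(q(w) / p(w, x)), using natural logs."""
    total = 0.0
    for w, q_w in q.items():
        if q_w > 0:  # a term with q(w) = 0 contributes nothing to the sum
            total += q_w * math.log(q_w / p_joint[(w, x)])
    return total


# Belief of 60% kitchen, 40% bedroom, after hearing a meow.
F = free_energy_compact({"kitchen": 0.6, "bedroom": 0.4}, p_joint, "meow")
print(round(F, 4))  # 0.7978
```

The number on its own means little. It becomes useful when compared with the value of \(F\) for other candidate degrees of belief.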

In a somewhat futile attempt to mimic this incredible colourised explanation of the discrete Fourier transform, I offer the following definition:

Free energy is the average value (weighted by your degree of belief in hidden states) of the log ratio of your degree of belief in hidden states to the assumed statistical relationship between hidden states and observed data.

Or, in the language of cats:

Free energy is the average value (weighted by your degree of belief in where the cat currently is) of the log ratio of your degree of belief in where the cat currently is to the assumed statistical relationship between where the cat currently is and what sound the cat is currently making.

This description might be correct, but it isn’t particularly enlightening.

The main benefits of the compact form are that it’s relatively easy to remember, and the other two forms can be derived from it.
Other than that, it doesn’t provide a great deal of insight into what \(F\) is actually *for*, and why minimising it is a useful thing to do.
For that, we turn to the Bayesian form.

### Free energy: Bayesian form

If you ask me, the best way to think of free energy is as **a measure of the cost of inaccuracy in your degrees of belief**.
Intuitively, there are two ways your degrees of belief can be inaccurate, and so two kinds of cost they might incur.
Free energy describes how these two costs should be balanced against each other.

On the one hand, you shouldn’t hold degrees of belief \(\class{mj_yellow}{q(\class{mj_blue}{w})}\) that are inordinately far away from what you know the true statistics to be, \(\class{mj_grey}{p(\class{mj_blue}{w})}.\)
In short: **Don’t overfit**.
You overfit when you believe something that makes the currently observed data very probable, at the expense of a wider set of possible data.

On the other hand, you shouldn’t hold degrees of belief \(\class{mj_yellow}{q(\class{mj_blue}{w})}\) that fail to explain the data you are presently observing, as measured by \(\class{mj_grey}{p(\class{mj_red}{x}|\class{mj_blue}{w})}\).
In short: **Don’t be conservative** when accounting for new data.

Free energy is equal to a sum of two terms that represent these costs:

$$
F=
\underbrace{
\class{mj_green}{\sum_{\class{mj_blue}{w}}
\left(
\class{mj_yellow}{q(\class{mj_blue}{w})}
\class{mj_lavender}{\log
\left(
\frac{\class{mj_yellow}{q(\class{mj_blue}{w})}}
{\class{mj_grey}{p(\class{mj_blue}{w})}}
\right)}
\right)}}_{
\substack{
\text{Don’t overfit} \\
\text{(i.e. Don’t stray from your prior beliefs)}
}
} + \quad
\underbrace{
\class{mj_green}{\sum_{\class{mj_blue}{w}}
\left(
\class{mj_yellow}{q(\class{mj_blue}{w})}
\class{mj_lavender}{\log
\left(
\frac{1}
{\class{mj_grey}{p(\class{mj_red}{x}|\class{mj_blue}{w})}}
\right)}
\right)}}_{
\substack{
\text{Don’t be conservative} \\
\text{(i.e. Don’t make the data too surprising)}
}
}
$$
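To see where the two terms come from, one can start from the compact form and split the joint distribution using the product rule, \(p(w,x) = p(w)\,p(x|w)\). A sketch of that step, written without the colours:

$$
\begin{aligned}
F &= \sum_{w} q(w) \log\left(\frac{q(w)}{p(w,x)}\right) \\
&= \sum_{w} q(w) \log\left(\frac{q(w)}{p(w)\,p(x|w)}\right) \\
&= \sum_{w} q(w) \log\left(\frac{q(w)}{p(w)}\right) + \sum_{w} q(w) \log\left(\frac{1}{p(x|w)}\right)
\end{aligned}
$$

The last line uses nothing more than \(\log(ab) = \log a + \log b\).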

The first term is **relative entropy** from \(\class{mj_grey}{p}\) to \(\class{mj_yellow}{q}\).
This is a standard way to measure the distance between two probability distributions.
(There’s a *lot* to be said about relative entropy – eventually I’ll write another post on it.)
Because it can be treated as a measure of how far away your degrees of belief are from your prior expectations, it penalises you for straying too far.

The second term measures how **surprising** the data you just observed, \(\class{mj_red}{x}\), would be if you were to believe \(\class{mj_yellow}{q(\class{mj_blue}{w})}\).
You shouldn’t believe something that makes the data you just saw too surprising.

This is sometimes described as “explaining” the data, and that sort of makes sense: if you believe the cat is in the \(\class{mj_blue}{\text{kitchen}},\) for example, and it turns out that \(\class{mj_grey}{p(\class{mj_red}{\text{meowing}}|\class{mj_blue}{\text{kitchen}})}\) is high, you can explain why you heard \(\class{mj_red}{\text{meowing}}\) by appealing to the cat being in the \(\class{mj_blue}{\text{kitchen}}\). Mathematically, high values of \(\class{mj_grey}{p(\class{mj_red}{x}|\class{mj_blue}{w})}\) should be matched with high values of \(\class{mj_yellow}{q(\class{mj_blue}{w})}\) to make that right-hand term low.

In short: the fact that proposed degrees of belief \(\class{mj_yellow}{q(\class{mj_blue}{w})}\) stay close to prior expectations, at the same time as explaining your data, justifies you in adopting \(\class{mj_yellow}{q(\class{mj_blue}{w})}\).
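The trade-off can be seen numerically in the cat example. The sketch below computes the two terms separately for two candidate beliefs: one that simply sticks with the prior, and one that is almost certain the cat is in the kitchen. Natural logs and the helper name `bayesian_form` are assumptions made for illustration.

```python
import math

# Joint distribution p(w, x) from the cat table.
p_joint = {
    ("kitchen", "meow"): 0.4,
    ("kitchen", "purr"): 0.2,
    ("bedroom", "meow"): 0.1,
    ("bedroom", "purr"): 0.3,
}
x = "meow"
locations = ["kitchen", "bedroom"]

# Prior p(w): add up each row of the table over both noises.
prior = {w: sum(p for (w2, _), p in p_joint.items() if w2 == w) for w in locations}
# p(x | w) = p(w, x) / p(w) for the noise we actually heard.
likelihood = {w: p_joint[(w, x)] / prior[w] for w in locations}


def bayesian_form(q):
    """Return (don't-overfit term, don't-be-conservative term) for belief q."""
    stray = sum(q[w] * math.log(q[w] / prior[w]) for w in locations if q[w] > 0)
    surprise = sum(q[w] * math.log(1 / likelihood[w]) for w in locations)
    return stray, surprise


for q in ({"kitchen": 0.6, "bedroom": 0.4}, {"kitchen": 0.99, "bedroom": 0.01}):
    stray, surprise = bayesian_form(q)
    print(q, "stray:", round(stray, 3), "surprise:", round(surprise, 3),
          "F:", round(stray + surprise, 3))
```

Sticking with the prior makes the first term vanish (up to rounding) but leaves the data fairly surprising. Near-certainty about the kitchen lowers the surprise but pays for it in the first term. Neither extreme gives the smallest total.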

Why do I call this “Bayesian form”?
Because it exemplifies the trade-off at the heart of Bayesian inference.
In Bayesian inference, we calculate the correct degrees of belief by multiplying our prior belief in \(\class{mj_blue}{w}\) by a term that describes how probable the observed data \(\class{mj_red}{x}\) would be if \(\class{mj_blue}{w}\) were true.
**Bayes’ rule** –

$$
\class{mj_yellow}{q_{}(\class{mj_blue}{w})}
\quad\text{ought to be equal to}\quad
\underbrace{
\class{mj_grey}{p(\class{mj_blue}{w})}
\vphantom{\frac{\class{mj_grey}{p(\class{mj_red}{x}|\class{mj_blue}{w})}}{\class{mj_grey}{p(\class{mj_red}{x})}}}
}_{
\substack{
\text{Prior belief}\\
\text{in }\class{mj_blue}{w}
}
}
\quad
\underbrace{
\cdot
\vphantom{\frac{\class{mj_grey}{p(\class{mj_red}{x}|\class{mj_blue}{w})}}{\class{mj_grey}{p(\class{mj_red}{x})}}}
}_{
\substack{
\text{multiplied}\\
\text{by}
}
}
\quad
\underbrace{
\frac{\class{mj_grey}{p(\class{mj_red}{x}|\class{mj_blue}{w})}}
{\class{mj_grey}{p(\class{mj_red}{x})}}
}_{
\substack{
\text{Relative probability}\\
\text{of data given }\class{mj_blue}{w}
}
}
$$

– tells you implicitly what \(F\) tells you explicitly: that you ought to balance these two kinds of costs in this particular way. On the other hand, Bayes’ rule tells you something explicitly that \(F\) reveals only implicitly: what \(\class{mj_yellow}{q_{}(\class{mj_blue}{w})}\) ought to be! Fortunately, the next formulation of free energy makes that important piece of information a bit more explicit.
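For the cat example, Bayes’ rule can be applied directly, because the table is small enough to compute every ingredient. A minimal sketch, with variable names chosen for illustration:

```python
# Joint distribution p(w, x) from the cat table.
p_joint = {
    ("kitchen", "meow"): 0.4,
    ("kitchen", "purr"): 0.2,
    ("bedroom", "meow"): 0.1,
    ("bedroom", "purr"): 0.3,
}
x = "meow"
locations = ["kitchen", "bedroom"]
noises = ["meow", "purr"]

# Prior belief in w: p(w), the row sums of the table.
prior = {w: sum(p_joint[(w, n)] for n in noises) for w in locations}
# Probability of the data given w: p(x | w) = p(w, x) / p(w).
likelihood = {w: p_joint[(w, x)] / prior[w] for w in locations}
# Overall probability of the data: p(x), the column sum for the heard noise.
evidence = sum(p_joint[(w, x)] for w in locations)

# Bayes' rule: prior multiplied by the relative probability of the data.
posterior = {w: prior[w] * likelihood[w] / evidence for w in locations}
print(posterior)
```

Up to floating-point rounding, this gives a belief of about \(80\%\) for the kitchen and \(20\%\) for the bedroom after hearing a meow.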

### Free energy: Upper bound form

$$
F=
\underbrace{
\class{mj_green}{\sum_{\class{mj_blue}{w}}
\left(
\class{mj_yellow}{q(\class{mj_blue}{w})}
\class{mj_lavender}{\log
\left(
\frac{\class{mj_yellow}{q(\class{mj_blue}{w})}}
{\class{mj_grey}{p(\class{mj_blue}{w}|\class{mj_red}{x})}}
\right)}
\right)}}_{
\substack{
\text{How far you are from} \\
\text{the correct posterior}
}
} +
\underbrace{
\class{mj_lavender}{\log
\left(
\frac{1}
{\class{mj_grey}{p(\class{mj_red}{x})}}
\right)}
}_{
\substack{
\text{How surprising} \\
\text{is the observed data}
}
}
$$

Again the first term is a relative entropy measure. However, this time it’s from the correct posterior \(\class{mj_grey}{p(\class{mj_blue}{w}|\class{mj_red}{x})}\) to your degree of belief \(\class{mj_yellow}{q(\class{mj_blue}{w})}\). This term penalises you for straying from the correct posterior – which makes perfect sense, given that you want to believe the truth.

(A wrinkle: relative entropy is usually invoked by putting the “correct” distribution and the “approximating” distribution the other way round than they are written here. This raises the question: how should relative entropy be interpreted when its component distributions are swapped? Although it still counts as a measure of the penalty of failing to believe the truth, it’s a slightly different penalty than we’re used to. I don’t yet have an answer to this question. Drop me a line if you have any ideas.)

The second term is the overall surprise of the data \(\class{mj_red}{x}\). Importantly, this second term doesn’t depend on your degrees of belief \(\class{mj_yellow}{q(\class{mj_blue}{w})}\). Two conclusions can therefore be drawn:

- \(\class{mj_yellow}{q(\class{mj_blue}{w})}\) ought to be as close to \(\class{mj_grey}{p(\class{mj_blue}{w}|\class{mj_red}{x})}\) as possible. There are no other constraints on your degree of belief. In this form, \(F\) is no longer telling you to balance two costs, it’s telling you just to believe the truth!
- \(F\) is always greater than or equal to \(\log \frac{1}{\class{mj_grey}{p(\class{mj_red}{x})}}\), the surprise of the data. This is a consequence of the fact that relative entropy is always greater than or equal to zero.

The second point justifies the name “upper bound form”.
It shows that **free energy is an upper bound on surprise**.
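The bound can be checked numerically on the cat example. The sketch below assumes natural logs and uses illustrative helper names. It computes \(F\) for a handful of candidate beliefs and compares each with the surprise, \(\log(1/p(x))\). The gap between the two should equal the relative entropy to the posterior, which is never negative.

```python
import math

# Joint distribution p(w, x) from the cat table.
p_joint = {
    ("kitchen", "meow"): 0.4,
    ("kitchen", "purr"): 0.2,
    ("bedroom", "meow"): 0.1,
    ("bedroom", "purr"): 0.3,
}
x = "meow"
locations = ["kitchen", "bedroom"]

evidence = sum(p_joint[(w, x)] for w in locations)              # p(x)
posterior = {w: p_joint[(w, x)] / evidence for w in locations}  # p(w | x)
surprise = math.log(1 / evidence)                               # log(1 / p(x))


def free_energy(q):
    """Compact form: sum over w of q(w) * log(q(w) / p(w, x))."""
    return sum(q[w] * math.log(q[w] / p_joint[(w, x)]) for w in locations if q[w] > 0)


def relative_entropy(q, p):
    """Sum over w of q(w) * log(q(w) / p(w))."""
    return sum(q[w] * math.log(q[w] / p[w]) for w in locations if q[w] > 0)


for k in (0.1, 0.5, 0.8, 0.95):
    q = {"kitchen": k, "bedroom": 1 - k}
    gap = free_energy(q) - surprise  # should match relative_entropy(q, posterior)
    print(k, round(free_energy(q), 4), round(gap, 4))
```

The gap shrinks to zero (up to rounding) only when the belief matches the posterior, here \(80\%\) kitchen.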

### Why not just always use upper bound form?

Bayesian form gave circuitous advice about what degrees of belief to hold, entreating you to balance the costs of overfitting and being too conservative. But upper bound form is much more explicit, telling you exactly what your degrees of belief ought to be. Surely upper bound form is all we need when doing statistical inference?

There is a significant problem with this idea. The problem arises when you try to compute \(\class{mj_yellow}{q(\class{mj_blue}{w})}\) from \(\class{mj_grey}{p(\class{mj_blue}{w},\class{mj_red}{x})}\) and \(\class{mj_red}{x}\). Bayes’ rule tells you how to do it, but requires an intermediate step in which you calculate \(\class{mj_grey}{p(\class{mj_red}{x})}\). For mathematical reasons I won’t go into in detail here, it is not always feasible to calculate \(\class{mj_grey}{p(\class{mj_red}{x})}\) from \(\class{mj_grey}{p(\class{mj_blue}{w},\class{mj_red}{x})}\): doing so means summing \(\class{mj_grey}{p(\class{mj_blue}{w},\class{mj_red}{x})}\) over every possible \(\class{mj_blue}{w}\), and when there are very many possible states of the world that sum can be too costly to carry out.

If you cannot calculate \(\class{mj_grey}{p(\class{mj_red}{x})}\), you cannot follow Bayes’ rule. You also cannot use the upper bound form of free energy: without \(\class{mj_grey}{p(\class{mj_red}{x})}\), you cannot calculate \(\class{mj_grey}{p(\class{mj_blue}{w}|\class{mj_red}{x})}\). You need to use a different procedure to choose your degrees of belief.

It turns out that in these situations, it is still sometimes possible to calculate the ingredients of the Bayesian form, \(\class{mj_grey}{p(\class{mj_blue}{w})}\) and \(\class{mj_grey}{p(\class{mj_red}{x}|\class{mj_blue}{w})}\).
In other words, **there are situations in which you have all the components of the Bayesian form of free energy, but lack components of the upper bound form**.

In these kinds of situations, you cannot follow Bayes’ rule and you cannot use the upper bound form. But you have all the ingredients required to calculate the Bayesian form of free energy. You can therefore test different candidate degrees of belief, until you find the \(\class{mj_yellow}{q(\class{mj_blue}{w})}\) that makes \(F\) smallest.
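Here is a rough sketch of that search for the cat example, using only \(p(w)\) and \(p(x|w)\) and never computing \(p(x)\). The brute-force grid is an assumption made to keep the illustration short, not a recommendation: real problems typically need a cleverer optimiser. In this tiny example \(p(x)\) is easy to calculate anyway, so the point is only to show the procedure.

```python
import math

# Ingredients of the Bayesian form for x = "meow"; p(x) is never computed.
prior = {"kitchen": 0.6, "bedroom": 0.4}                    # p(w), row sums of the table
likelihood = {"kitchen": 0.4 / 0.6, "bedroom": 0.1 / 0.4}   # p(meow | w)


def free_energy_bayesian(q):
    """Bayesian form of F: relative-entropy term plus surprise term."""
    total = 0.0
    for w, q_w in q.items():
        if q_w > 0:  # terms with q(w) = 0 contribute nothing
            total += q_w * math.log(q_w / prior[w])      # don't overfit
            total += q_w * math.log(1 / likelihood[w])   # don't be conservative
    return total


# Test candidate beliefs q = (k, 1 - k) on a grid and keep the one with smallest F.
candidates = [i / 1000 for i in range(1001)]
best_k = min(
    candidates,
    key=lambda k: free_energy_bayesian({"kitchen": k, "bedroom": 1 - k}),
)
print(best_k)  # 0.8
```

The search lands on \(80\%\) kitchen, the same belief that Bayes’ rule gives for this table.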

In sum, Bayesian form is useful when you can calculate \(\class{mj_grey}{p(\class{mj_blue}{w})}\) and \(\class{mj_grey}{p(\class{mj_red}{x}|\class{mj_blue}{w})}\), but not \(\class{mj_grey}{p(\class{mj_red}{x})}\). Of course, you can use compact form too, but that doesn’t reveal the rationale behind using free energy for statistical inference.