jeremybernste.in

Machine learning is just statistics + quantifier reversal

Los Angeles, 21 Oct 2021

We all know how deep learning works. You pick a neural architecture that’s a good model of your data—it respects the symmetries and such—you train it by gradient descent on a ginormous dataset, and hey presto! you’ve got yourself a generalising classifier.

Then there are those pesky questions. The kinds of question it’s best just to ignore. You know, sleep soundly… go make a million dollars. But—wait, you’re still here, right?? Phew, okay. Back to those questions. So why exactly does that classifier generalise? If a neural net can represent basically any function, why does gradient descent not return those really weird horrible functions that heavily overfit the training set? In other words, why does deep learning work?

In a recent blog post titled “Machine learning is not nonparametric statistics”, Ben Recht speaks to some of the difficulties in applying classical statistical tools to understand why machine learning works. A core piece of his argument goes as follows: Given a fixed classifier, we can assess the classifier’s population error rate using a sample of data and basic statistics. But in machine learning there’s a switcheroo—we select the sample of data first, and then we use that data to select the classifier. This means those classical statistical tools don’t work anymore. What gives?

It turns out that back in 1998, David McAllester worked out an elegant way to deal with this switcheroo that he called quantifier reversal. By applying quantifier reversal, those classical statistical tools become useful again. So what is quantifier reversal? ~~How can it make me a million dollars?~~ And what can it tell me about why machine learning works? That’s what I’m going to answer in this post!

To give you a taste of what’s to come, quantifier reversal leads to a generalisation bound that:

doesn’t care about the number of parameters in the model;
holds for the bulk of classifiers that fit the data;
and allows for the existence of classifiers that overfit horribly.

What statistics has to say

First of all, it’s going to help to look at what statistics already tells us. Let’s suppose we have a fixed classifier $f$ that we have already chosen in advance, and we want to know how good it is. Formally, we want to know the classifier’s error $\mathcal{E}$ averaged over a distribution of data $\mathcal{D}$. For simplicity, we’ll assume we’re dealing with 0–1 error, defined via:

\[\mathcal{E}(f(x),y):= \begin{cases}0\qquad \text{ if }f(x) = y,\\1\qquad \text{ if }f(x) \neq y.\end{cases}\]

Well, the obvious place to start is to collect a really big sample of $n$ datapoints drawn iid from $\mathcal{D}$ (let’s call this sample $X$) and to approximate the population error $\overline{\mathcal{E}}$ of $f$ by the average error on the sample $X$:

\[\overline{\mathcal{E}}(f):=\mathbb{E}_{(x,y)\sim\mathcal{D}} \, \mathcal{E}(f(x),y) \approx \frac{1}{n}\sum_{(x,y)\in X} \mathcal{E}(f(x),y).\]

This is just Stat 101, people. But let’s go a bit further. Suppose we find that the sample error is actually zero—it’s looking like the classifier is a pretty good one, no? But what does this actually tell us about the population error? What we’re going to do next is derive a rigorous rule for drawing conclusions about the population error $\overline{\mathcal{E}}(f)$ given that $f$ fit $X$ perfectly.

First let’s observe that for 0–1 error, the population error is equivalent to the probability of misclassifying a freshly drawn datapoint:

\[\overline{\mathcal{E}}(f) := \mathbb{E}_{(x,y)\sim\mathcal{D}} \, \mathcal{E}(f(x),y) = \mathbb{P}_{(x,y)\sim\mathcal{D}} \, [f(x)\neq y].\]

Then it follows that the probability that our already-chosen classifier $f$ correctly classifies all $n$ iid samples in $X$ is given by:

\[\mathbb{P}_X\left[f \text{ fits } X \text{ perfectly} \right] = \Big(1-\overline{\mathcal{E}}(f)\Big)^{n}. \tag{1}\]

In words, this says that the smaller the actual error rate $\overline{\mathcal{E}}$ of $f$ is, then the more likely the classifier $f$ is to perfectly fit the fresh data sample $X$. Now, let $0<\delta<1$ be a parameter of our choice, and consider the following rule:

\[\text{If } f \text{ fits } X \text{ perfectly, conclude that }\Big(1-\overline{\mathcal{E}}(f)\Big)^n \geq \delta.\]

In the case that $(1-\overline{\mathcal{E}}(f))^n$ is larger than $\delta$, then clearly this rule cannot fail. Similarly, the rule will not fail us if $f$ does not fit $X$ perfectly. For the rule to fail us, it must be the case that both $(1-\overline{\mathcal{E}}(f))^n < \delta$ and $f$ fits $X$ perfectly. By (1), the probability of this happening is less than $\delta$. This means that the probability of the rule failing in any eventuality is less than $\delta$, which can be neatly summarised by the following statement:

\[\forall f \,\,\forall^\delta X, \,\,\, \left[f \text{ fits } X \text{ perfectly}\implies\overline{\mathcal{E}}(f) \leq \frac{\log(1/\delta)}{n}\right]. \tag{$2$}\]

In this statement, $0<\delta<1$ is still an arbitrary parameter, and $\forall^\delta$ is a special notation used to mean “for all excluding a fraction $\delta$”. We have also sneaked a simplication, since $(1-\overline{\mathcal{E}}(f))^n \geq \delta$ implies that $\overline{\mathcal{E}}(f) \leq \frac{\log(1/\delta)}{n}$.

To summarise, we have used basic statistical tools to derive a rule for assessing the population error $\overline{\mathcal{E}}$ of any fixed classifier $f$. The rule depends logarithmically on a parameter $\delta$, and will only fail on a fraction $\delta$ of possible data samples $X$. Crucially, the rule relies on the fact that the classifier $f$ is fixed first, and that $f$ is not chosen with reference to the sample $X$.

Quantifier reversal

The rule given in $(2)$ is certainly pretty nice, and it motivates the standard practice in machine learning of using a cough fresh cough test set to evaluate the performance of a classifier. But unfortunately it’s not helpful for machine learning training, where we use the same data sample—called the training set—both to select a classifier and to assess its quality.

A very concise way to phrase this issue is that in $(2)$, the quantifiers are the wrong way round—where by quantifiers, we mean the “for all” symbols. The ordering $\forall f \,\,\forall^\delta X$ corresponds to first picking a classifier $f$ and then picking a data sample $X$. What we’d really like is a statement similar to $(2)$ but with the reversed ordering $\forall^\delta X \,\, \forall f$. That would correspond to picking the data sample first, then allowing us to pick the classifier any way we like.

You might be tempted to just take $(2)$ and switch the order of the quantifiers while no one’s looking. (Phycisists, I’m looking at you!) But that’s actually basically illegal, and McAllester would like to have a stern word with you. If something holds for all $f$ for nearly all $X$, it does not necesarily hold for nearly all $X$ for all $f$. To give a simple example of how this kind of reversal can fail, consider a large identity matrix:

\[\left( \begin{array}{ccccc} 1&0&0&\cdots &0\\ 0&1&0&\cdots &0\\ 0&0&1&\cdots &0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\cdots &1 \end{array} \right).\]

While it’s true that for every column nearly every row entry is zero, it is certainly not true that for nearly every row all column entries are zero.

So, what can we say? The thing to notice is that there certainly are a lot of zeroes whichever way you look at the matrix. In fact, given that for every column nearly every row entry is zero, we are safe to say that for nearly every row nearly all column entries are zero.

In his 1998 paper, McAllester worked out really carefully how you’re allowed to reverse quantifiers in a statement like $(2)$. McAllester showed that $(2)$ in fact implies the following quantifier-reversed statement:

\[\displaylines{\forall^\delta X \,\, \forall \alpha>0 \,\, \forall^\alpha f,\qquad \qquad \qquad \qquad \qquad \qquad \qquad \\ f \text{ fits } X \text{ perfectly}\implies\overline{\mathcal{E}}(f) \leq \frac{\log(n/(\alpha\delta))}{n-1}.} \tag{3}\]

To spell out what’s going on, (3) is like (2) except it holds for all classifiers excluding a fraction $\alpha$. There has also been a slackening on the strength of the conclusion we can draw on the population error $\overline{\mathcal{E}}(f)$. But crucially, in (3) the data sample $X$ is chosen first, and aside for a fraction $\alpha$ of classifiers that are forbidden, we can choose $f$ however we like—$f$ can even depend on the data sample $X$. Machine learners, we’re in business!

A PAC-Bayesian theorem

It’s at this point in the post that I should tell you I’ve been teaching you PAC-Bayes theory without you even realising it. How devious! Only one more step is needed to get a really useful result. I’m going to present a slight twist on McAllester’s classic theorem that recently appeared in a paper by Guillermo Valle-Pérez and Ard Louis at the University of Oxford.

The idea is that while (3) holds for nearly all classifiers ($\forall^\alpha f$), we really only care about understanding those classifiers that fit $X$. We can restrict (3) to only range over those classifiers that fit $X$ perfectly, but since a proportion $\alpha$ of all classifiers are excluded, we’ll just need to be careful to work out what proportion of classifiers that fit $X$ need to be excluded.

Given a training sample $X$, we’ll let $\mathbb{P}_f(X)$ denote the proportion of classifiers that fit $X$ perfectly. In other words, $\mathbb{P}_f(X)$ is the probability of sampling a classifier $f$ at random and finding that it fits $X$. Then we have the following two ratios:

\[\begin{aligned}\text{classifiers that fit $X$ perfectly} &: \text{all classifiers } = \mathbb{P}_f(X),\\ \text{excluded classifiers} &: \text{all classifiers } = \alpha. \end{aligned}\]

By dividing these ratios, we observe that:

\[\text{excluded classifiers} : \text{classifiers that fit $X$ perfectly} = \alpha / \mathbb{P}_f(X).\]

What this means is that the statement in (3) must hold for all classifiers that fit $X$ excluding a fraction $\alpha/\mathbb{P}_f(X)$. Or in symbols:

\[\displaylines{\forall^\delta X \,\, \forall \alpha>0 \,\, \forall^{\alpha/\mathbb{P}_f(X)} \text{ of classifiers } f \text{ that fit } X \text{ perfectly,} \qquad \qquad \\ \qquad \qquad \overline{\mathcal{E}}(f) \leq \frac{\log(n/(\alpha\delta))}{n-1}.}\]

By a change of variables $\gamma = \alpha / \mathbb{P}_f(X)$, we have reached our destination:

\[\displaylines{\forall^\delta X \,\, \forall \gamma>0 \,\, \forall^{\gamma} \text{ of classifiers } f \text{ that fit } X \text{ perfectly,} \qquad \quad \\ \qquad \quad \overline{\mathcal{E}}(f) \leq \frac{\log(1/\mathbb{P}_f(X)) + \log(n/(\gamma\delta))}{n-1}.} \tag{4}\]

To interpret this statement in words, it helps to first notice that for a large number of datapoints $n$ that the $\log(n/(\gamma\delta))$ term is negligible and can be ignored. Then the statement reads as follows: For nearly all data samples, for nearly all classifiers that perfectly fit the data sample, the population error is smaller than $\log(1/\mathbb{P}_f(X))$ divided by the number of datapoints.

To summarise, we have combined basic statistical tools with quantifier reversal to derive a statement on the population error of nearly any classifier that fits nearly any data sample. In particular, we have achieved our goal of making a statistical inference about population error in a setting where the data sample is fixed before the classifier is chosen.

At this point, it’s worth recapitulating the argument that got us here. Given any classifier, it’s highly unlikely for that classifier to fit a randomly drawn data sample and not generalise well (2). This implies the reverse statement that for nearly all randomly drawn data samples, hardly any classifiers will fit the sample and fail to generalise (3). Finally this implies that, provided $\mathbb{P}_f(X)$ is large enough meaning lots of classifiers actually do fit the training sample, then hardly any of them will fail to generalise (4).

On a historical note, David Mackay conjectured such a relationship between $\mathbb{P}_f(X)$ and generalisation back in 1995. Mackay referred to $\mathbb{P}_f(X)$ as the Bayesian evidence for the data under the family of classifiers. Go Bayesians!

Interpretations

What does all of this mean for machine learning? The main message is that so long as you don’t get really unlucky with your training set, and provided your learning algorithm does not use the training set to select a classifier in a really weird way, then population error will be bounded as in (4).

To take this further, if the number of data points $n$ dominates $\log(1/\mathbb{P}_f(X))$, then for nearly all training sets, nearly all classifiers that perfectly fit the training set will generalise well.

Let’s put this into context for deep learning. While many neural nets might be capable of perfectly fitting a particular training sample, they might need to contort themselves into a really weird weight configuration in order to do so, meaning that $\mathbb{P}_f(X)$ is small. What this theory suggests is that what really matters is that the neural net is able to easily fit the training sample, in the sense that it fits the sample in lots and lots of different ways. Then $\mathbb{P}_f(X)$ will be large and most of those classifiers will generalise by (4).

It’s also useful to compare these ideas to, say, Vapnik-Chervonenkis learning theory. While VC theory derives a bound on the population error of all classifiers, the statement in (4) only holds for nearly all classifiers. The statement in (4) is therefore compatible with the existence of neural nets that overfit horribly—it’s just they must be the exception and not the rule.

Horizons

So is that it? Machine learning is just statistics plus quantifier reversal, and we can all go home? Even taking this theory at face value, life cannot be that simple. Let’s give a few examples of what’s missing:

The theory does not tell us how to design a neural architecture that maximises $\mathbb{P}_f(X)$. That said, a few basic tricks do become obvious: for instance, using weight-tying to rule out any classifiers that violate the symmetries of the data should be a cheap way to ramp up $\mathbb{P}_f(X)$.
The theory doesn’t tell us why gradient descent is able to find classifiers that fit the data in the first place. Although it does suggest that in situations where deep learning should generalise well, $\mathbb{P}_f(X)$ is relatively large and so gradient descent is not exactly seeking a needle in a haystack. But going beyond this sort of plausibility argument is somewhere where the neural tangent kernel series of works could help.

Finally, I want to draw attention to a limitation of the STAT + QR = ML theory itself. As we have discussed, (4) bounds the generalisation error of the bulk of classifiers that fit a certain training sample. What this doesn’t rule out is that certain special classifiers could generalise much better than the bulk!

My advisor and I explore this insight in a recent paper. We point out that large margin classifiers can actually generalise much better than the bulk because they have less random fluctuation in their predictions. What’s more, gradient descent can be used to target these large margin classifiers.

Our finding paints a nuanced portrait of generalisation as relying on the interplay between the bulk properties of the neural architecture as well as the special properties of the network returned by gradient descent.

Closing remarks

An oft-cited mystery of deep learning is the ability of neural networks with gazillions of parameters to generalise, despite having the capacity to overfit horribly. But do we need fundamentally new theories to explain this?

As we have seen in this post, quantifier reversal yields a generalisation bound that applies to the bulk of classifiers, leaving room for some classifiers to overfit. The bound depends on a simple quantity: the proportion of classifiers that fit the data sample. This quantity has no intrinsic dependence on the number of parameters in a neural network.

Quantifier reversal and the resulting PAC-Bayes bound (4) are powerful tools for understanding generalisation. I am hoping that this post can lead to greater awareness of these results in the machine learning community.

References

This post piggybacks relevance off of the following blog post:

Machine learning is not nonparametric statistics by Ben Recht.

The technical content is an exposition of results in the following papers:

Some PAC-Bayesian theorems by McAllester;
Generalization bounds for deep learning by Valle-Pérez & Louis.

We have also mentioned the following papers:

Probable networks and plausible predictions by Mackay;
Neural tangent kernel by Jacot, Gabriel & Hongler;
Implicit biases of architecture & gradient descent by Bernstein & Yue.

If I’m missing any important related work ~~don't hate me~~ please let me know.

Acknowledgements

Thanks to Chris Yeh for feedback on this post.