There is a popular folk theorem among some Bayesians, to the effect that it is unacceptable for a probability to be 0 or 1. There's a simple motivation for this principle: as rationalists, we demand the opportunity for nature to educate us by blessing us with novel observations. No matter how confident we become in some proposition, it should always be possible for us to change our minds when strong enough evidence accumulates in favour of some alternative. As Karl Popper rightly observed, after all, a theory that is invulnerable to falsification is not much of a theory.

But what happens if P(H | I) becomes zero? How is the probability for the hypothesis, H, to be updated by new evidence? If P(H | I) is 0 then the numerator in Bayes' theorem, prior times likelihood,

P(H | I) × P(D | HI),

is also 0, regardless of how convincing the data, D, may be. No matter what happens, the outcome is unchanged: a nice round posterior of zero.

Similarly, if P(H | I) is 1, then for the converse hypothesis, P(~H | I) is necessarily 0. Now, the denominator in Bayes' theorem is

P(H | I) × P(D | HI) + P(~H | I) × P(D | ~HI)

and when the second term (everything after the plus sign) is zero, both numerator and denominator in Bayes' theorem are the same, producing the ratio 1, for all eternity.
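The arithmetic of these two degenerate cases is easy to check numerically. The sketch below implements Bayes' theorem for a binary hypothesis; the particular likelihood values are arbitrary illustrations:

```python
def posterior(prior_h, like_h, like_not_h):
    """Bayes' theorem for a binary hypothesis:
    P(H | DI) = P(H|I) P(D|HI) / [P(H|I) P(D|HI) + P(~H|I) P(D|~HI)]."""
    numerator = prior_h * like_h
    denominator = numerator + (1.0 - prior_h) * like_not_h
    return numerator / denominator

# Prior of 0: the numerator vanishes, so no data can move the posterior.
print(posterior(0.0, like_h=0.99, like_not_h=0.01))   # 0.0

# Prior of 1: the second term of the denominator vanishes, so the ratio
# is pinned at 1 no matter how strongly the data favour ~H.
print(posterior(1.0, like_h=0.01, like_not_h=0.99))   # 1.0

# An intermediate prior, by contrast, responds to the evidence.
print(posterior(0.5, like_h=0.99, like_not_h=0.01))   # 0.99
```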

So I have sympathy with this motivation, but as a general rule the principle itself is utter nonsense, the result of forgetting one of the most basic facts about how inference works. The mathematics I have just described is all correct, but there are other ways for us to change our minds, and retain our rationality.

A recent, brief discussion at another website drew my attention to an article by Eliezer Yudkowsky, in which he also argues that 0 and 1 are not probabilities. The argument is a little different: the amount of evidence (the likelihood ratio, expressed in log-odds form) needed to update an intermediate probability to 0 or 1 is infinite. Such infinite certainty, he claims, is an absurdity that cannot be represented with real numbers, and so 0 and 1 aren't probabilities.

Yudkowsky, as many readers will know, is a widely respected thinker and writer on the topic of applied rationality, and I can recommend his writing most highly. The overlap between his broad philosophy and mine is, I would say, very large, with the main difference that in cases where I lack mastery of the theoretical apparatus, he very often does not. Yudkowsky knows and understands the mind-projection fallacy better than the vast majority (see for example his article of the same name, and this followup), but in this instance, he seems to have forgotten it. It is essentially the same error made by all who claim that probabilities equal to zero or one should not enter one's calculations.

A little thought experiment, then, before resolving the paradox. Let H be the hypothesis that in some five-day interval, at some location on the Earth, the sun will rise on each of the five mornings. Let D represent the observation of the sun rising on the first of the mornings in question. What is P(D | HI)? I humbly submit that it is 1. Is H, therefore, not an appropriate, well-formed hypothesis? Is D not a valid observation? Evidently, if probability theory is to have any power at all, it must be capable of supporting hypotheses such as H, and data as trivial as D. It is inconceivable that such things should be automatically ruled out under our epistemology.
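A likelihood of exactly 1 poses no obstacle to updating, either. As a sketch, suppose we compare H against an invented alternative, H2, under which each sunrise has probability 0.5 (both H2 and the 50/50 prior split are arbitrary assumptions for illustration):

```python
# H:  the sun rises on all five mornings, so P(D | H, I) = 1 exactly
#     for the single observed sunrise D.
# H2: an invented alternative under which each sunrise independently
#     has probability 0.5 (both H2 and the 50/50 prior are arbitrary
#     illustrative assumptions).
prior_h, prior_h2 = 0.5, 0.5
like_h, like_h2 = 1.0, 0.5      # P(D | H, I), P(D | H2, I)

evidence = prior_h * like_h + prior_h2 * like_h2
post_h = prior_h * like_h / evidence
print(post_h)   # 2/3: the unit likelihood updates perfectly well
```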

In general, it is perfectly legal for P(D | HI) (or, for that matter, a posterior, like P(H | DI)) to be 0 or 1, but here's that basic fact about probability that we have to keep in mind: a probability cannot be divorced from the model within which it is calculated. A model may imply infinite certainty without any person ever achieving that state (which would be impossible to encode in their brain, anyway). Our notation says something very important: P(D | HI), whatever its value, is necessarily contingent upon the conjunction HI, which obviously depends on the truth of I. And of I we can never be absolutely certain.

The all-important "I" that forms the foundation for every Bayesian calculation is usually said to stand for 'information' - all the relevant prior knowledge we have. Unfortunately, this creates a little trap that too many fall into: forgetting that another component besides information is needed before "I" is fully populated. "I" could just as easily stand for 'imagination.' To get Bayes' theorem to do any useful work for us, we have to specify a theoretical framework. We have to make certain assumptions, including the specification of a full set of hypotheses against which H is to compete. To arrive at a candidate set of hypotheses, we must make a leap of the imagination. There is no possible criterion for judging whether or not all our assumptions are correct, and no way to know in advance whether we have chosen the 'correct' set of hypotheses. To think otherwise is just wishful thinking.

To think that the infinite confidence implied under some "I" represents the actual infinite confidence of some physical rational agent is the mind-projection fallacy. Rather, a probability is a model of the confidence a rational agent would have if "I" were known to be true. That this confidence might need to be modelled using a non-numeric concept such as infinity is merely an uncomfortable (though often highly convenient) mathematical fact.

And now we can see how it is that we can continue to accrue knowledge under the threat of the apparent epistemological *cul de sac* that is P = 1 or P = 0. To liberate ourselves from the straitjacket of "I", we simply need to recognize that what we now call "I" is itself merely a hypothesis in some broader hierarchical model. This is how model checking (wielding the analytical blade of model comparison) works, which, as I pointed out before, seems philosophically unpalatable to many, yet is in fact an essential ingredient in our inferential machinery. This is how we can come to look again at our theoretical framework and say 'hold on, I should be working with a different hypothesis space.' Novel theories and scientific revolutions would be impossible without this flexibility.

Some see this need in Bayesian epistemology to make assumptions in "I" that can't be established with certainty as a severe weakness, but it isn't - at least not one that can be avoided (no matter how many black belts we hold in the ancient art of self-deception). We can always extend the scope of our hypothesis space so that some of our assumptions become themselves random variables in a wider inferential context, but to have all of them take on the role of hypotheses under test would require an infinitely deep hierarchy of models. In the example above, where H was a hypothesis about the sun rising, one might argue that a more sophisticated model would account for the possibility, however small, that my sensation of the sun rising was mistaken. Indeed, this is correct, and would prevent the likelihood function from going to 1. Sooner or later, though, I'm going to have to introduce a definitive statement - one that supposes something to be definitely true - in order to avoid the intractable quagmire of infinite complexity.
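As a sketch of that widening of the model, we can introduce a small probability that my sunrise report is mistaken (the parameter eps and the prior below are invented for illustration); the likelihood then falls just short of 1, and the posterior is freed from certainty:

```python
def posterior(prior_h, like_h, like_not_h):
    # Bayes' theorem for a binary hypothesis.
    numerator = prior_h * like_h
    return numerator / (numerator + (1.0 - prior_h) * like_not_h)

eps = 1e-6        # invented probability that a sunrise report is mistaken
prior_h = 0.5     # arbitrary illustrative prior

# Under the original background information I, the likelihood is exact:
print(posterior(prior_h, 1.0, 0.0))   # 1.0: certainty, forever

# Under the wider model I', a report is correct only with probability
# 1 - eps, and could be produced in error even if the sun did not rise:
print(posterior(prior_h, 1.0 - eps, eps))   # just below 1: still updatable
```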

The early frequentists (and some still do, in private communication with me) claimed that this subjectivity of Bayesian probability is its downfall, but in reality it is impossible to learn in a vacuum: no kind of inference is possible without assumptions. Part of the beauty of Bayesian learning is that we make our assumptions explicit. The frequentists, of course, also make assumptions (see Yudkowsky, for example), but by refusing to acknowledge them, like the fabled ostrich sticking its head in the sand, they eliminate the possibility of examining whether or not they are reasonable, of understanding their consequences, or of correcting them when they are manifestly wrong.

I agree that a probability P(D|H) of data given an exact and fully specified world-model H can be 0 or 1, so long as we neglect the probability that you have made an error in your calculations.

Eliezer, many thanks for commenting, and for making what seems to be a gracious admission that 0 and 1 are probabilities after all. I presume you don't feel there is anything illegal about the form of the assumptions needed?

Cheers,

Tom