Maximum Entropy: Error Bars for Binary Parameters

Propositions about real phenomena are either true or false. For some logical proposition, e.g. "there is milk in the fridge", let A be the binary parameter denoting its truth value. Now, truth values are not in the habit of marching themselves up to us and announcing their identity. In fact, for propositions about specific things in the real world, there is normally no way whatsoever to gain direct access to these truth values, and we must make do with inferences drawn from our raw experiences. We need a system, therefore, to assess the reliability of our inferences, and that system is probability theory. When we do parameter estimation, a convenient way to summarize the results of the probability calculations is the error bar, and it would seem to be necessary to have some corresponding tool to capture our degree of confidence when we estimate a binary parameter, such as A. But what could this error bar possibly look like? The hypothesis space consists of only two discrete points, and there isn't enough room to convey the required information.

Let me pose a different question: how easy is it to change your mind? One of the important functions of probability theory is to quantify evidence in terms of how easy it would be for future evidence to change our minds. Suppose I stand at the side of a not-too-busy road, and wonder in which direction the next car to pass me will be travelling. Let A now represent the proposition that any particular observed vehicle is travelling to the left. Suppose that, upon my arrival at the scene, I'm in a position of extreme ignorance about the patterns of traffic on the road, that my ignorance is best represented (for symmetry reasons) by indifference, and that my resulting probability estimate for A is 50%.

Suppose that after a large number of observations in this situation, I find that almost equal numbers of vehicles have been going right as have been going left. This results in a probability assignment for A that is again 50%. Here's the curious thing, though: in my initial state of indifference, only a small number of observations would have been sufficient for me to form a strong opinion that the frequency with which A is true, f_A, is close to either 0 or 1. But now, having made a large number of observations, I have accumulated substantial evidence that f_A is in fact close to 0.5, and it would take a comparably large number of observations to convince me otherwise. The appropriate response to possible future evidence has changed considerably, but I used the same number, 50%, to summarize my state of information. How can this be?

In fact, the solution is quite automatic. In order to calculate P(A), it is first necessary to assign a probability distribution over frequency space, P(f_A). I did this in one of my earliest blog posts, in which I solved a thinly disguised version of Laplace's sunrise problem. Let's treat this traffic problem in the same way. My starting position in the traffic problem, indifference, meant that my information about the relative frequency with which an observed vehicle travels to the left was best encoded with a prior probability distribution that has the same value at all points within the hypothesis space. Let's assume also that we start with the conviction (from whatever source) that the frequency, f_A, is constant in the long run and that consecutive events are independent. Laplace's solution (this is, yet again, identical to the sunrise problem he solved just over 200 years ago) provides a neat expression for P(A), known as the rule of succession (p is the probability that the next event is of type X, n is the number of observed occurrences of type X events, and N is the total number of observed events):

p = (n + 1) / (N + 2)

(1)

but his method follows the same route I took when predicting a person's behaviour from past observations: at each possible frequency (between 0 and 1), calculate P(f_A) from Bayes' theorem, using the binomial distribution to calculate the likelihood function. The proposition A can be resolved into a set of mutually exclusive and exhaustive propositions about the frequency, f_A, giving P(A) = P(A[f₁ + f₂ + f₃ + ...]), so that the product rule, applied directly after the extended sum rule, means that the final assignment of P(A) consists of integrating over the product f_A×P(f_A), which we recognize as obtaining the expectation, 〈f_A〉.
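As a rough numerical check of that recipe, here is a minimal Python sketch. It builds the posterior on a grid of frequencies from a flat prior and a binomial likelihood, takes the expectation, and compares it with the closed form usually quoted for the rule of succession, (n + 1)/(N + 2). The function names, the grid size, and the midpoint-rule integration are illustrative assumptions, not something taken from the earlier article.

```python
def posterior_mean_on_grid(n, N, points=20_000):
    """Approximate <f_A> for n 'left' outcomes in N observations, flat prior."""
    df = 1.0 / points
    # Midpoints of equal-width cells covering the interval (0, 1).
    grid = [(i + 0.5) * df for i in range(points)]
    # Flat prior times binomial likelihood. The binomial coefficient is the
    # same at every f, so it cancels when we normalize.
    unnorm = [f**n * (1.0 - f) ** (N - n) for f in grid]
    norm = sum(unnorm) * df
    density = [u / norm for u in unnorm]  # approximates P(f_A | DI)
    # Expectation: integrate f * P(f_A | DI) over the grid.
    return sum(f * p for f, p in zip(grid, density)) * df


def rule_of_succession(n, N):
    """Closed form for the same expectation."""
    return (n + 1) / (N + 2)


for n, N in [(0, 0), (3, 4), (9, 9), (40, 100)]:
    numeric = posterior_mean_on_grid(n, N)
    closed = rule_of_succession(n, N)
    print(f"n = {n:3d}, N = {N:3d}: grid = {numeric:.6f}, closed form = {closed:.6f}")
```

The two columns should agree to within the accuracy of the grid, which is the point: the rule of succession is just the expectation of the posterior over f_A.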

The figure below depicts the evolution of the distribution, P(f_A| DI), for the first N observations, for several N. The data all come from a single sequence of binary uniform random variables, and the procedure follows equation (4), from my earlier article. We started, at N = 0, from indifference, and the distribution was flat. Gradually, as more and more data was added, a peak emerged, and got steadily sharper and sharper:

(The numbers on the y-axis are larger than 1, but that's OK because they are probability densities - once the curve is integrated, which involves multiplying each value by a differential element, df, the result is exactly 1.) The probability distribution, P(f_A| DI), is therefore the answer to our initial question: P(f_A| DI) contains all the information we have about the robustness of P(A) against new evidence, and we get our error bar by somehow characterizing the width of P(f_A| DI).
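One possible way to characterize that width is the standard deviation of the posterior. The sketch below assumes the flat prior and binomial likelihood used above, in which case P(f_A | DI) is a Beta distribution with parameters n + 1 and N − n + 1, and its mean and variance have simple closed forms. Choosing the standard deviation as the error bar is my assumption here; other summaries of the width (such as a credible interval) would serve equally well.

```python
from math import sqrt


def beta_posterior_summary(n, N):
    """Mean and standard deviation of P(f_A | DI), assuming a flat prior.

    With a flat prior and a binomial likelihood, the posterior is a Beta
    distribution with parameters a = n + 1 and b = N - n + 1.
    """
    a, b = n + 1, N - n + 1
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(variance)


# Same 50% for P(A) each time, but very different widths.
for n, N in [(0, 0), (5, 10), (500, 1000), (500_000, 1_000_000)]:
    mean, sd = beta_posterior_summary(n, N)
    print(f"N = {N:>9,}: P(A) = {mean:.4f} +/- {sd:.6f}")
```

Every row reports the same 50%, while the width shrinks roughly as one over the square root of N, which is exactly the difference between fresh indifference and hard-won knowledge.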

Now, an important principle of probability theory requires that the order with which we incorporate different elements of the data, D, does not affect the final posterior supplied by Bayes' theorem. For D = {d₁, d₂, d₃, ...}, we could work the initial prior over to a posterior using only d₁, then, using this posterior as the new prior, repeat for d₂, and so on through the list. We could do the same thing, only taking the d's in any order we choose. We could bundle them into sub-units, or we could process the whole damn lot in a single batch. The final probability assignment must be the same in each case. Violation of this principle would invalidate our theory (assuming there is no causal path, e.g. if I'm observing my own mental state, from knowledge of some of the d's to subsequent observed d's).

For example, each curve on the graph above shows the result from a single application of Bayes' theorem, though I could just as well have processed each individual observation separately, producing the same result. This works because the prior distribution is changing with each new bit of data added, gradually recording the combined effect of all the evidence. Each d_i becomes subsumed into the background information, I, before the next one is treated.
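The order-independence described above is easy to check numerically. The following is a minimal sketch, assuming a discrete grid of candidate frequencies and simulated 50/50 data; the helper names, the grid size, and the seed are all illustrative choices. It processes the same observations as a single batch, one at a time, and one at a time in shuffled order.

```python
import random


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def update(prior, grid, observations):
    """One application of Bayes' theorem on a grid of candidate frequencies.

    observations is a list of booleans: True means the vehicle went left.
    """
    n = sum(observations)
    N = len(observations)
    return normalize([p * f**n * (1.0 - f) ** (N - n) for p, f in zip(prior, grid)])


points = 1001
grid = [i / (points - 1) for i in range(points)]
flat = normalize([1.0] * points)  # indifference: equal mass on every grid point

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [rng.random() < 0.5 for _ in range(200)]

# Whole data set in a single batch.
batch = update(flat, grid, data)

# One observation at a time, each posterior becoming the next prior.
one_by_one = flat
for d in data:
    one_by_one = update(one_by_one, grid, [d])

# One at a time again, but in a different order.
shuffled = data[:]
rng.shuffle(shuffled)
reordered = flat
for d in shuffled:
    reordered = update(reordered, grid, [d])

max_gap = max(
    max(abs(b - s), abs(b - r)) for b, s, r in zip(batch, one_by_one, reordered)
)
print("largest difference between the three posteriors:", max_gap)
```

Any difference that shows up should be at the level of floating-point rounding, not a real disagreement between the three routes.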

But we might have the feeling that something peculiar happens if we try to carry this principle over to the calculation of P(A | DI). What is the result of observing 9 consecutive cars travelling to the left? It depends what has happened before, obviously. Suppose D₁ is now the result of 1 million observations, consisting of exactly 500,000 vehicles moving in each direction. The posterior assignment is almost exactly 50%. Now I see D₂, those 9 cars travelling to the left - what is the outcome? The new prior is 50%, the same as it was before the first observation.

What the hell is going on here? How do we account for the fact that these 9 vehicles have a much weaker effect on our rational belief now, than they would have done if they had arrived right at the beginning of the experiment? The outcome of Bayes' theorem is proportional to prior times likelihood: P(A | I)×P(D | AI). Looking at 2 very different situations, 9 observations after 1 million, and 9 observations after zero, the prior is the same, the proposition, A, is the same, and D is the same. The rule of succession with n = N = 9 gives the same result in each case. It seems like we have a problem. We might solve the problem by recognizing that the correct answer comes by first getting P(f_A| DI) then finding its expectation, but how did we recognize this? Is it possible that we rationally reached out to something external to probability theory to figure out that direct calculation of P(A | DI) would not work? Could it be that probability theory is not the complete description of rationality? (Whatever that means.)

Of course, such flights of fancy aren't necessary. The direct calculation of P(A | DI) works perfectly fine, as long as we follow the procedure correctly. Let's define 2 new propositions,

L = A = "the next vehicle to pass will be travelling to the left,"

R = "the next vehicle to pass will be travelling to the right."

With D₁ and D₂ as before:

D₁ = "500,000 out of 1 million vehicles were travelling to the left"

D₂ = "Additional to D₁, 9 out of 9 vehicles were travelling to the left"

Background information is given by

I₁ = "prior distribution over f is uniform, f is constant in the long run, and subsequent events are independent"

From this we have the first posterior,

P(L | D₁I₁) = 0.5

(2)

Now comes the crucial step: we must fully incorporate the information in D₁ into the background information,

I₂ = I₁D₁

(3)

Now, after obtaining D₂, the posterior for L becomes

P(L | D₂I₂) = P(L | I₂)×P(D₂ | LI₂) / [P(L | I₂)×P(D₂ | LI₂) + P(R | I₂)×P(D₂ | RI₂)]

(4)

When we pose and solve a problem that's explicitly about the frequency, f, of the data-generating process, we often don't pay much heed to the updating of I in equation (3), because it is mathematically irrelevant to the likelihood, P(D₂ | fD₁I₁). Assuming a particular value for the frequency renders all the information in D₁ powerless to influence this number. But if we are being strict, we must make this substitution, as I is necessarily defined as all the information we have relevant to the problem, apart from the current batch of data (D₂, in this case).

The priors in equation (4) are equal, so they cancel out. The likelihood is not hard to calculate. Remember what it means: the probability to see 9 out of 9 travelling to the left, given that 500,000 out of 1,000,000 were travelling to the left previously, and given that the next one will be travelling to the left. That is, what is the probability to have 9 out of 9 travelling to the left, given that in total n = 500,001 out of N = 1,000,001 travel to the left? We can use the same procedure as before to calculate the probability distribution over the possible frequencies, P(f | LI₂). For any given frequency, the assumption of independence in I₁ means that the only information we have about the probability for any given vehicle's direction is this frequency, and so the probability and the frequency have the same numerical value. This means that for any assumed frequency, the probability to have 9 in a row going to the left is f⁹, from the product rule. But since we have a probability distribution over a range of frequencies, we take the expectation by integrating over the product P(f | LI₂)×f⁹.

We can do that integration numerically, and we get a small number: 0.00195321. The counterpart of the likelihood, the one conditioned on R rather than L, is obtained by an analogous process. It produces another small, but very similar number: 0.00195318. From these numbers, the ratio in equation (4) gives 0.5000045, which does not radically disagree with the 0.5 we already had. (For comparison, if N = n = 9 was the complete data set, the result would be P(L) = 0.9091, as you can easily confirm.) Thus, when we do the calculation properly, a sample of only 9 makes almost no difference after a sample of 1 million, and peace can be restored in the cosmos.

Using the same procedure, we can confirm also that combining D₁ and D₂ into a single data set, with N = 1,000,009 and n = 500,009, gives precisely the same outcome for P(L | DI), 0.5000045, exactly as it must.
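For anyone who wants to reproduce these figures, here is a minimal sketch using exact rational arithmetic instead of numerical integration. It relies on the fact that, with a flat prior, the expectation of f⁹ under the posterior can be built up one vehicle at a time with the product rule, each factor being a rule-of-succession probability of the form (n + 1)/(N + 2). The function names are illustrative, and this is one way to arrive at the numbers, not necessarily the procedure used to produce them originally.

```python
from fractions import Fraction


def prob_next_left(n, N):
    """Rule of succession: P(next goes left | n of N went left, flat prior)."""
    return Fraction(n + 1, N + 2)


def prob_run_of_lefts(k, n, N):
    """P(next k vehicles all go left | n of N went left, flat prior).

    Equals the expectation of f**k under the posterior over f; here it is
    built with the product rule, one vehicle at a time.
    """
    p = Fraction(1)
    for i in range(k):
        p *= prob_next_left(n + i, N + i)
    return p


n1, N1, k = 500_000, 1_000_000, 9

prior_L = prob_next_left(n1, N1)                # P(L | I2)
prior_R = 1 - prior_L                           # P(R | I2)
like_L = prob_run_of_lefts(k, n1 + 1, N1 + 1)   # P(D2 | L I2)
like_R = prob_run_of_lefts(k, n1, N1 + 1)       # P(D2 | R I2)

# Equation (4): Bayes' theorem over the two propositions L and R.
posterior_L = prior_L * like_L / (prior_L * like_L + prior_R * like_R)

# D1 and D2 processed as a single data set.
batch_L = prob_next_left(n1 + k, N1 + k)

print(f"P(D2 | L I2)   = {float(like_L):.8f}")
print(f"P(D2 | R I2)   = {float(like_R):.8f}")
print(f"P(L | D2 I2)   = {float(posterior_L):.7f}")
print(f"single batch   = {float(batch_L):.7f}")
print(f"9 of 9 alone   = {float(prob_next_left(9, 9)):.4f}")
print("sequential equals batch exactly:", posterior_L == batch_L)
```

Because the arithmetic is exact, the last line tests true equality between the two-step and the single-batch routes, rather than agreement to some number of decimal places.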


Saturday, October 5, 2013
