Maximum Entropy: Fly papers and photon detectors: another base-rate fallacy

Tuesday, April 17, 2012

Fly papers and photon detectors: another base-rate fallacy

Here is my version of a simple riddle posed by Jaynes in ‘Probability Theory: The Logic of Science’:

A constant Poissonian light source emits photons at an average rate of 100 per second. The photons strike a detector with 10% efficiency. In one particular second, the detector registers 15 counts. What is the expectation for the number of photons actually emitted in that second? (Assume a detector with no dark counts and zero dead time.)

Lets denote the number of emissions by n, the average number of emissions (the source strength) by s, the number of detected photons by c, and the detector efficiency by ϕ.

In case some of the terminology in the question is unfamiliar, ‘Poissonian’ just means that the number of events in any second conforms to a Poisson probability distribution with a mean of s:

P(n \| s) = e^-s	sⁿ
	n!

(1)

Make an effort to work out your best answer to the above question before reading on. Consider writing it down - feel free to record it in the comments below!

Many textbooks on statistics specify 'maximum likelihood' as the state of the art for parameter estimation, so lets see what result it produces here. The likelihood function for a particular model is the probability to obtain the observed data, assuming that model to be true:

L(M) = P(D | M, I)

(2)

where D is the set of data, M is the model in question, and I is the prior information available.

In the cases where the prior information effectively specifies the form of the model, but not the model parameters, then denoting the parameters of the model by θ, this equation becomes

L(θ) = P(D | θ, I)

(3)

and we have the problem of determining the best estimate of the actual value of the θ from the observed data, which is often assumed to be the value at which the likelihood function is maximized:

^
θ	= argmax(L(θ \| D))

(4)

In the present case, θ is a single parameter, the number of emitted photons, n, and D is the fact that we have observed c photocounts. P(D | θI) becomes P(c|nϕs). But if n is known, then knowledge of s is irrelevant, so P(c|nϕs) = P(c|nϕ), which is given by the binomial distribution:

P(c \| nφ) =	n!
	c!(n - c)!

× φ^c (1-φ)^n-c

(5)

For ϕ = 0.1 and c = 15, then it is easy enough to verify that the value of n that maximizes this equation is 150, which is just what many of us would expect.

Lets examine this solution, though, by considering a different problem. Suppose I am interested in the number of flies in Utrecht, the city I live in. I hang a fly paper in my kitchen in order to gather some data (don’t worry, lab experiments on flies don’t require ethical approval). After one day, there are 3 flies trapped. Unfortunately, I don’t know the efficiency of my detector (the fly paper), but suppose I have good reasons to accept that the average number trapped is proportional to the total number in the population. On day 2, I count 6 new casualties on the fly paper, and announce the astonishing result that in one day, the number of flies in Utrecht has doubled. You’d hopefully respond that I am bonkers - clearly the experiment is nowhere near sensitive enough to determine such a thing. Yet this is essentially the result that we have declared above, under the advice of those who champion the method of maximum likelihood for all parameter estimation problems.

The following data are the result of a Monte-Carlo simulation of the emitter and detector described in the original question:

These data represent 32,000 seconds of simulated emissions. First, I generated 32000 random numbers, uniformly distributed between 0 and 1. Then, for each of these, I scanned along the Poisson distribution in equation (1), until the integrated probability first exceeded that random number. This gave the number of emitted photons, n, in each simulated second. Then, for each of these n, I generated another n uniform random numbers. For any of these random numbers less than or equal to the detector efficiency, ϕ, one detected photon was added to the count for that second. Each green dot in the graph above, therefore, displays the emissions and counts for a single second.

The data confirm that 150 emissions in a single second is an extremely rare event (didn’t occur once in the experiment), while 15 photo-counts occurred on very many (more than a thousand) occasions. From the graph above, it appears that the ‘middle’ of the distribution for 15 counts corresponds to somewhere only slightly above 100 emissions. We can get a better view by taking a slice through this graph along 15 counts, and histogramming the numbers of emissions that were responsible for those counts:

The peak of this histogram is at 105 emitted photons, which is also the average number (rounded) of emitted photons resulting in 15 counts. This is evidently our approximate solution.

The exact solution follows from Bayes’ theorem, and has been provided by Jaynes. The general statement of Bayes’ theorem, again, is

P(A \| BI) = P(A \| I)	P(B \| AI)
	P(B \| I)

(6)

Filling in the notation I have defined above, this becomes

P(n \| φcs) = P(n \| φs)	P(c \| nφs)
	P(c \| φs)

(7)

As noted, if we know the exact number of photons emitted, then no amount of information about the source strength can improve our ability to estimate the expected number of counts. Also, the efficiency of the detector is irrelevant to the number of photons emitted, so

P(n \| φcs) = P(n \| s)	P(c \| nφ)
	P(c \| φs)

(8)

Two of these distributions are specified above, in equations (1) and (5).

P(n|s) and P(c|nϕ) can be combined to derive the probability distribution for the number of counts given the source strength, which, in line with intuition, is again Poissonian:

P(c \| φs) = e^-sφ	(sφ)^c
	c!

(9)

Combining all these, and canceling terms, we get

P(n \| φcs) =	e^-s(1-φ) [ s(1-φ) ]^{n - c}
	(n-c)!

(10)

This is, yet again, a Poisson distribution, with mean, s(1-ϕ), only shifted up by c, since with no dark counts, the number emitted can not be less than the number detected. The mean of this distribution is, therefore:

〈n〉 = c + s(1 - φ)

(11)

This gives 15 + 100×0.9 = 105 expected emissions, in the case of 15 detected photons. Counter intuitively, the 5 additional counts, above the expected 10, are only evidence for 5 more photons emitted, above the average.

The figure obtained by maximum likelihood can be seen to be another instance of the base-rate fallacy: 150 is equal to the maximum of P(c|nϕs), as n is varied. This is the essence of what maximum likelihood parameter estimation does - given a model with parameter θ, the maximum likelihood estimate for θ is given by argmax[P(data | θ)]. In this case, it takes no account of the probability distribution for the number of emissions, n. What we really needed was the expectation (approximately the maximum) of P(n | cϕs), which is the expectation of P(n | s)×P(c | nϕs). As with the fly paper, the efficiency of the photo detector, (at 10%) was too low to permit us to claim anything close to a 50% excess over the average 100 emissions per second.

Under the right circumstances, maximum likelihood can do a very good job of parameter estimation, but we see here a situation where it fails miserably. As with all rational procedures, its success is determined entirely by its capability to mimic the results of Bayes’ theorem. The answer it gave here differed from Bayes by an order of magnitude, and the reason for that failure was the important information contained in the base rates, P(n|s). In a future post, I’ll discuss another kind of problem where maximum likelihood can give a badly flawed impression, even when the prior distribution conveys little or no information.

6 comments:

UnknownJanuary 5, 2013 at 5:58 PM
Hi Tom, this is a great post and I had fun working through the problem. I ended up plotting the log of the posterior distribution against plausible values of n. Here's the python code I used: https://gist.github.com/4464308. And here's the output with a graph: http://goo.gl/GgFVY.

It's really cool that the posterior is just another Poisson distribution — I wouldn't have noticed that if you hadn't pointed it out. I'm not sure I really understand the intuition behind why this is so.

I do think your description of the technique of maximum likelihood is a bit of a strawman, though. I don't think any statistician, frequentist or no, would make that mistake on a problem this straightforward. What I learned as "maximum likelihood" is exactly what I ended up doing: use Bayes to write down P(n|phi,c,s), ignore anything that's not going to change where the maximum of the curve will be (the 0.1^c term, e.g.), take the log so it's easier to calculate, and then run some numbers through the resulting formula and look for the maximum, which is easy since there's only one parameter.

I'd be kind of curious to see how a frequentist would attack this problem.
ReplyDelete
Replies
RichJuly 11, 2013 at 4:01 PM
Agree with the previous comment; an interesting post, and I've enjoyed browsing through the rest of your writing (got here by way of "Probably Overthinking It", which is why I'm so late to the party).

So, putting my frequentist hat on, I'm concerned that the justification you use for eliminating the Poisson term from the likelihood function when calculating the MLE might not be quite correct, and this might be why you end up giving the ML method a failing grade, when it doesn't deserve it, at least this time :)

If you write out the joint likelihood, you have L(c|n,p,s) = p(c|n,p)*p(n|s). I completely agree that -- if you fix a value for n -- p(n|s) becomes constant and no longer has anything interesting to contribute to the optimization. However, in that case, n must also be fixed in p(c|n,p), since it is the same term, meaning you can no longer maximize your overall likelihood with respect to it; the only term that you could estimate in that case (assuming c is also observed) would be p, if you didn't already know it.

In other words, p is independent of s given n, but n can never be independent of s since both n and s appear together in the p(n|s) factor. This means that you can't drop the Poisson prior from your estimation problem when you're calculating the MLE for n.

I was too lazy to do it symbolically, so I just brute-forced the analysis. If you retain the Poisson term, both n=104 and n=105 have the same likelihood. This seems consistent with your Bayesian analysis, and certainly nowhere near 150 (which, if an analysis did lead to that result, would I agree be plainly silly).

There's certainly several good reasons to still prefer a Bayesian approach (e.g. not needing to invoke asymptotic normality results to obtain a good credible set for n), and certainly other places where MLEs give odd results, but I believe that the MLE for n in this problem is split between 104 and 105 rather than 150, since dropping the Poisson term is not permissible given the independence structure of the problem.

The python code I used to calculate the (log) MLE (less a few multiplicative constant terms) is below, if you are interested, or if you can see where I may have made a mistake.

>>> import scipy.misc as sm
>>> import numpy as np
>>> lik = lambda x:(1. / sm.factorial(x-15.)) * (0.9 ** (x - 15.)) * (100.**x)
>>> print sorted([(x, np.log(lik(x))) for x in range(50,120)], key = lambda x:x[1])
[ ... many omitted results ... (104, 155.90778349933589), (105, 155.90778349933589)]
ReplyDelete
Replies

Add comment