Here is my version of a simple riddle posed by Jaynes in ‘Probability
Theory: The Logic of Science’:
A constant Poissonian light source emits photons at an average rate of 100 per second. The photons strike a detector with 10% efficiency. In one particular second, the detector registers 15 counts. What is the expectation for the number of photons actually emitted in that second? (Assume a detector with no dark counts and zero dead time.)
Let's denote the
number of emissions by n, the average number of
emissions (the source strength) by s, the number
of detected photons by c, and the detector efficiency by ϕ.
In case some of
the terminology in the question is unfamiliar, ‘Poissonian’ just means that the
number of events in any second conforms to a Poisson probability distribution with a mean of s:

P(n|s) = s^{n} e^{−s} / n!  (1)
Make an effort to work out your best answer to the above question before reading on. Consider writing it down; feel free to record it in the comments below!
Many textbooks on statistics specify 'maximum likelihood' as the state of the art for parameter estimation, so let's see what result it produces here. The likelihood function for a particular model is the probability to obtain the observed data, assuming that model to be true:
L(M) = P(D | M, I)  (2)
where D is the
set of data, M is the model in question, and I is the prior information
available.
In the cases
where the prior information effectively specifies the form of the model, but
not the model parameters, then denoting the parameters of the model by θ, this equation becomes
L(θ) = P(D | θ, I)  (3)
and we have the
problem of determining the best estimate of the actual value of θ from the observed data, which is often assumed to be the value at
which the likelihood function is maximized:

θ̂ = argmax[L(θ)]  (4)
In the present
case, θ is a single parameter, the number of
emitted photons, n, and D is the fact that we have observed c photocounts. P(D | θ, I) becomes P(c|nϕs).
But if n is known, then knowledge of s is irrelevant, so P(c|nϕs)
= P(c|nϕ), which is given by the binomial distribution:
P(c|nϕ) = [n! / (c!(n−c)!)] × ϕ^{c} (1−ϕ)^{n−c}  (5)
For ϕ = 0.1 and c = 15, it is easy enough to verify that the value of n that maximizes this equation is 150, which is just what many of us would expect.
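This maximization is easy to check numerically; here is a minimal sketch with scipy (the grid cutoff of 400 is an arbitrary choice, well past the region of interest):

```python
import numpy as np
from scipy.stats import binom

c, phi = 15, 0.1
ns = np.arange(c, 400)        # candidate values of n (cutoff is arbitrary)
lik = binom.pmf(c, ns, phi)   # equation (5): P(c | n, phi)
n_ml = int(ns[np.argmax(lik)])
# Because c/phi = 150 is an integer, the likelihood is exactly tied at
# n = 149 and n = 150, so argmax may land on either of the two.
print(n_ml)
```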
Let's examine this solution,
though, by considering a different problem. Suppose I am interested in the
number of flies in Utrecht, the city I live in. I hang a fly paper in my
kitchen in order to gather some data (don’t worry, lab experiments on flies
don’t require ethical approval). After one day, there are 3 flies trapped.
Unfortunately, I don’t know the efficiency of my detector (the fly paper), but
suppose I have good reasons to accept that the average number trapped is
proportional to the total number in the population. On day 2, I count 6 new
casualties on the fly paper, and announce the astonishing result that in one
day, the number of flies in Utrecht has doubled. You’d hopefully respond that I am
bonkers; clearly the experiment is nowhere near sensitive enough to determine
such a thing. Yet this is essentially the result that we have declared above,
under the advice of those who champion the method of maximum likelihood for all
parameter estimation problems.
The following data are the result of a Monte Carlo simulation of the emitter and detector described in the original question:
These data represent 32,000 seconds of simulated emissions. First, I generated 32,000 random numbers, uniformly distributed between 0 and 1. Then, for each of these, I scanned along the Poisson distribution in equation (1), until the integrated probability first exceeded that random number. This gave the number of emitted photons, n, in each simulated second. Then, for each of these n, I generated another n uniform random numbers. For any of these random numbers less than or equal to the detector efficiency, ϕ, one detected photon was added to the count for that second. Each green dot in the graph above, therefore, displays the emissions and counts for a single second.
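The same simulation can be sketched much more compactly with numpy's built-in samplers (a direct Poisson draw replacing the inverse-CDF scan described above; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)     # arbitrary seed
s, phi, seconds = 100, 0.1, 32000
n = rng.poisson(s, size=seconds)    # photons emitted in each simulated second
c = rng.binomial(n, phi)            # each photon is detected with probability phi
mean_n_given_15 = n[c == 15].mean()
print((c == 15).sum(), mean_n_given_15)  # roughly 1000 such seconds; mean near 105
```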
The data confirm
that 150 emissions in a single second is an extremely rare event (didn’t occur
once in the experiment), while 15 photocounts occurred on very many (more than
a thousand) occasions. From the graph above, it appears that the ‘middle’ of
the distribution for 15 counts corresponds to somewhere only slightly above 100
emissions. We can get a better view by taking a slice through this graph along
15 counts, and histogramming the numbers of emissions that were responsible for
those counts:
The peak of this
histogram is at 105 emitted photons, which is also the average number (rounded) of emitted photons
resulting in 15 counts. This is evidently our approximate solution.
The exact solution follows from Bayes’ theorem, and has been provided by Jaynes. The
general statement of Bayes’ theorem, again, is

P(M|D,I) = P(M|I) P(D|M,I) / P(D|I)  (6)
Filling in the
notation I have defined above, this becomes

P(n|cϕs) = P(n|ϕs) P(c|nϕs) / P(c|ϕs)  (7)
As noted, if we
know the exact number of photons emitted, then no amount of information about
the source strength can improve our ability to estimate the expected number of
counts. Also, the efficiency of the detector is irrelevant to the number of
photons emitted, so

P(n|cϕs) = P(n|s) P(c|nϕ) / P(c|ϕs)  (8)
Two of these
distributions are specified above, in equations (1) and (5).
P(n|s) and P(c|nϕ)
can be combined to derive the probability distribution for the number of counts
given the source strength, which, in line with intuition, is again Poissonian:

P(c|ϕs) = (sϕ)^{c} e^{−sϕ} / c!  (9)
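Equation (9) is the Poisson 'thinning' property: summing P(n|s)P(c|nϕ) over all possible n collapses to a single Poisson with mean sϕ. A quick numerical confirmation (truncating the infinite sum at n = 500, far into the negligible tail):

```python
import numpy as np
from scipy.stats import binom, poisson

s, phi, c = 100.0, 0.1, 15
ns = np.arange(c, 500)  # n >= c; terms beyond 500 are vanishingly small
# sum over n of P(n|s) * P(c|n,phi)
total = np.sum(poisson.pmf(ns, s) * binom.pmf(c, ns, phi))
print(np.isclose(total, poisson.pmf(c, s * phi)))  # True: matches equation (9)
```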
Combining all
these, and canceling terms, we get

P(n|cϕs) = [s(1−ϕ)]^{n−c} e^{−s(1−ϕ)} / (n−c)!  (10)
This is, yet again, a Poisson distribution with mean s(1−ϕ), only shifted up by c, since with no dark counts the number emitted cannot be less than the number detected. The mean of this distribution is, therefore:
〈n〉 = c + s(1 − ϕ)  (11)
This gives 15 + 100×0.9 = 105 expected emissions, in the case of 15 detected photons. Counterintuitively, the 5 additional counts, above the expected 10, are evidence for only 5 more photons emitted above the average.
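Equations (10) and (11) can also be checked numerically: normalizing the product P(n|s)P(c|nϕ) over a grid of n reproduces the shifted Poisson exactly, and its mean lands at 105 (the grid cutoff of 500 is an arbitrary choice):

```python
import numpy as np
from scipy.stats import binom, poisson

s, phi, c = 100.0, 0.1, 15
ns = np.arange(c, 500)
post = poisson.pmf(ns, s) * binom.pmf(c, ns, phi)  # numerator of equation (8)
post /= post.sum()                                 # normalize
shifted = poisson.pmf(ns - c, s * (1 - phi))       # equation (10)
print(np.allclose(post, shifted))                  # True
print(np.sum(ns * post))                           # ≈ 105, as in equation (11)
```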
The figure obtained by maximum likelihood can be seen to be another instance of the base-rate fallacy: 150 is equal to the maximum of P(c|nϕs) as n is varied. This is the essence of what maximum likelihood parameter estimation does: given a model with parameter θ, the maximum likelihood estimate for θ is given by argmax[P(data | θ)]. In this case, it takes no account of the probability distribution for the number of emissions, n. What we really needed was the expectation (approximately the maximum) of P(n|cϕs), which is the expectation of P(n|s)×P(c|nϕs). As with the fly paper, the efficiency of the photodetector (at 10%) was too low to permit us to claim anything close to a 50% excess over the average 100 emissions per second.
Under the right circumstances, maximum
likelihood can do a very good job of parameter estimation, but we see here a
situation where it fails miserably. As with all rational procedures, its
success is determined entirely by its capability to mimic the results of Bayes’ theorem. The
answer it gave here differed from Bayes by an order of magnitude, and the
reason for that failure was the important information contained in the base
rates, P(n|s). In a future post, I’ll discuss another kind of problem where
maximum likelihood can give a badly flawed impression, even when the prior
distribution conveys little or no information.
Hi Tom, this is a great post and I had fun working through the problem. I ended up plotting the log of the posterior distribution against plausible values of n. Here's the python code I used: https://gist.github.com/4464308. And here's the output with a graph: http://goo.gl/GgFVY.
It's really cool that the posterior is just another Poisson distribution — I wouldn't have noticed that if you hadn't pointed it out. I'm not sure I really understand the intuition behind why this is so.
I do think your description of the technique of maximum likelihood is a bit of a strawman, though. I don't think any statistician, frequentist or no, would make that mistake on a problem this straightforward. What I learned as "maximum likelihood" is exactly what I ended up doing: use Bayes to write down P(n|phi,c,s), ignore anything that's not going to change where the maximum of the curve will be (the 0.1^c term, e.g.), take the log so it's easier to calculate, and then run some numbers through the resulting formula and look for the maximum, which is easy since there's only one parameter.
I'd be kind of curious to see how a frequentist would attack this problem.
Thanks, Mike, I'm glad to see you had some fun with the numbers.
It's sort of a straw man, in that clearly no 'good' statistician of any persuasion is going to make that mistake in this kind of ultra-extreme case, but the fact remains that several highly respected texts on statistics claim quite explicitly that ML is the only technique required for parameter estimation.
Even if nobody really believes that, it still makes clear the weirdness of a methodology that needs to be patched up, ad hoc, when common sense points out that its rules won't work. And what about intermediate cases, where the breakdown is not so obvious? Better to start with and understand the true logic, then apply the approximations when suitable, rather than assume the approximations always work, with tacit acceptance that sometimes the consequences will be ridiculous.
Beyond the behaviour of reasonably competent stats experts, there is also the fact that many experimental scientists actually do make errors like this in the lab. This is not helped by a general culture that is too uncritical of certain methodological traditions.
Agree with the previous comment; an interesting post, and I've enjoyed browsing through the rest of your writing (got here by way of "Probably Overthinking It", which is why I'm so late to the party).
So, putting my frequentist hat on, I'm concerned that the justification you use for eliminating the Poisson term from the likelihood function when calculating the MLE might not be quite correct, and this might be why you end up giving the ML method a failing grade, when it doesn't deserve it, at least this time :)
If you write out the joint likelihood, you have L(c|n,p,s) = p(c|n,p)*p(n|s). I completely agree that, if you fix a value for n, p(n|s) becomes constant and no longer has anything interesting to contribute to the optimization. However, in that case, n must also be fixed in p(c|n,p), since it is the same term, meaning you can no longer maximize your overall likelihood with respect to it; the only term that you could estimate in that case (assuming c is also observed) would be p, if you didn't already know it.
In other words, p is independent of s given n, but n can never be independent of s since both n and s appear together in the p(n|s) factor. This means that you can't drop the Poisson prior from your estimation problem when you're calculating the MLE for n.
I was too lazy to do it symbolically, so I just brute-forced the analysis. If you retain the Poisson term, both n=104 and n=105 have the same likelihood. This seems consistent with your Bayesian analysis, and certainly nowhere near 150 (which, if an analysis did lead to that result, would I agree be plainly silly).
There's certainly several good reasons to still prefer a Bayesian approach (e.g. not needing to invoke asymptotic normality results to obtain a good credible set for n), and certainly other places where MLEs give odd results, but I believe that the MLE for n in this problem is split between 104 and 105 rather than 150, since dropping the Poisson term is not permissible given the independence structure of the problem.
The python code I used to calculate the (log) MLE (less a few multiplicative constant terms) is below, if you are interested, or if you can see where I may have made a mistake.
>>> import scipy.special as sm
>>> import numpy as np
>>> lik = lambda x: (1. / sm.factorial(x - 15.)) * (0.9 ** (x - 15.)) * (100. ** x)
>>> print(sorted([(x, np.log(lik(x))) for x in range(50, 120)], key=lambda x: x[1]))
[ ... many omitted results ... (104, 155.90778349933589), (105, 155.90778349933589)]
Thanks for dropping by.
I am pleased when people want to interact with the blog, but what you have done is the Bayesian calculation (look at eq 8). There are a few errors in your reasoning, but your factorization of the likelihood function is the crucial one. The quickest way to see this is that if your algebra is correct, then the Bayesian calculation has to be wrong, as I would be missing another factor of P(n|s) (which would be very weird indeed).
Cheers for the response. Forgive me if I walk through this in some detail, it's been a few years since I've studied this, so it's more for my own benefit than anything else :)
You can actually take your equation 7 and turn it into my factorization. Starting with your equation 7, using slightly different notation since I can't figure out how to write a phi character:
p(n|p,c,s) = p(n|p,s) * p(c|n,p,s) / p(c|s,p)
and multiply through by p(c|s,p) to get:
p(n|p,c,s)*p(c|s,p) = p(n|s,p) * p(c|n,p,s)
By the rules of conditional probability, p(a,b) = p(a|b)*p(b), we can combine the left-hand side to get:
p(n,c|s,p) = p(n|s,p) * p(c|n,p,s)
at which point we note that n is completely independent of p (since our emitter doesn't care about the detector) and that c is independent of s conditional on n (for the reasons you give), and we get my factorization of the likelihood:
p(n,c|s,p) = p(c|n,p)*p(n|s)
correct? I don't think there's anything particularly frequentist or Bayesian at this point, it's just some algebra. At this point, c is fixed (or observed, anyway), as are s and p, so our likelihood for n becomes L(n|c,s,p) = p(c|n,p)*p(n|s)
Where I believe you go astray in your frequentist analysis is when you use the fact that the random variable c is conditionally independent of the parameter s to treat the factor p(c|n,p) as though it were completely independent of the p(n|s) factor; n is precisely the term you are trying to 'twiddle' to maximize your likelihood. Different values of n produce different values of both p(n|s) and p(c|n,p), and so you cannot remove p(n|s) from your optimization without biasing your answer.
If you want to argue that incorporating the prior into the likelihood is what makes it a "Bayesian" analysis, then I think we're off to a different (semantic) discussion, which probably won't be terribly fruitful :)
In any event, even with my frequentist hat on, I completely agree that it would be incredibly foolish to ignore the prior information on n given by p(ns); I'd just go one step farther and say that a valid frequentist analysis that pays attention to the dependence structure of the problem *can't* ignore it.
The likelihood is indeed a function of n, but it isn't P(n|csp). It is the probability to have observed c counts given n emissions (along with everything else we know), P(c|nsp).