Friday, December 21, 2012


I often marvel at the achievements of the early scientific pioneers, the Galileos, the Newtons, and the like. Their degree of understanding would have been extraordinary under any circumstances, but as if it wasn't hard enough, they had almost no technical vocabulary to build their ideas from. They had to develop the vocabulary themselves. How did they even know what to think, without a theoretical framework already in place? Amazing. But at other times, I wonder if their situation was not one of incredible intellectual liberty, almost entirely unchained by technical jargon and untrammelled by rigorous notation. Perhaps it was a slight advantage for them, not to have those vast regions of concept space effectively cut off from possible exploration by the focusing effects of a mature scientific language. Standardized scientific language may or may not limit the ease with which novel ideas are explored, but I think there are strong grounds for believing that jargon can actively inhibit comprehension of communicated ideas, as I now want to explore.

It's certainly true that beyond a certain elementary point, scientific progress, or any kind of intellectual advance, is severely hindered without the existence of a robust technical vocabulary, but we should not conflate the proliferation of jargon with the advance of understanding. Standardized terminology is vital for ‘high-level’ thought and debate, but all too often, we seem to treat this terminology as an indicator of technical progress or sophisticated thought, when it is the content of ideas we should be examining for such indications.

There is a common con trick, one that one is almost expected to use in order to advance oneself, which consists of enhancing one's credibility by expanding the number of words one uses and the complexity of the phrases they are fitted into. It seems as though one is trying to create the illusion of intellectual rigour and content, and perhaps it's not a bad guess to suggest that jargon proliferates most wildly where intellectual rigour is least supported by the content of the ideas being expressed. Richard Dawkins relates somewhere (possibly in ‘Unweaving the Rainbow’) a story of a post-modernist philosopher who gave a talk, and who, in reply to a questioner who said that he wasn't able to understand some point, said ‘oh, thank you very much.’ This suggests that the content of the idea was not important; otherwise the speaker would certainly have been unhappy that it was not understandable. Instead, it was the level of difficulty of the language that gave the talk its merit.

It has been shown experimentally that adding vacuous additional words can have a powerful psychological effect. Ellen Langer's famous study1, for example, consisted of approaching people in the middle of a photocopying job, and asking to butt in. If the experimenter (blinded to the purpose of the experiment) said “Excuse me, I have 5 pages. May I use the xerox machine?,” a modest majority of people let her (60%), but if she said “Excuse me, I have 5 pages. May I use the xerox machine, because I have to make copies?” the number of people persuaded to step aside was much greater (93%). This shows clearly how words that add zero information can greatly enhance credibility - an effect that is exploited much too often, and not just by charmers, business people, sports commentators, and post-modernists, but by scientists as well. The other day I was reading an academic article on hyperspectral imaging, a phrase that made me uneasy - I wondered what it was - until I realised that ‘hyperspectral imaging’ is exactly the same thing as, yup, ‘spectral imaging.’

Even if we have excised the redundancy from jargon-rich language, I often suspect that technical jargon can actually impede understanding. Just as unnecessary multiplicity of terms can enhance credibility at the photocopier, I suspect that recognition of familiar jargon gives one a feeling of ease which is too often mistaken for comprehension. You can test this with skilled scientists, by tinkering just a little bit with their beloved terminology, and observing their often blank or slightly panicked expressions. Once, when preparing a manuscript on the lifetimes of charged particles in semiconductors (the lifetime is similar to the half-life in radioactivity), in one place I substituted ‘lifetime’ with the phrase ‘survival time.’ When I showed the text to a close colleague (and a far better experimentalist than me) for comments, he was very uncomfortable with this tiny change. He seemed unable to relate this new phrase to his established technical lexicon.

You might think that this uneasiness is due to the need for each scientific term to be rigorously defined and used precisely, but it's not. Scientists mix up their jargon all the time quite freely, and without anybody batting an eyelid most of the time. I have read, for example, an extremely technical textbook in which an expert author copiously uses the term ‘cross-section’ (something related to a particle's interactability, and necessarily with units of area) in place of frequency, reaction probability, lifetime, mean free path, and a whole host of concepts, all somewhat related to the tendency of a pair of particles to bump into each other. Nobody minds (except for grumpy arses like me), simply because the word is familiar in the context.

Tversky and Kahneman have provided what I interpret as strong experimental evidence2 for my theory that jargon substitutes familiarity for comprehension. Two groups of study participants were asked to estimate a couple of probable outcomes from some imaginary health survey. One group was asked two questions in the form ‘what percentage of survey participants do you think had had heart attacks?’ and ‘what percentage of the survey participants were over 55 and had had heart attacks?’ By simple logic, the latter percentage can not be larger than the former, as ‘over 55 and has had a heart attack’ is a subset of ‘has had a heart attack,’ but 65% of subjects estimated the latter percentage as the larger. This is called the conjunction fallacy. Apparently, the greater detail, all parts of which sit comfortably together, creates a false sense of psychological coherence that messes with our ability to gauge probabilities properly.

The other group was asked the same questions but worded differently: ‘out of a hundred survey participants, how many do you think had had heart attacks, how many do you think were over 55 and had had heart attacks?’ Subjects in the second group turned out to be much less likely to commit the conjunction fallacy, only 25% this time. This seems to me to show that many people can comfortably use a technical word, such as ‘percentage’, almost every day, without ever forming a clear idea in their heads of what it means. If the people asked to think in terms of percentages had properly examined the meaning of the word, they would have necessarily found themselves answering exactly the same question as the subjects in the other group, and there should have been no difference between the two groups' abilities to reason correctly. Having this familiar word, ‘percentage,’ which everyone recognizes instantly, seems to stand in the way of a full comprehension of the question being asked. Over-reliance on technical jargon actually does impede understanding of technical concepts. This seems to be particularly true when familiar abstract ideas are not deliberately translated into the concrete realm.

When I read a piece of technical literature, I have a deliberate policy with regard to jargon that greatly enhances my comprehension. As with the ‘hyperspectral imaging’ example, redundancy upsets me, so I mentally remove it, allowing myself to focus on the actual (uncrowded) information content. In this case, I actually had to perform a quick internet search to convince myself that the ‘hyper’ bit really was just hype, before I could comfortably continue reading. Once all the unnecessary words have been removed, I typically reread each difficult or important sentence, with technical terms mentally replaced with synonyms. This forces me to think beyond the mere recognition of beguiling catchphrases, and coerces an explicit relation of the abstract to the real. It's only after I can make sense of the text with the jargon tinkered with in this way that I feel my understanding is at an acceptable level. And if I can't understand it after this exercise, then I have the advantage of knowing it.

For writers, I wonder if there is some profit to be had, in terms of depth of appreciation, by occasionally using terms that are unfamiliar in the given context. The odd wacky metaphor might be just the thing to fire up the reader's sparkle circuits.

[1] The Mindlessness of Ostensibly Thoughtful Action: The Role of "Placebic" Information in Interpersonal Interaction, Langer E., Blank A., and Chanowitz B., Journal of Personality and Social Psychology, 1978, Vol. 36, No. 6, Pages 635-42 (Sorry, the link is paywalled.)

[2] Extension versus intuitive reasoning: The conjunction fallacy in probability judgment, Tversky, A., and Kahneman, D., Psychological Review, 1983, Vol. 90, No. 4, Pages 293–315

Monday, December 10, 2012

The Regression Fallacy

Consider a teacher, keen to apply rational techniques to maximize the effectiveness of his didactic program. For some time, he has been gathering data on the outcomes of certain stimuli aimed at improving his pupils' performance. He has been punishing those students that under-perform in discrete tasks, and rewarding those that excel. The results show, unexpectedly, that performances were improved, on average, only for pupils that received punishments, while those that were rewarded did worse subsequently. The teacher is seriously considering desisting from future rewards, and continuing only with punishments. What would be your advice to him?

First note that a pupil's performance in any given task will have some significant random component. Luck in knowing a particular topic very well, mood at the time of execution of the task, degree of tiredness, pre-occupation with something else, or other haphazard effects could conspire to affect the student's performance. The second thing to note is this: if a random variable is sampled twice, and the first case is far from average, then the second is most likely to be closer to the average. Neglect of this simple fact is common, and is a special case of the regression fallacy.

If a pupil achieves an outstanding result in some test, then this is probably partly due to the quality of the student, and partly due to random factors. It is also most likely that the random factors contributed positively to the result. So a particular sample of this random variable has produced a result far into the right-hand tail of its probability distribution. The odds of a subsequent sample from the same probability distribution being lower than the first are clearly the ratio of the areas of the distribution, either side of the initial value. These odds are relatively high.

Imagine a random number generator that produces an integer from 1 to 100 inclusive, all equally probable. Suppose that in the first of two draws, the number 90 comes up. There are now 89 ways to get a smaller number on the second draw, and only 10 ways to get a larger number. In a very similar way, a student who performs very well in a test (and therefore receives a reward) has the odds stacked against them, if they hope to score better in the next test. The regression fallacy, in this case, is to assume that the administered reward is the cause of the eventual decline in performance.

The argument works exactly the same way for a poorly performing pupil - a really bad outcome is most likely, by chance alone, to be followed by an improvement. This tendency for extreme results to be followed by more ordinary results is called regression to the mean. It is not impossible that an intervention such as a punishment could cause improved future performance, but the automatic assumption that an observed improvement is caused by the administered punishment is fallacious.
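Before advising our teacher, then, we can check what pure chance predicts. The following toy simulation (all distributions invented for illustration) gives every pupil a fixed ability plus fresh random luck on each test, with rewards and punishments having no effect whatsoever:

```python
import random

random.seed(0)

# Invented model: each pupil has a fixed ability, and every test score
# is ability plus fresh random 'luck' (mood, topic, tiredness, ...).
abilities = [random.gauss(50, 10) for _ in range(10_000)]
test1 = [a + random.gauss(0, 10) for a in abilities]
test2 = [a + random.gauss(0, 10) for a in abilities]

# The teacher rewards top scorers and punishes bottom scorers on test 1.
rewarded = [i for i, s in enumerate(test1) if s > 70]
punished = [i for i, s in enumerate(test1) if s < 30]

def group_mean(indices, scores):
    return sum(scores[i] for i in indices) / len(indices)

# Neither intervention has any effect in this model, yet the rewarded
# group declines and the punished group improves on test 2.
print(group_mean(rewarded, test1), group_mean(rewarded, test2))
print(group_mean(punished, test1), group_mean(punished, test2))
```

Even though the interventions do nothing at all here, the rewarded group's average falls on the second test and the punished group's average rises - exactly the pattern in the teacher's data.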

Another common example comes from medical science. It's when my sinusitis is at its worst that I sleep with a freshly severed rabbit's foot under my pillow. I almost always feel better the next morning.

These before-after scenarios are a special case, as I mentioned. In general, all we need in order to see regression to the mean is to sample two correlated random variables. They may be from the same distribution (before-after), or they may be from different distributions.

If I tell you that Hans is 6 foot, 4 inches tall (193 cm), and ask you what you expect to be the most likely height of his fully-grown son, Ezekiel, you might correctly reason that fathers' heights and sons' heights are correlated. You might think, therefore, that the best guess for Ezekiel's height is also 6', 4", but you would be forgetting about regression to the mean - Ezekiel's height is actually most likely to be closer to average. This is because the correlation between fathers' and sons' heights is not perfect. On a scale where 0 represents no correlation whatsoever, and 1 indicates perfect correlation (knowledge of one necessarily fixes the other precisely1), the correlation coefficient for father-son stature is about 0.5. Reasoning informally, therefore, we might adjust our estimate for Ezekiel's still unknown height to half way between 6' 4'' and the population average. It turns out we'd be bang on with this estimate. (I don't mean, of course, that this is guaranteed to be the fellow's height, but that this would be our best possible guess, most likely to be near his actual height.)

The success of this simple revision follows from the normal (Gaussian) probability distribution for people's heights. The normal distribution can be applied in a great many circumstances, both for physical reasons (central limit theorem), and for the reason that we often lack any information required to assign a more complicated distribution (maximum entropy). If two variables, x and y, are each assigned a normal distribution (with respective means and standard deviations μi and σi), and provided that certain not-too-exclusive conditions are met (all linear combinations of x and y are also normal), then their joint distribution, P(xy | I), follows the bivariate normal distribution, which I won't type out, but follow the link if you'd like to see it. (As usual, I is our background information.) To get the conditional probability for y, given a known value of x, we can make use of the product rule, to give

P(y | xI) = P(xy | I) / P(x | I)     (1)

P(x | I) is the marginal distribution for x, just the familiar normal distribution for a single variable. If one goes through the slightly awkward algebra, it is found that for xy bivariate normal, y|x is also normally distributed2, with mean

μy|x = μy + ρ(σy/σx)(x − μx)     (2)

and standard deviation

σy|x = σy√(1 − ρ²)     (3)

where ρ is the correlation coefficient, given by

ρ = cov(x, y)/(σxσy)     (4)

Knowing this mean and standard deviation, we can now make a good estimate of how much regression to the mean to expect in any given situation. We can state our best guess for y and its error bar.

We can rearrange equation (2) to give

(μy|x − μy)/σy = ρ(x − μx)/σx     (5)

which says that the expected number of standard deviations between y - given our information about x - and μy (the mean of y when nothing is known about x) is the same as the number of standard deviations between the observed value of x and μx, only multiplied by the correlation coefficient. A bit of a mouthful, perhaps, but actually a fairly easy estimate to perform, even in an informal context. In fact, this is just what we did when estimating Ezekiel's height.
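These formulas are easy to check numerically. Below is a quick simulation of the father-son example; the population mean and standard deviation used (178 cm and 7 cm) and the joint-normal model are illustrative assumptions of mine, with ρ = 0.5 as quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) population statistics for fathers (x) and
# sons (y), in cm, with the correlation quoted in the text.
mu_x = mu_y = 178.0
sd_x = sd_y = 7.0
rho = 0.5
cov = [[sd_x**2, rho * sd_x * sd_y],
       [rho * sd_x * sd_y, sd_y**2]]

x, y = rng.multivariate_normal([mu_x, mu_y], cov, size=1_000_000).T

# Condition on fathers within half a centimetre of Hans's 193 cm.
sons = y[np.abs(x - 193.0) < 0.5]

# Conditional mean: regresses halfway back from 193 towards mu_y.
predicted_mean = mu_y + rho * (sd_y / sd_x) * (193.0 - mu_x)
# Conditional standard deviation: shrunk by a factor sqrt(1 - rho^2).
predicted_sd = sd_y * np.sqrt(1.0 - rho**2)

print(sons.mean(), predicted_mean)
print(sons.std(), predicted_sd)
```

With these assumed numbers, the conditional mean lands near 185.5 cm - halfway between Hans's height and the population average, just as the informal estimate suggested - and the simulated sons match both formulas closely.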

When reasoning informally, we can ask the simplified question, 'what value of y is roughly equally as improbable as the known value of x?' The human mind is actually not bad at performing such estimates. Next, we need to figure out how far that value is from the expected value of y (μy) and reduce that distance from μy by the fraction ρ, which again (with a bit of practice, perhaps), we can also estimate not too badly. In any case, an internet search will often be as complicated as any data-gathering exercise needed to calculate ρ more accurately.

Here are a few correlation coefficients for some familiar phenomena:

Life expectancy by nation (data from Wikipedia here and here):

life expectancy vs GDP per capita: 0.53
male life expectancy vs female life expectancy: 0.98

IQ scores (data from Wikipedia again):

same person tested twice: 0.95
identical twins raised together: 0.86
identical twins raised separately: 0.76
unrelated children raised together: 0.3

Amount of rainfall (in US) and frequency of 'flea' searches on Google: 0.87
(from Google Correlate)

With the above procedure for estimating y|x, we can get better, more rational estimates for a whole host of important things: how will our company perform this year, given our profits last year? How will our company perform if we hire this new manager, given how his previous company performed? What are my shares going to do next? What will the weather do tomorrow? How significant is this person's psychological assessment? Or criminal record?

In summary, when two partially correlated, random variables are sampled, there is a tendency for an extreme value of one to be accompanied by a less extreme value for the other. This is simple to the point of tautology, and is termed regression to the mean. The regression fallacy is a failure to account for this effect when making predictions, or investigating causation. One common form is the erroneous assumption of cause and effect in 'before-after' type experiments. Rabbits' feet do not cure sinusitis (in case you were still wondering). Another kind of fallacious reasoning is the failure to regress to the mean an estimate or prediction of one variable based on another known fact. For two normally distributed, correlated variables, the ratio of the expected distance (in standard deviations) of one variable from its marginal mean to the actual distance of the other from its mean is the correlation coefficient.

[1] Note: there are also cases where this condition holds for zero correlation, i.e. situations where y is completely determined by x, even though their correlation coefficient is zero. Lack of correlation can not be taken to imply independence, though if x and y are jointly (bivariate) normal, lack of correlation does strictly imply independence.

[2] I've been a little cavalier with the notation, but you can just read y|x as 'the value of y, given x.' Here, y is to be understood as a number, not a proposition.

Saturday, October 27, 2012

Parameter Estimation and the Relativity of Wrong

It's not enough for a theory of epistemology to consist of mathematically valid, yet totally abstract theorems. If the theory is to be taken seriously, there has to be a demonstrable correspondence between those theorems and the real world - the theory must make sense. It has to feel right.

This idea has been captured by Jaynes in his exposition of and development upon Cox's theorems1, in which the sum and product rules of probability theory (formerly, and often still, considered as axioms themselves) were rigorously derived. Jaynes built up the derivation from a small set of basic principles, which he called desiderata, rather than axioms, among which was the requirement quite simply for 'qualitative correspondence with common sense.' If you read Cox, I think it is clear that this is very much in line with his original reasoning.

Feeling right, and a strong overlap with intuition, are therefore crucial tests of the validity and consistency of our theory of probability (in fact, of any theory of probability), which, let me reiterate, is really the theory of all science. This is one of the reasons why a query I received recently from reader Yair is a great question, and a really important one, worthy of a full-blown post (this one) to explore. This excellent question was about one theory being considered closer to the truth than another, and was phrased in terms of the example of the shape of the Earth: if the Earth is neither flat nor spherical, where does the idea come from that one of these hypotheses is closer to the truth? They are both false, after all, within the Boolean logic of our Bayesian system. How can Bayes' theorem replicate this idea (of one false proposition being more correct than another false proposition), as any serious theory of science surely ought to?

Discussing this issue briefly in the comments following my previous post was an important lesson for me. The answer was something that I thought was obvious, but Yair's question reminded me that at some point in the past I had also considered the matter, and expended quite some effort getting to grips with it. Like so many things, it is only obvious after you have seen it. There is a story about the great mathematician, G. H. Hardy: Hardy was on stage at some conference of mathematics giving a talk. At some point, when he was saying 'it is trivial to show that.....,' he ground to a halt, stared at his notes for a moment, scratched his head, then walked off the stage absent-mindedly, into another room. Because of his greatness, and the respect the conference attendees had for him, they all waited patiently. He returned after half an hour, to say 'yes, it is trivial,' before continuing with the rest of his talk, exactly as planned.

Yair's question also reminded me of an excellent little something I read by Isaac Asimov concerning the exact same issue of the shape of the Earth, and degrees of wrongness. This piece is called 'The Relativity of Wrong2.' It consists of a reply to a correspondent who expressed the opinion that since all scientific theories are ultimately replaced by newer theories, then they are all demonstrably wrong, and since all theories are wrong, any claim of progress in science must be a fantasy. Asimov did a great job of demonstrating that this opinion is absurd, but he did not point out the specific fallacy committed by his correspondent. In a moment, I'll redress this minor shortcoming, but first, I'll give a bit of detail concerning the machinery with which Bayes' theorem sets about assessing the relative wrongness of a theory.

The technique we are concerned with is model comparison, which I have introduced already. To perform model comparison, however, we need to grasp parameter estimation, which I probably ought to have discussed in more detail before now.

Suppose we are fitting some curve through a series of measured data points, D (e.g. fitting a straight line or a circle to the outline of the Earth), then in general, our fitting model will involve some list of model parameters, which we'll call θ. If the model is represented by the proposition, M, and I represents our background information, as usual, then the probability for any given set of numerical values for the parameters, θ, is given by

P(θ | DMI) = P(D | θMI) P(θ | MI) / P(D | MI)     (1)

If the model has only one parameter, then this is simple to interpret: θ is just a single number. If the model has two parameters, then the probability distribution P(θ) ranges over two dimensions, and is still quite easy to visualize. For more parameters, we just add more dimensions - harder to visualize, but the maths doesn't change.

The term P(θ | MI) is the prior probability for some specific value of the model parameters, our degree of belief before the data were obtained. There are various ways we could arrive at this prior, including ignorance, measured frequencies, and a previous use of Bayes' theorem.

The term P(D | θMI), known as the likelihood function, needs to be calculated from some sampling distribution. I'll describe how this is most often done. Assuming the correctness of θMI, then we know exactly the path traversed by the model curve. Very naively, we'd think that each data point, di, in D must lie on this curve, but of course, there is some measurement error involved: the d's should be close to the model curve, but will not typically lie exactly on it. Small errors will be more probable than large errors. The probability for each di, therefore, is the probability associated with the discrepancy between the data point and the expected curve, di - y(xi), where y(x) is the value of the theoretical model curve at the relevant location. This difference, di - y(xi), is called a residual.

Very often, it will be highly justified to assume a Gaussian distribution for the sampling distribution of these errors. There are two reasons for this. One is that the actual frequencies of the errors are very often well approximated as Gaussian. This is due to the overlapping of numerous physical error mechanisms, and is explained by the central limit theorem (a central theorem about limits, rather than a theorem about central limits (whatever they might be)). This also explains why Francis Galton coined the term 'normal distribution' (which we ought to prefer over 'Gaussian,' as Gauss was not the discoverer (de Moivre discovered it, and Laplace popularized it, after finding a clever alternative derivation by Gauss (note to self: use fewer nested parentheses))).

The other reason the assumption of normality is legitimate is an obscure little idea called maximum entropy. If all we know about a distribution is its location and width (mean and standard deviation), then the only function we can use to describe it, without implicitly assuming more information than we have, is the Gaussian function.

Here's what the normal sampling distribution for the error at a single data point looks like:

P(di | θMI) = (1/√(2πσi²)) exp(−[di − y(xi)]² / 2σi²)     (2)

For all n d's in D, the total probability is just the product of all these terms given by equation (2), and since e^a × e^b = e^(a+b), then

P(D | θMI) = ∏i (1/√(2πσi²)) × exp(−Σi [di − y(xi)]² / 2σi²)     (3)

This, along with our priors, is all we typically need to perform Bayesian parameter estimation.

If we start from ignorance, or if for any other reason, the priors are uniform, then finding the most probable values for θ simply becomes a matter of maximizing the exponential function in equation (3), and the procedure reduces to the method of maximum likelihood. Because of the minus sign in the exponent, maximizing this function requires minimizing Σ[(di − y(xi))²/2σi²]. Furthermore, if the standard deviation, σ, is the same for all d, then we just have to minimize Σ[di − y(xi)]², which is the least squares method, beloved of physicists.
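As a concrete sketch of this chain from normal errors to least squares (the true line, noise level, and data below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: a straight line y = 2x + 1, plus normal errors with a
# common standard deviation sigma for every point.
sigma = 0.5
x = np.linspace(0.0, 10.0, 50)
d = 2.0 * x + 1.0 + rng.normal(0.0, sigma, x.size)

# With uniform priors and equal sigmas, the most probable parameters
# are those minimizing the sum of squared residuals; for a straight
# line, that is exactly what np.polyfit computes.
slope, intercept = np.polyfit(x, d, 1)
residuals = d - (slope * x + intercept)

print(slope, intercept)       # lands close to the true values 2 and 1
print(np.sum(residuals**2))   # the minimized sum of squares
```

The recovered slope and intercept sit close to the true values, with the residual scatter reflecting the assumed measurement error.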

Staying, for simplicity, with the assumption of a uniform prior, then it is clear that when comparing two different fitting models, the one that achieves smaller residuals will be the favoured one, according to probability theory. (See, for example, equation (4) in my article on the Ockham Factor.) P(D | θMI) is larger for the model with smaller residuals, as just described.

The whole point of this post was to figure out how to quantify closeness to truth. The residuals we've just been looking at are how wrong the model is: d, the data point is reality, y(x) is the model, the difference between them is the amount of wrongness of the model, which we wanted to quantify. And by Bayes' theorem, more wrongness leads to less probability, exactly as desired.

Within a system of only two models, ‘flat Earth’ vs. ‘spherical Earth,’ there is no scope for knowing that both models are actually false, but even working with such a system, we would probably keep in mind the strong potential for a third, more accurate model (e.g. the oblate spheroid that Asimov discussed). Such mindfulness is really a manifestation of a broader ‘supermodel.’ In the two-model system, ‘spherical Earth’ is closer to the truth because it manifests much smaller residuals. It is also closer to the truth than ‘flat Earth,’ even after the third model is introduced, because its residuals are still smaller than those for ‘flat Earth.’ ‘Oblate spheroid’ will be even closer to the truth in the three-model system, but spherical and flat will still have non-zero probability - strictly, we can not rule them out completely, thanks to the unavoidable measurement uncertainty, and so the statement that we know them to be false is not rigorously valid.
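A toy version of this comparison can be run numerically. Here I simulate noisy outline measurements of a circular ‘planet’ over a short arc, then fit both a straight line and a circle; the radius, noise level, and one-parameter circle fit (centre assumed known) are simplifying assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical survey of a planet's outline: the true shape is a circle
# of radius R km, measured with 5 km of noise over a limited arc.
R = 6371.0
theta = np.linspace(0.2, 0.6, 40)
x = R * np.cos(theta)
y = R * np.sin(theta) + rng.normal(0.0, 5.0, theta.size)

# 'Flat' model: the outline is a straight line.
m, c = np.polyfit(x, y, 1)
ss_flat = np.sum((y - (m * x + c))**2)

# 'Spherical' model: a circle centred at the origin, radius fitted.
R_hat = np.mean(np.hypot(x, y))
ss_round = np.sum((y - np.sqrt(R_hat**2 - x**2))**2)

# The less wrong model leaves far smaller residuals, and via the
# likelihood, it therefore receives far more probability.
print(ss_flat, ss_round)
```

The flat model's sum of squared residuals comes out vastly larger, and it is precisely this difference, fed through the likelihood, that drives its probability down relative to the round model.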

I promised earlier to identify the fallacy perpetrated by Asimov's misguided correspondent. I have already discussed it a few months ago. It is the mind-projection fallacy, the false assumption that aspects of our model of reality must be manifested in reality itself. In this case: if wrongness (relating to our knowledge of reality) can be graded, then so must 'true' and 'false' (relating to reality itself) also be graded. There are two ways to reason from here: (1) truth must be fuzzy, or (2) our idea of continuous degrees of wrong must be mistaken.

The idea that all models that are wrong are necessarily all equally wrong, as expressed in the letter with which poor Asimov was confronted, is fallacious in the extreme. Wrong does not have this black/white feature. 'Wrong' and 'false' are not the same. Of course, a wrong theory is also false, but if I'm walking to the shop, I'd rather find my location to be wrong by half a mile than by a hundred miles.

We can say that a theory is less wrong (i.e. produces smaller residuals), without implying that it is more true. 'True' and 'false' retain their black-and-white character, as I believe they must, but our knowledge of what is true is necessarily fuzzy. This is precisely why we use probabilities. As our theories get incrementally less wrong and closer to the truth, so the probabilities we are allowed to assign to them get larger.

There often seems to be a kind of bait-and-switch con trick going on with many of the world's least respectable 'philosophies.' The 'philosopher' makes a trivial but correct observation, then makes a subtle shift, often via the mind-projection fallacy, to produce an equivalent-looking statement that is both revolutionary, and utter garbage. In post-modern relativism (a popular movement in certain circles), we can see this shifting between right/wrong and true/false. The observation is made that all scientific theories are ultimately wrong, then hoping you won't notice the switch, the next thing you hear is that all theories are equally wrong. They can't seem to make their minds up, however, which side of the fallacy they are on, because the next thing you'll probably hear from them is that because knowledge is mutable, then so are the facts themselves: truth is relative to your point of view and the mood you happen to be in, science is nought but a social construct. Part of the joy of familiarity with scientific reasoning is the clarity of thought to see through the fog of such nonsense.

[1] 'Probability, Frequency, and Reasonable Expectation,' R. T. Cox, American Journal of Physics 1946, Vol. 14, No. 1, Pages 1-13. (Available here.)

[2] 'The Relativity of Wrong,' Isaac Asimov, The Skeptical Inquirer, Fall 1989, Vol. 14, No. 1, Pages 35-44. (Download the text here.) (And no, I don't find it spooky that both references are from the same volume and number.)

Monday, October 8, 2012

Total Bayesianism

If you've read even a small sample of the material I've posted so far, you'll recognize that one of my main points concerns the central importance of Bayes' theorem. You might think, though, that the most basic statement of this importance is something like "Bayes' theorem is the most logical method for all data analysis." This, for me though, falls far short of capturing the most general importance of Bayes' rule.

Bayes' theorem is more than just a method of data analysis, a means of crunching the numbers. It represents the rational basis for every aspect of scientific method. And since science is simply the methodical application of common sense, Bayes' theorem can be seen to be (together with decision theory) a good model for all rational behaviour. Indeed, it may be more appropriate to invert that, and say that your brain is a superbly adapted mechanism, evolved for the purpose of simulating the results of Bayes' theorem. 

Because I equate scientific method with all rational behaviour, I am no doubt opening myself up to the accusation of scientism [1], but my honest response is: so what? If I am more explicit than some about the necessary and universal validity of science, this is only because reason has led me in this direction. For example, P.Z. Myers, author of the Pharyngula blog (vastly better known than mine, but you probably knew that already), is one of the great contemporary advocates of scientific method - clear headed and craftsmanlike in the way he constructs his arguments - but in my evidently extreme view, even he can fall short, on occasion, of recognizing the full potential and scope of science. In one instance I recall, when the league of nitwits farted in Myers' general direction, and he himself stood accused of scientism, he deflected the accusation, claiming it was a mistake. My first thought, though, is "hold on, there's no mistake." Myers wrote: 
The charge of scientism is a common one, but it’s not right: show us a different, better path to knowledge and we’ll embrace it.
But how is one to show a better path to knowledge? In principle, it can not be done. If Mr. X claims that he can predict the future accurately by banging his head with a stone until visions appear, does that suffice as showing? Of course not, a rigorous scientific test is required. Now, if under the best possible tests, X's predictions appear to be perfectly accurate, any further inferences based on them are only rational to the extent that science is capable of furnishing us (formally, or informally) with a robust probability estimate that his statements represent the truth. Sure, we can use X's weird methodology, but we can only do so rationally, if we do so scientifically. X's head smashing trick will never be better than science (a sentence I did not anticipate writing).

To put it another way, X may yield true statements, but if we have no confidence in their truth, then they might as well be random. Science is the engine generating that confidence.

So, above I claimed that all scientific activity is ultimately driven by Bayes' theorem. Let's look at it again in all its glory:

P(H | DI)  =  P(H | I) × P(D | H I) / [ P(H | I) × P(D | H I)  +  P(H' | I) × P(D | H' I) ]     (1)

(As usual, H is a hypothesis we want to evaluate, D is some data, I is the background information, and H' means "H is not true.")
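For concreteness, here is equation (1) as a few lines of Python (a sketch; the numbers in the example at the end are invented for illustration):

```python
def posterior(p_h, p_d_given_h, p_d_given_not_h):
    """P(H | DI) from equation (1).

    p_h:             the prior, P(H | I)
    p_d_given_h:     the likelihood, P(D | H I)
    p_d_given_not_h: the likelihood under the alternative, P(D | H' I)
    """
    numerator = p_h * p_d_given_h
    evidence = numerator + (1.0 - p_h) * p_d_given_not_h
    return numerator / evidence

# With even prior odds, and data 4 times more probable under H
# than under H', the posterior rises to 0.8:
print(posterior(0.5, 0.8, 0.2))  # 0.8
```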

The goal of science, whether one accepts it or not, is to calculate the term on the left hand side of equation (1). Now, most, if not all, accepted elements of experimental design are actually adapted to manipulate the terms on the right hand side of this equation, in order to enhance the result. I'll illustrate with a few examples.

Firstly, and most obviously, the equation calls for data, D. We have to look at the world, in order to learn about it. We must perform experiments to probe nature's secrets. We can not make inferences about the real world by thought alone. (Some may appear to do this, but no living brain is completely devoid of stored experiences - the best philosophers are simply very efficient at applying Bayes' theorem (usually without knowing it) to produce powerful inferences from mundane and not very well controlled data. This is why philosophy should never be seen as lying outside empirical science.)

Secondly, the equation captures perfectly what we recognize as the rational course of action when evaluating a theory - we have to ask 'what should I expect to see if this theory is true? - what are its testable hypotheses?' In other words, what data can I make use of in order to calculate P(D | HI)?

Once we've figured out what kind of data we need, the next question is: how much data? Bayes' rule informs us: we need P(D | HI) to be as high as possible if H is true, and as low as possible if false. Let's look at a numerical example:

Suppose I know, on average, how tall some species of flower gets, when I grow the plants in my home. Suppose I suspect that picking off the aphids that live on these flowers will make the plants healthier, and cause them to grow taller. My crude hypothesis is that the relative frequency with which these specially treated flowers exceed the average height is more than 50%. My crude data set results from growing N flowers, applying the special treatment to all of them, and recording the number, x, that exceed the known average height.

To check whether P(D | HI) is high when H is true and low when H is false, we'll take the ratio

P(D | H I) / P(D | H' I)     (2)

If Hf says that the frequency with which the flowers exceed their average height is f, then P(D | HfI) (where D is the number of tall flowers, x, out of the total number grown, N) is given by the binomial distribution. But our real hypothesis, H, asserts that f is in the range 0.5 < f ≤ 1. This means we're going to have to sum up a whole load of P(D | HfI)s. We could do the integral exactly, but to avoid the algebra, let's treat the smoothly varying function like a staircase, and split the f-space into 50 parts: f = 0.51, 0.52, ..., 0.99, 1.00. To calculate P(D | H'I), we'll do the same, with f = 0.01, 0.02, ..., 0.50.

What we want, e.g. for P(D | HI), is P(D | [H0.51 + H0.52 + ....] I).

Generally, where all hypotheses involved are mutually exclusive, it can be shown (see appendix below) that,

P(D | [H1 + H2 + .....] I)  =  [ P(H1 | I) P(D | H1 I)  +  P(H2 | I) P(D | H2 I)  +  ..... ] / [ P(H1 | I)  +  P(H2 | I)  +  ..... ]     (3)

But we're starting from ignorance, so we'll take all the priors, P(Hf | I), to be the same. We'll also have the same number of them, 50, in both numerator and denominator, so when we take the desired ratio, all the priors will cancel out (as will the width, Δf = 0.01, of each of the intervals on our grid), and all we need to do is sum up P(D | Hf1I) + P(D | Hf2I) + ....., for each relevant range. Each term will come straight from the binomial distribution:

P(x | N, f)  =  [ N! / ( x! (N - x)! ) ] f^x (1 - f)^(N - x)

If we do that for, say, 10 test plants, with seven flowers growing beyond average height, then ratio (2) is 7.4. If we increase the number of trials, keeping the ratio of x to N constant, what will happen?

If we try N = 20, x = 14, not too surprisingly, ratio (2) improves. The result is now 22.2, an increase of 14.8. Furthermore, if we try N = 30, x = 21, ratio (2) increases again, but this time more quickly: now the ratio is 58.3, a further increase of 36.1.
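These numbers are easy to reproduce with a short Python sketch of the staircase sum (the binomial coefficient cancels in ratio (2), but is kept for clarity):

```python
from math import comb

def binom_pmf(x, N, f):
    """P(x | N, f): probability of x 'tall' flowers out of N."""
    return comb(N, x) * f**x * (1 - f)**(N - x)

def ratio(N, x):
    """Ratio (2), using the 50-point staircase over each range of f:
    f = 0.51, ..., 1.00 in the numerator, f = 0.01, ..., 0.50 below."""
    top = sum(binom_pmf(x, N, k / 100) for k in range(51, 101))
    bottom = sum(binom_pmf(x, N, k / 100) for k in range(1, 51))
    return top / bottom

for N, x in ((10, 7), (20, 14), (30, 21)):
    print(N, x, round(ratio(N, x), 1))  # close to 7.4, 22.2, 58.3
```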

So, to maximize the contrast between the hypotheses under test, H and H', what we should do is take as many measurements as practically possible. Something every scientist knows already, but something nonetheless demanded by Bayes' theorem.

How is our experimental design working out, then? Well, not that great so far, actually. Presumably the point of the experiment was to decide if removing the parasites from the flowers provided a mechanism enabling them to grow bigger, but all we have really shown is that they did grow bigger. We can see this by resolving, e.g., H' into a set of mutually exclusive and exhaustive (within some limited model) sub-hypotheses:

 H' = H'A1 + H'A2 + H'A3 + ......

where H' is, as before, 'removing aphids did not improve growth,' and some of the A's represent alternative causal agencies capable of effecting a change in growth. For example, A1 is the possibility that a difference in ambient temperature tended to make the plants grow differently. Let's look again at equation (3). This time instead of Hf's, we have all the H'Ai, but the principle is the same. Previously, the priors were all the same, but this time, we can exploit the fact that they need not be. We need to manipulate those priors so that the P(D | H'I) term in the denominator of Bayes' theorem, is always low if the number of tall plants in the experiment is large. We can do this by reducing the priors for some of the Ai corresponding to the alternate causal mechanisms. To achieve this, we'll introduce a radical improvement to our methodology: control.

Instead of relying on past data for plants not treated by having their aphids removed, we'll grow 2 sets of plants, treated identically in all respects, except the one that we are investigating with our study. The temperature will be the same for both groups of plants, so P(A1 | I) will be zero - there will be no difference in temperature to possibly affect the result. The same will happen to all (if we have really controlled for all confounding variables) the other Ai that corresponded to additional agencies offering explanations for taller plants.

This process of increasing the degree of control can, of course, undergo numerous improvements. Suppose, for example, that after a number of experiments, I begin to wonder if it's not actually removing the aphids that affects the plants, but simply the rubbing of the leaves with my fingers that I perform in order to squish the little parasites. So as part of my control procedure, I devise a way to rub the leaves of the plants in the untreated group, while carefully avoiding those villainous arthropods. Not a very plausible scenario, I suppose, but if we give a tentative name to this putative phenomenon, we can appreciate how analogous processes might be very important in other fields. For the sake of argument, let's call it a placebo effect.

Next I begin to worry that I might be subconsciously influencing the outcome of my experiments. Because I'm keen on the hypothesis I'm testing, (think of the agricultural benefits such knowledge could offer!) I worry that I am inadvertently biasing my seed selection, so that healthier looking seeds go into the treatment group, more than into the control group. I can fix this, however, by randomly allocating which group each seed goes into, thereby setting the prior for yet another alternate mechanism to zero. The vital nature of randomization, when available, in collecting good quality scientific data is something we noted already, when looking at Simpson's paradox, and is something that has been well appreciated for at least a hundred years.

Randomization isn't only for alleviating experimenter biases, either. Suppose that my flower pots are filled with soil by somebody else, with no interest in or knowledge of my experimental program. I might be tempted to use every second pot for the control group, but suppose my helper is also filling the pots in pairs, using one hand for each. Suppose also that the pots filled with his left hand receive inadvertently less soil than those filled with his right hand. Unexpected periodicities such as these are also taken care of by proper randomization.
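To make that last point concrete, here's a minimal Python sketch (the pot numbers are invented): alternating allocation lines the control group up exactly with any hidden periodicity in pot preparation, while random allocation breaks it up.

```python
import random

# A toy setup: 20 pots, filled in order, where (unknown to us) the
# even-numbered pots were filled with the left hand.
pots = list(range(20))

# Alternating allocation: the control group is exactly the
# left-hand-filled pots, so the hidden bias is fully confounded.
alternating_control = pots[::2]

# Random allocation: any periodicity is scattered across both groups.
random.seed(42)  # fixed seed, so this sketch is reproducible
control = sorted(random.sample(pots, len(pots) // 2))
treatment = [p for p in pots if p not in control]

print(control)
print(treatment)
```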

Making real-world observations, and lots of them; control groups; placebo controls; and randomization: some exceedingly obvious measures, some less so, but all contained in that beautiful little theorem. Add these to our Bayesian formalization of Ockham's razor, and its extension, resulting in an explanation for the principle of falsifiability, and we can not avoid noticing that science is a thoroughly Bayesian affair.


You might like to look again at the 3 basic rules of probability theory, if your memory needs refreshing.

To derive equation (3), above, we can write down Bayes' theorem in a slightly strange way:

P(D | [H1 + H2 + ....] I)  =  P(D | I) × P([H1 + H2 + ....] | D I) / P([H1 + H2 + ....] | I)     (A1)

This might look a bit backward, but thinking about it a little abstractly, before any particular meaning is attached to the symbols, we see that it is perfectly valid. If you're not used to Boolean algebra, or anything similar, let me reassure you that it's perfectly fine for a combination of propositions, such as A + B + C, (where the + sign means 'or') to be treated as a proposition in its own right. If equation (A1) looks like too much, just replace everything in the square brackets with another symbol, X.

As long as all the various sub-hypotheses, Hi, are mutually exclusive, then when we apply the extended sum rule above and below the line, the cross terms vanish, and (A1) becomes:

P(D | [H1 + H2 + ....] I)  =  P(D | I) × [ P(H1 | D I)  +  P(H2 | D I)  +  ..... ] / [ P(H1 | I)  +  P(H2 | I)  +  ..... ]     (A2)

We can multiply out the top line, and also make note that for each hypothesis, Hi, we can make two separate applications of the product rule to the expression P(Hi D | I), to show that

P(D | I)  =  P(Hi | I) P(D | Hi I) / P(Hi | D I)     (A3)

(This is actually exactly the technique by which Bayes' theorem itself can be derived.)

Substituting (A3) into (A2), we see that

P(D | [H1 + H2 + .....] I)  =  [ P(H1 | I) P(D | H1 I)  +  P(H2 | I) P(D | H2 I)  +  ..... ] / [ P(H1 | I)  +  P(H2 | I)  +  ..... ]

which is the result we wanted.
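The result is also easy to check numerically, with made-up numbers: simulate the joint distribution over three mutually exclusive, exhaustive hypotheses, and compare the observed frequency of D among the 'H1 or H2' samples against the prior-weighted average of the likelihoods (a Python sketch):

```python
import random

# Made-up priors and likelihoods for three mutually exclusive,
# exhaustive hypotheses.
priors = {"H1": 0.2, "H2": 0.3, "H3": 0.5}
lik = {"H1": 0.9, "H2": 0.5, "H3": 0.1}   # P(D | Hi I)

# The weighted-average formula for P(D | [H1 + H2] I):
formula = (priors["H1"] * lik["H1"] + priors["H2"] * lik["H2"]) \
          / (priors["H1"] + priors["H2"])

# Monte Carlo check: sample a hypothesis from the prior, then D from
# the likelihood, and count how often D occurs among H1-or-H2 worlds.
random.seed(0)
hits = trials = 0
for _ in range(200_000):
    r = random.random()
    h = "H1" if r < 0.2 else ("H2" if r < 0.5 else "H3")
    d = random.random() < lik[h]
    if h != "H3":
        trials += 1
        hits += d
estimate = hits / trials

print(formula, round(estimate, 3))
```

The two numbers agree to within sampling error, as the derivation says they must.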

[1]  From Wikipedia:
Scientism is a term used, usually pejoratively, to refer to belief in the universal applicability of the scientific method and approach, and the view that empirical science constitutes the most authoritative worldview or most valuable part of human learning to the exclusion of other viewpoints.

Thursday, August 16, 2012

Bayes' Theorem: All You Need to Know About Theology

At the end of my previous post, I argued that an intuitive grasp of Bayesian model comparison is an invaluable asset to those wishing to apply scientific method. Even if one is never going to execute rigorous calculations of the form described in that article, it is still possible to gain important insight into the plausibility of different descriptions of reality, merely by exercising one’s familiarity with the general structure of the formal theory. Here, I’ll illustrate the application of this kind of informal reasoning, which, while not making use of even a single line of algebra, can still be seen to be quite watertight.

One thing I have tried to get across in my writing here is that scientific method is for everybody and can address all meaningful questions concerning fact. If a matter of fact has real consequences for the state of reality, then it is something that can be scrutinized by science. If it has no real consequences, then it is, to put it most charitably, an extremely poor class of fact. In this article, I’ll apply probability theory to the question of whether or not the universe is the product of some omnipotent deity. It’s a lot simpler to do than you might think.

Now there are some (including, sadly, some scientists) who maintain that what I am going to do here is inappropriate and meaningless. To many of these people, reality is divided into two classes of phenomena: the natural and the supernatural. Natural phenomena, they say, are the things that fall into the scope of science, while the supernatural lies outside of science’s grasp, and can not be addressed by rational investigation. This is completely muddle-headed, as I have argued elsewhere. If we can measure it, then it falls into science’s domain. If we can’t measure it, then postulating its existence achieves nothing.

Other disguised forms of this argument exist. I was once asked by another physicist (and good friend): ‘How can science be so arrogant to think that it can address all aspects of reality?’ To which the answer is obvious: if you wish to claim that there is something real that can not be investigated scientifically, how can you be so arrogant to think that you know what it is? What could possibly be the basis for this knowledge?

As I said, addressing the existence of God with probability theory is quite simple to achieve. In fact, it is something that one of my mathematical heroes, Pierre-Simon Laplace, achieved with a single sentence, in a conversation with Napoleon I. The conversation occurred when the emperor was congratulating the scientist on his new book on celestial mechanics, and proceeded as follows:


Napoleon:

You made the system of the world, you explain the laws of all creation, but in all your book you speak not once of the existence of God!

Laplace:

Sire, I had no need of that hypothesis.

Lagrange (another mathematician who was also present):

Ah, but that is such a good hypothesis. It explains so many things!

Laplace:

Indeed, Sire, Monsieur Lagrange has, with his usual sagacity, put his finger on the precise difficulty with the hypothesis: it explains everything, but predicts nothing.

I believe this might be the world’s earliest recorded application of Bayesian model comparison.

By the way, a quick note of thanks: when I first came across the full version of this exchange, I struggled to find strong reason to treat it as more than a legend, but historian and Bayesian Richard Carrier has pointed me to sources that strongly boost the odds that this conversation was a real event. Richard also presents arguments from probability theory pertaining to religious matters. See, for example, this video.

Now, to see what Laplace was on about, we should think about model comparison in the terms that I have introduced here and discussed further in the article linked to above. It's true that Bayesian model comparison would not be formally described for more than a hundred years after Laplace’s death, but as the founder of Bayesian inference and a mathematician of extraordinary genius and natural insight, he must have been capable of perceiving the required logic. (Part of the beauty of Bayesian statistics is that hypothesis testing, parameter estimation, and model comparison are really only slightly different versions of the same problem – this gives it a logical unity and coherence that other approaches can only dream of enviously.)

To approach the problem, let's imagine a data set of just a few points – let's say 6 points – which we would like to fit with a polynomial function. The obvious first choice is to try a straight line. Illustrated below are the imagined data and the fitted straight line, which is the maximum likelihood estimate.

Because there is noise in the data, the fitted line misses all the data points, so there are some residuals associated with this fit. Is there a way to reduce the residuals, i.e. to have a model that passes closer to the measured data points? Of course there is: just increase the number of free parameters in the fitting model. In fact, with only six data points, a polynomial with terms up to and including the fifth power is already sufficient to guarantee that the residuals are reduced to exactly zero, as illustrated below, with exactly the same data as before.
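This is easy to demonstrate numerically. A sketch using numpy, with invented data: a fifth-order polynomial through six points leaves residuals at machine precision, wherever the points happen to lie, while the straight line leaves residuals on the order of the noise level.

```python
import numpy as np

# Invented data: 6 points from a noisy straight line.
rng = np.random.default_rng(1)
x = np.arange(6.0)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=6)

# Degree-1 fit (2 free parameters): residuals comparable to the noise.
resid1 = y - np.polyval(np.polyfit(x, y, 1), x)

# Degree-5 fit (6 free parameters): interpolates all 6 points,
# leaving residuals at machine precision.
resid5 = y - np.polyval(np.polyfit(x, y, 5), x)

print(np.abs(resid1).max())
print(np.abs(resid5).max())
```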

Is this sufficient to make the fifth-order polynomial the more likely model? Certainly not. This more complex model has 6 fitting parameters, as opposed to only 2 for the linear fit. As I explained previously, though, each additional degree of freedom adds another dimension to the parameter sample space, which necessarily reduces the amount of prior probability for the parameters in the maximum likelihood region – the available prior probability needs to be spread much more thinly in order to cover the extended sample space. This is the penalty introduced in the form of the Ockham factor. This reduced prior probability, of course, results in a lower posterior probability for the model in most cases.

Now the hypothesis that Napoleon and Lagrange wanted Laplace to take seriously, the one about that omnipotent deity, is one with infinite degrees of freedom. That’s the definition of omnipotent: there is nothing that God can’t do if it wants to. That means infinitely many dimensions in the parameter sample space, and therefore infinitely low prior probability at all points. To see the probability that God exists vanish to zero, we only need to postulate any alternative model of reality with finite degrees of freedom. If my interpretation of Laplace’s comment is correct, this is the fact that he was able to perceive: that there is simply no amount of evidence that could raise the hypothesis of an omnipotent deity to a level of plausibility competitive with other theories of reality.

And what if we relax the requirement for God to be strictly omnipotent? It makes bugger all difference. Every prayer supposedly answered or not answered, every person saved from tragedy or not saved, every event attributed to God’s will represents another degree of freedom. That’s still a tremendous number of degrees of freedom, and while it may be finite, it’s still many orders of magnitude greater than the numbers of free parameters that most specialists would claim to be sufficient for a complete theory of the universe.

At this point, we can take note of yet another important scientific principle that can be recognized as just a special case of Bayes’ theorem, this time Karl Popper’s principle of falsifiability. Popper recognized that in order for a hypothesis to be treated as scientific, and worthy of rational investigation, it must be vulnerable to falsification. That means that a theory must be capable of making specific predictions, which, if they fail to arise in a suitable experiment, will identify the theory as false. If a theory is not falsifiable, then any data nature throws our way can be accommodated by it. This means that the theory predicts nothing whatsoever. Popper coined the term ‘pseudoscience’ for theories like this, such as psychoanalysis and astrology.

Now, if a theory is consistent with all conceivable data sets (i.e. unfalsifiable), this means that the associated model curve is capable of traversing all possible paths through the sample space for the data - just like the 5th order polynomial was able to land exactly on all 6 data points, above, regardless of where they were. Assuming that there is no limit to the number of observations we can accrue, this implies that the model has infinite degrees of freedom, which, as we have just discovered, is really bad news: thanks to the penalty introduced by the Ockham factor, this leaves you with a theory with zero credibility.

The fact that we can derive important and well-known principles of common sense and scientific methodology, such as Ockham’s razor and the principle of falsifiability, as consequences of Bayes’ theorem illustrates further what I have said above about the logical unity of this system. This is why I believe that Bayesian inference, along with the broader theory that it fits into, constitutes the most comprehensive and coherent theory of how knowledge is acquired. (I’ll get round to that broader theory some day, but it should be clear already that Bayes’ theorem is a consequence of more general principles.)

Much of my interest in science comes from deriving great pleasure from knowledge. Real respect for knowledge, however, demands an assessment of its quality. It's not enough to know that experts say that a meteor hitting the Earth killed off the dinosaurs - I want to know how convincingly that explanation stands up beside competing hypotheses. That’s why I’m interested in probability. This is the theory that permits this necessary appraisal of knowledge, the theory of how we know what we know, and how well we know it. Science is the systematic attempt to maximize the quality of our knowledge, and probability is therefore also the underlying theory of science.

Let's recap the main points, in terms as simple as I can manage. If I try to fit a sequence of data points with a straight line, Ax + B, then there are 2 adjustable model parameters, A and B. So the chosen parameters are represented by coordinates, (x, y), on a two-dimensional plane. If I want a more complicated model, with one more degree of freedom, then the chosen point becomes (x, y, z), in a 3D space. Each free parameter results in an additional dimension for the parameter space. In the 2D case, for example, the prior probability for the point (x, y) is the product of the individual prior probabilities: 

P(x | I) × P(y | I)

Since these prior probabilities are all less than one, then the more degrees of freedom there are, the smaller the prior probability will be for any particular point, (x, y, z, ….). If there are infinite degrees of freedom, then the prior probability associated with any point in the parameter space will be zero.
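To put a toy number on this: suppose each free parameter takes a uniform prior over a grid of 100 admissible values (the grid size is invented purely for illustration). Then the prior mass on any particular joint setting of the parameters falls geometrically with the number of parameters:

```python
def prior_per_point(n_params, grid_size=100):
    """Uniform prior mass on a single point of an n_params-dimensional
    parameter grid, with grid_size admissible values per parameter."""
    return (1.0 / grid_size) ** n_params

# 1 parameter: 0.01; 2 parameters: 1e-4; 6 parameters: ~1e-12.
for d in (1, 2, 3, 6):
    print(d, prior_per_point(d))
```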

The posterior probability for a model depends strongly on this prior probability distribution over the parameter space, as shown by equation (4) in my article on the Ockham factor. If the prior probabilities for the points (x, y, z, …) in the parameter space are all zero, then the probability for the model is also zero.

Any unfalsifiable theory must have infinite degrees of freedom in order to be able to remain consistent with all conceivable observations. With limited degrees of freedom, the complexity of the path traced by the model curve will also be limited, and the theory will be vulnerable to falsification – the model curve will not be guaranteed to be able to find a path that travels to each data point. Any unfalsifiable theory, therefore, has zero posterior probability. This includes the hypothesis of an omnipotent deity. Because of its unlimited powers, such an entity is capable of producing any sequence of events it chooses, meaning that we need a model curve with infinite free parameters to be guaranteed access to all data points. 

Tuesday, July 31, 2012

The Ockham Factor

In an earlier post, I described how the method of maximum likelihood can deviate radically from the more logical results obtained from Bayes’ theorem, when there is strong prior information available. At the end of that post, I promised to describe another situation where maximum likelihood can fail to capture the information content of a problem, even when the prior distribution for the competing hypotheses is relatively uninformative. Having introduced the foundations of Bayesian model comparison, here, I can now proceed to describe that situation. I’ll follow closely a line of thought developed by Jaynes, in the relevant chapter from ‘Probability theory: the logic of science.’

In parameter estimation we have a list of model parameters, θ, a dataset, D, a model, M, and general information, I, from which we formulate

P(θ | DMI)  =  P(θ | MI) × P(D | θMI) / P(D | MI)     (1)

P(D | θMI) is termed the likelihood function, L(θ | MDI), or more briefly, L(θ). The maximum likelihood estimate for θ is denoted θ̂.

When we have a number of alternative models, denoted by different subscripts, the problem of assessing the probability that the jth model is the right one is just an equivalent one to that of parameter estimation, but carried out at a higher level:

P(Mj | DI)  =  P(Mj | I) × P(D | Mj I) / P(D | I)     (2)

The likelihood function here makes no assumption about the set of fitting parameters used, and so it decomposes into an integral over the entire available parameter space (extended sum rule, with P(HiHj) = 0, for all i ≠ j):

P(D | Mj I)  =  ∫ P(θ | Mj I) P(D | θ Mj I) dθ     (3)

and the denominator is as usual just the summation over all n competing models, and so we get

P(Mj | DI)  =  P(Mj | I) ∫ P(θ | Mj I) P(D | θ Mj I) dθ / Σk P(Mk | I) ∫ P(θ | Mk I) P(D | θ Mk I) dθ     (4)
In orthodox hypothesis testing, the weight that a hypothesis is to be given (P(H | DI) for us) is addressed with a surrogate measure: the likelihood function, which is what we would write as P(D | HI). Parameter estimation has never been considered in orthodox statistics to be remotely connected with hypothesis testing, but, since we now recognize that they are really the same problem, we can see that if the merits of two models, Mj and Mk, are to be compared, orthodox statistics should do so by calculating the ratio of the likelihood functions at their points of maximum likelihood:

L(θ̂j) / L(θ̂k)  =  P(D | θ̂j Mj I) / P(D | θ̂k Mk I)     (5)
This, I admit, is something of a straw-man metric – I’m not aware if anybody actually uses this ratio for model comparison, but if we accept the logic of maximum likelihood, then we should accept the logic of this method. (Non-Bayesian methods that attempt to penalize a model for having excessive degrees of freedom, such as the adjusted correlation coefficient, exist, but seem to me to require additional ad hoc principles to be introduced into orthodox probability theory.) Let’s see, then, how the ratio in expression (5) compares to the Bayesian version of Ockham’s razor.

In calculating the ratio of the probabilities for two different models, the denominator will cancel out – it is the same for all models.

Let's express the numerator in a new form by defining the Ockham factor, W (for William, I guess), as follows:

P(D | Mj I)  =  L(θ̂j) Wj     (6)

which is the same as

Wj  =  P(D | Mj I) / L(θ̂j)     (7)

from which it is clear, using equation (3), that

Wj  =  [ ∫ P(θ | Mj I) L(θ) dθ ] / L(θ̂j)     (8)

Now we can write the ratio of the probabilities associated with the two models, Mj and Mk, in terms of expression (5):

P(Mj | DI) / P(Mk | DI)  =  [ P(Mj | I) / P(Mk | I) ] × [ L(θ̂j) / L(θ̂k) ] × [ Wj / Wk ]     (9)
So the orthodox estimate is multiplied by two additional factors: the prior odds ratio and the ratio of the Ockham factors. Both of these can have strong impacts on the result. The prior odds for the hypotheses under investigation can be important when we already had reason to prefer one model to the other. In parameter estimation, we’ll often see that these priors make negligible difference in cases where the data carry a lot of information. In model comparison, however, prior information of another kind can have enormous impact, even when the data are extremely informative. This is introduced by the means of the Ockham factor, which can easily overrule both the prior odds and the maximum-likelihood ratio.

In light of this insight, we should take a moment to examine what the Ockham factor represents. We can do this in the limit that the data are much more informative than the priors, and the likelihood function is sharply peaked at θ̂ (which is quite normal, especially for a well designed experiment). In this case, we can estimate the sharply peaked function L(θ) as a rectangular hyper-volume with a ‘hyper-base’ of volume V, and height L(θ̂). We choose the size of V such that the total enclosed likelihood is the same as the original function:

∫ L(θ) dθ  =  L(θ̂) V     (10)
We can visualize this in one dimension, by approximating the sharply peaked function below by the adjacent square pulse, with the same area:

Since the prior probability density P(θ | MI) varies slowly over the region of maximum likelihood, we observe that

W  ≈  P(θ̂ | MI) V     (11)

which indicates that the Ockham factor is a measure of the amount of prior probability for θ that is packed into the high-likelihood region centered on θ̂, which has been singled out by the data. We can now make more concrete how this amount of prior probability in the high likelihood region relates to the ‘simplicity’ of a model by noting again the consequences of augmenting a model with an additional free parameter. This additional parameter adds another dimension to the parameter space, which, by virtue of the fact that the total probability density must be normalized to unity, necessitates a reduction in the magnitude of the peak of P(θ | MI), compared to the low-dimensional model. Let's assume that the more complex model is the same as the simple one, but with some additional terms patched in (the parameter sample space for the smaller model is embedded in that of the larger model). Then we see that it is only if the maximum-likelihood region for the more complex model is far from that of the simpler model, that the Ockham factor has a chance to favor the more complex model.
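As a worked illustration of the Ockham factor (a Python sketch, with invented binomial data: x = 21 successes in N = 30 trials): compare a zero-parameter model, M0, which fixes the success frequency at f = 0.5, against a one-parameter model, M1, which gives f a uniform prior on [0, 1]. M1's Ockham factor is its marginal likelihood divided by the maximum of its likelihood function.

```python
from math import comb

def lik(x, N, f):
    """Binomial likelihood, P(x | N, f)."""
    return comb(N, x) * f**x * (1 - f)**(N - x)

# Invented data: x = 21 successes in N = 30 trials.
N, x = 30, 21

# M0: no free parameters, f fixed at 0.5; its evidence is just
# its likelihood.
evidence_m0 = lik(x, N, 0.5)

# M1: f free, with a uniform prior on [0, 1]. The marginal
# likelihood integrates exactly to 1 / (N + 1), whatever x is.
evidence_m1 = 1.0 / (N + 1)

# The likelihood of M1 is maximized at f_hat = x / N.
l_max = lik(x, N, x / N)

# Ockham factor: evidence divided by maximum likelihood (<= 1).
W = evidence_m1 / l_max

print(evidence_m0, evidence_m1, W)
```

M1's evidence (≈ 0.032) is well below its maximum likelihood (≈ 0.16); the Ockham factor, W ≈ 0.2, is the price M1 pays for spreading its prior over the whole of [0, 1]. Here the data are informative enough that M1 still beats M0 (evidence ≈ 0.013), despite the penalty.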

For me, model comparison is closest to the heart of science. Ultimately, a scientist does not care much about the magnitude of some measurement error, or trivia such as the precise amount of time it takes the Earth to orbit the sun. The scientist is really much more interested in the nature of the cause and effect relationships that shape reality. Many scientists are motivated by desire to understand why nature looks the way it does, and why, for example, the universe supports the existence of beings capable of pondering such things. Model comparison lifts statistical inference beyond the dry realm of parameter estimation, to a wonderful place where we can ask: what is going on here? A formal understanding of model comparison (and a resulting intuitive appreciation) should, in my view, be something in the toolbox of every scientist, yet it is something that I have stumbled upon, much to my delight, almost by chance.