## Sunday, December 22, 2013

### Confounded Koalas

Koalas are not as exclusive as kangaroos. At least, when it comes to their drinking habits. As I explained before, kangaroos drink beer or whisky, but not both. Koalas like to mix things up a bit more, when it comes to their choice of drink, but how much exactly? What is the probability, for example, that any given koala who drinks beer on any given night will also drink whisky on the same night? These are the sorts of urgent questions that science must seek to answer with the utmost speed and accuracy.

It's an empirical question, so we'll need some empirical evidence. Luckily, I've been out in the field already. I went deep into the outback one day, and asked lots of koalas what they had been drinking the night before. To save time, though, I didn't bother questioning animals that I believed hadn't been drinking anything on the previous evening. To do this, I polled only hung-over koalas. You see, when a koala gets a hangover, its nose turns bright red, which can be seen from quite a distance. This clever strategy saved me a lot of time walking through the Australian scrub, trying to catch up with subjects who couldn't add anything to my required data set. Here are the numbers I obtained, after interviewing 1222 red-nosed koalas:

| Drinks consumed | Number of koalas |
|---|---|
| Beer but no whisky | 505 |
| Whisky but no beer | 436 |
| Beer and whisky | 281 |

So the total number of whisky drinkers, for example, came to 59% of the 1222 study participants. Of the subset of those 1222 individuals who drank beer on the preceding evening, however, the number that also drank whisky came to only 36%. Thus we conclude that consumption of whisky is anti-correlated with consumption of beer - if I know that an individual has consumed beer, I consider it less likely to have drunk whisky than otherwise. Letting A represent the proposition that a koala drank whisky and B the proposition that it drank beer, we conclude that P(A) > P(A | B).
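As a quick arithmetic check, the two percentages can be recomputed from the three survey counts. This is only a minimal Python sketch; the variable names are illustrative labels of my own choosing.

```python
# Survey counts reported above (red-nosed koalas only).
beer_only = 505
whisky_only = 436
both = 281

total = beer_only + whisky_only + both   # all koalas interviewed
whisky_drinkers = whisky_only + both     # anyone who drank whisky
beer_drinkers = beer_only + both         # anyone who drank beer

# Share of the sample that drank whisky: the naive estimate of P(A).
p_whisky = whisky_drinkers / total
# Share of beer drinkers who also drank whisky: the estimate of P(A | B).
p_whisky_given_beer = both / beer_drinkers

print(f"P(A) estimate:     {p_whisky:.3f}")
print(f"P(A | B) estimate: {p_whisky_given_beer:.3f}")
```

The first figure rounds to 59% and the second to 36%, matching the values quoted in the text.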

Ok, confession time. Brace yourself, this is going to come as quite a shock. These are completely made up data. I've never even been to Australia.

But here is the really weird thing:

The numbers above were actually produced by a randomized model that assumed a complete lack of correlation between beer drinking and whisky consumption. The two were assumed independent, meaning that in fact, P(A) = P(A | B), in contrast with the strong impression given by the generated data.

The number of koalas simulated is quite large, so this isn't a case of random noise producing a spurious finding. In fact, what we've been the victim of here is a kind of biased sampling, known as Berkson's paradox. Our attempt to investigate the relationship between two variables, A and B, has been confounded in an interesting way by a third variable, C.

A fairly trivial special case of the product rule states that (where, as always, X+Y denotes 'X or Y')

P(A.[A+B]) = P(A | A+B) × P(A+B)

and because the conjunction of two propositions, XY, is identical to YX, we also have

P(A.[A+B]) = P(A+B | A) × P(A)

Now, A+B is a sure thing, if A is already known to be true, so combining these two results gives

P(A) = P(A | A+B) × P(A+B)

Assuming that neither P(A) nor P(B) is 0 or 1, this means that

P(A) < P(A | A+B)    (1)

which is something we ought to have expected already - knowledge that at least one of the propositions A and B is true constitutes good evidence that A is true.
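Inequality (1) is easy to check numerically. The sketch below rearranges the preceding result as P(A | A+B) = P(A) / P(A+B) and uses the sum rule P(A+B) = P(A) + P(B) - P(AB). The function name and the probabilities are illustrative assumptions, not values taken from the koala data.

```python
def p_a_given_a_or_b(p_a: float, p_b: float, p_ab: float) -> float:
    """Return P(A | A+B) = P(A) / P(A+B), with P(A+B) = P(A) + P(B) - P(AB)."""
    p_a_or_b = p_a + p_b - p_ab
    return p_a / p_a_or_b


# Illustrative values only: A and B independent, so P(AB) = P(A) * P(B).
p_a, p_b = 0.3, 0.5
p_ab = p_a * p_b

conditional = p_a_given_a_or_b(p_a, p_b, p_ab)
print(f"P(A) = {p_a:.3f}, P(A | A+B) = {conditional:.3f}")

# Edge case: if B is certain, then A+B is certain too, P(A+B) = 1,
# and the strict inequality collapses to an equality.
edge = p_a_given_a_or_b(0.3, 1.0, 0.3)
print(f"With P(A+B) = 1: P(A | A+B) = {edge:.3f}")
```

Because P(A+B) can never exceed 1, dividing by it can only raise P(A) or leave it unchanged; the strict inequality needs P(A+B) to be strictly less than 1.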

But a good way to be confident that A+B is true is if events A and B are both separately known to cause another event C, which is known to have occurred. This is what has happened on this occasion: C is the hangover I used to select subjects for the study. So while we were trying to estimate P(A), what we actually measured was more indicative of P(A | A+B), and thus our result of 59% was, from equation (1), an overestimate.

What happened when we were estimating P(A | B)?

A simple way to get to grips with this is by drawing a truth table, to compare the 2 propositions, "A or B" and "B and (A or B)":

| A | B | A+B | B.(A+B) |
|---|---|-----|---------|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 |

From the truth table, it's quite clear that "B.(A+B)" has, under all circumstances, the same values as "B" - it is the same proposition. Thus in obtaining an estimate for P(A | B.[A+B]), which we inadvertently did, when what we wanted was P(A | B), the addition of the extra information, B, erased any effect of our prior knowledge of A+B that isn't preserved in our knowledge of just B. Therefore, P(A | B.[A+B]) = P(A | B), and our measured proportion of 36% of all beer drinkers who drink whisky did not suffer any distortion due to my chosen selection method.
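The same equivalence can be confirmed mechanically by enumerating all four truth assignments. This is a small Python sketch, with variable names chosen purely for illustration.

```python
from itertools import product

# Enumerate every truth assignment for A and B.
rows = []
for a, b in product([False, True], repeat=2):
    a_or_b = a or b                # the proposition "A+B"
    b_and_a_or_b = b and a_or_b    # the proposition "B.(A+B)"
    rows.append((a, b, a_or_b, b_and_a_or_b))
    print(int(a), int(b), int(a_or_b), int(b_and_a_or_b))

# "B.(A+B)" agrees with "B" on every row, so they are the same proposition.
same_as_b = all(b == combined for _, b, _, combined in rows)
print("B.(A+B) equivalent to B:", same_as_b)
```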

Because my figure for P(A) was overestimated, however, while the result for P(A | B) was not, the effect was a spurious impression that P(A | B) < P(A), implying negative correlation, contrary to the reality of the data-generating process.

In fact, the numbers I gave above came from 2000 simulated koalas, randomly assigned as beer drinkers with probability 0.4, and independently assigned as whisky drinkers with probability 0.36. The 778 koalas that drank nothing never made it into my survey, leaving the 1222 reported above. The generating probabilities are well reflected in the observed proportions for beer drinkers (786/2000 = 0.393) and whisky drinkers (717/2000 = 0.359). The result for P(A | B) was also perfectly consistent with independence - the proportion of beer drinkers who indulged in whisky was the same as the proportion for the entire population, 36%.
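A simulation along these lines can be sketched in a few lines of Python. This is not the code that produced the numbers above: the function names, the seed, and the rule that every koala with at least one kind of drink ends up hung over are assumptions made for illustration, so the exact counts will differ from those quoted.

```python
import random


def simulate_koalas(n=2000, p_beer=0.4, p_whisky=0.36, seed=0):
    """Draw n koalas with independent beer and whisky habits.

    Returns a list of (beer, whisky) boolean pairs. The seed is an arbitrary
    choice for reproducibility, not the one behind the quoted counts.
    """
    rng = random.Random(seed)
    return [(rng.random() < p_beer, rng.random() < p_whisky) for _ in range(n)]


def proportion(flags):
    """Fraction of True values in a non-empty list of booleans."""
    return sum(flags) / len(flags)


koalas = simulate_koalas()

# Selection step (assumed): any drink at all produces a red nose.
hung_over = [(beer, whisky) for beer, whisky in koalas if beer or whisky]

p_a_population = proportion([w for _, w in koalas])
p_a_sample = proportion([w for _, w in hung_over])
p_a_given_b_population = proportion([w for b, w in koalas if b])
p_a_given_b_sample = proportion([w for b, w in hung_over if b])

print(f"P(A), whole population:      {p_a_population:.3f}")
print(f"P(A), hung-over sample:      {p_a_sample:.3f}")
print(f"P(A | B), whole population:  {p_a_given_b_population:.3f}")
print(f"P(A | B), hung-over sample:  {p_a_given_b_sample:.3f}")
```

Under these assumptions the hung-over sample inflates the estimate of P(A), while the estimate of P(A | B) is identical in the sample and the whole population, because every beer drinker is already in the sample.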

With Berkson's paradox, our attempt to draw inferences about the relationship between two variables, A and B, was confounded by a third, correlated variable, C. Something very similar was going on, when we examined Simpson's paradox, but the effect of the confounder was slightly different. With Simpson's paradox, two non-independent variables, A and B, are rendered conditionally independent upon receipt of information concerning C (A is "screened off" from B by C, in the language of the graph theorists) - without knowing C, we are lulled into incorrectly thinking that A is a direct cause of B.

With Berkson's fallacy, the effect is opposite: knowledge of C (or inadvertently selecting a biased sample such that C was true on an excess number of occasions) made two otherwise uncorrelated variables appear to be dependent upon one another. The effect was such that occurrence of A seemed to suppress the occurrence of B. (Note that even if I hadn't consciously decided to look only for hung-over animals, their bright red noses would have been easier to spot in the undergrowth, leading to their being over-represented in the survey, which would have had a similar effect.)

While the third variable, C, is ignored, it can confound our scientific efforts, but once brought to our attention, figuring out what causes what actually becomes easier. Thanks to differences in effects, such as those differences between Simpson's and Berkson's paradoxes just sketched, it can actually help us distinguish between different classes of causal relationships. If C screens off A from B, then certain distributions of cause and effect can be ruled out, while certain other causal relationships are excluded when C introduces dependence between A and B. [This paragraph was modified slightly on 12-23-2013 to remove an error.]

Full causal analysis is only possible when we perform controlled interventions (e.g. randomized controlled clinical trials), but if we stretch our intelligence, there is a lot that can still be done when intervention is difficult or impossible to implement - a situation many scientists have to live with. (Cosmology, anyone? Geology? Archaeology? Just a few examples.)

#### 1 comment:

1. Just noticed, in deriving equation 1, I invoked the condition that 'neither P(A) nor P(B) is 0 or 1,' but this isn't sufficient for the proof - it also should be true that A and B don't form an exhaustive set of hypotheses. (If they did, P(A+B) would be 1, and the inequality would become an equality.)