In an earlier post on the base-rate fallacy, I made use of two important terms,
‘false-positive rate’ and ‘false-negative rate,’ without taking much time to
explain them. These are concepts we need to be careful with because, simple
though they are, they have been given terrible names.
Let's start with
the false positive rate for a test. This could mean any one of a number of
things:
(1) it could be the expected proportion of results produced by the test that are both false and positive
(2) it could be the proportion of positive results that are false
(3) it could be the proportion of false results that are positive
So which one is
it? Are you ready?
None of the above.
The false
positive rate is the proportion of negative cases that are registered as
positive by the test. This definition is widespread and agrees with the
Wikipedia article ‘Type I and Type II Errors’. If it is a diagnostic test for
some disease, it is the fraction of healthy people who will be told that they
have the disease (or referred for further diagnosis).
My claim that
the term is confusing, however, is supported by looking at another Wikipedia
article, ‘False Positive Rate,’ which provides the definition: “the probability
of falsely rejecting the null hypothesis.” The null hypothesis is the
proposition that there is no effect to measure – what I have called a negative
case. The probability of falsely rejecting the null hypothesis, therefore,
depends on the probability that there is no effect to measure, while the
proportion of negative cases that are registered as positive does not. The
alternate definition in this second Wikipedia article is the same as my number
(1), above.
Number (2) on
the list above is the posterior probability that the null hypothesis is true,
given that the test has indicated it to be false. It is obtained using Bayes’
theorem. Confusion between this posterior probability and the false-positive
rate is the base-rate fallacy, yet again. Unfortunately, there is little about
the term ‘false-positive rate’ to steer one away from these misconceptions.
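To make the distinction concrete, here is a minimal sketch in Python of number (2) on the list above – the posterior probability that a positive result is false – computed from Bayes' theorem. The particular rates and the 1% prevalence are made-up numbers for illustration only:

```python
# Sketch: why the false-positive rate is not the probability that a
# positive result is false. All numbers below are illustrative assumptions.

def posterior_false(fpr, fnr, prior):
    """P(no effect | positive result), via Bayes' theorem."""
    p_pos_given_h0 = fpr        # negative case wrongly flagged positive
    p_pos_given_ha = 1.0 - fnr  # real effect correctly flagged positive
    p_h0 = 1.0 - prior          # prior probability that there is no effect
    return (p_pos_given_h0 * p_h0) / (
        p_pos_given_h0 * p_h0 + p_pos_given_ha * prior)

# A test with a 5% false-positive rate and a 20% false-negative rate,
# applied where only 1% of cases are real effects:
p = posterior_false(fpr=0.05, fnr=0.2, prior=0.01)
print(round(p, 3))  # 0.861: most positives are false, despite the 5% FPR
```

Even with a seemingly respectable 5% false-positive rate, the low prior drags the reliability of a positive result down dramatically – which is exactly the base-rate fallacy in action.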
What makes the
situation much worse is that most scientific research is assessed using the
false-positive rate, while what we should really be interested in is assigning a posterior probability to the proposition under
investigation.
The
false-positive rate is often denoted by the Greek letter α. In classical
significance testing, α is used to define a significance level – or rather the
significance level defines α. The significance level is chosen such that the
probability that a negative case triggers the alarm is α. A negative case, once
again, is an instance where the null hypothesis is true, and there is no
effect. For example, if two groups are being treated for some affliction, one
group with an experimental drug and the other with a placebo, the null hypothesis
might be that both groups recover at the same rate, as the new treatment has no
specific impact on the disease. If the null hypothesis is correct, then in an
ideal measurement, there will be no difference between the recovery times of
the two groups.
But measurements are not ideal: there is ‘noise’ – random fluctuations in the system under study that produce a non-zero difference between the groups. If we can assign a probability distribution for this noise, however, we can define limits which, if exceeded by the measurement, suggest that the null hypothesis is false. If x is the measured difference in recovery time for the two groups, then there are two points, x_{α/2}, on the tails of the distribution, such that the integral of each tail beyond its limit is α/2. The total probability contained in these two tail areas therefore sums to α. The x_{α/2} points are chosen so that α is equal to some desired significance level, some acceptably low false-positive rate. (We cannot make the false-positive rate too small, because then we would too often fail to spot a real effect – the false-negative rate would be too high.) The integration is performed on each side of the error distribution, as we have no certainty in which direction the alternate hypothesis operates: the recovery time with the new drug might be worse than with no treatment, which would still lead to rejection of the null hypothesis.
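If we assume, purely for illustration, that the noise distribution is a standard normal, the two-tailed limits x_{α/2} can be computed directly with the Python standard library:

```python
from statistics import NormalDist  # Python 3.8+ standard library

# Assume the noise in the measured difference x is standard normal
# (an illustrative choice; the argument holds for any known distribution).
noise = NormalDist(mu=0.0, sigma=1.0)

alpha = 0.05
# Upper limit: the point beyond which the tail area is alpha/2.
x_crit = noise.inv_cdf(1.0 - alpha / 2.0)  # ~1.96; lower limit is -x_crit

# The two tail areas together contain probability alpha.
tail_area = 2.0 * (1.0 - noise.cdf(x_crit))
print(round(x_crit, 3), round(tail_area, 3))  # 1.96 0.05
```

This recovers the familiar result that, for α = 0.05 and Gaussian noise, the measurement must land roughly two standard deviations from zero before the null hypothesis is rejected.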
A common value
chosen for α is 0.05. That is, if the null hypothesis is true, then the
measured value of x will exceed x_{α/2} about one time in twenty. This
is the basis of how results are reported in probably most of science – if x
exceeds the prescribed significance level, then the null hypothesis is
rejected, and the result is classified as statistically significant and
considered a finding. If the measured value of x is between the two x_{α/2}
points, then the null hypothesis is not rejected, and the outcome of the study
is probably never reported (a fact that contributes hugely to the problem of publication bias in the scientific literature). This system for interpreting scientific data and
reporting outcomes of research is, let's be honest, a travesty.
Firstly, the
whole rationale of the significance test seems to me to be deeply flawed. The
idea is that there is some cutoff, beyond which we can declare the matter
final: ‘Yes, we have a finding here. Grand. Nothing more to do on that
question.’ How manifestly absurd! Such a binary declaration about a hypothesis,
necessary if one wishes to take real-world action based on scientific research,
needs to be based on decision theory, which combines probability theory
with an appropriate loss function (something that specifies the cost of making
a wrong decision). But the declarations arising in decision theory are not of the
form ‘A is true,’ but rather ‘we must act, and rationality demands that we act
as if A is true.’ Using some standard α-level is about the crudest substitute
for decision theory you could imagine.
Why should our test be biased so much in favour of the null hypothesis, anyway? The alternate hypothesis, H_{A}, almost always represents an infinity of hypotheses about the magnitude of the possible non-null effect, so a truly 'unbiased' starting point would seem to be one that de-emphasizes H_{0}. Remember that at the core of the frequentist philosophy (not the approach I wish to promote) is the dictum "let the data speak for themselves": don't let prior impressions taint the measurement.
Secondly, when a
finding is reported with a false-positive probability of 0.05, there appears to
be a feeling of satisfaction among the scientific community that 0.05 is the
probability that the positive finding is false. But meta-analyses regularly do
their best to dispel this myth. For example, looking only at genetic association studies, Ioannidis et al.^{1} reported that 'results of the first study correlate only modestly with subsequent research on the same association,' while Hirschorn et al.^{2} write that 'of the 166 putative associations which have been studied three or more times, only 6 have been consistently replicated.'
Thinking again
about the method for calculating the positive predictive value for a diagnostic
test, the probability that a positive finding is false is not determined only
by the falsepositive rate, α, but also by the false negative rate, and the
prior probability. The false negative rate, by analogy with the false positive
rate, is the proportion of real effects that will be registered as null
findings. It depends on the number of samples taken in the study and the
magnitude of the effect. In the medical trial example, if the new drug helps
people recover twice as fast, then there will be fewer false negatives than if
the difference is only 10%, and a trial with 100 patients in each group will
give a more powerful test than a trial with only 10 in each group.
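The dependence of the false-negative rate on effect size and sample size can be sketched with a normal-approximation calculation for a two-sample test. The test form, σ = 1, and the particular effect sizes are assumptions for illustration, not a general recipe:

```python
from statistics import NormalDist

def false_negative_rate(effect, n, sigma=1.0, alpha=0.05):
    """Approximate beta for a two-sample z-test: the probability that a
    real effect of the given size is missed, with n patients per group.
    (Normal-approximation sketch; sigma and the test form are assumed.)"""
    z = NormalDist()
    se = sigma * (2.0 / n) ** 0.5      # standard error of the difference
    x_crit = z.inv_cdf(1 - alpha / 2)  # two-tailed significance limit
    # Probability the measured difference falls below the limit
    # (the contribution of the opposite tail is negligible here).
    return z.cdf(x_crit - effect / se)

# More patients, or a larger effect, means fewer false negatives:
print(round(false_negative_rate(effect=0.5, n=10), 2))   # ~0.8
print(round(false_negative_rate(effect=0.5, n=100), 2))  # ~0.06
print(round(false_negative_rate(effect=1.0, n=10), 2))   # ~0.39
```

Note that the small trial with the modest effect gives β ≈ 0.8 – precisely the kind of low-power test considered below.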
If D_{f} represents
data corresponding to a finding, with x > x_{α/2}, then from Bayes’ theorem, the probability
that the null hypothesis will have been incorrectly rejected is

(1)   P(H_{0} | D_{f}, I) = P(D_{f} | H_{0}, I) P(H_{0} | I) / [P(D_{f} | H_{0}, I) P(H_{0} | I) + P(D_{f} | H_{A}, I) P(H_{A} | I)]
P(D_{f} | H_{0}, I) is α, and P(D_{f} | H_{A}, I) is 1 − β, where β is the false-negative rate. Using the standard α level of 0.05, I plot below this posterior probability as a function of the prior probability for the alternate hypothesis, H_{A}, for several false-negative rates.

To estimate the posterior error probability for a specific experiment, Wacholder et al.^{3} propose to use the p-value instead of α (the p-value is twice the tail integral up to x_{m}, where x_{m} is the measured value of x, rather than x_{α/2}). Strictly, one shouldn't use the p-value, but P(x_{m} | H_{0}, I). We use α to evaluate the method, and so integrate over all possible outcomes that would be registered as findings, while we use P(x_{m}) to investigate a particular experiment – there is only one outcome, so no integration is required. I'm fairly sure Wacholder et al. are aware of this: their goal seems to be to provide an approximate methodology capable of salvaging something useful from an almost ubiquitous and highly flawed set of statistical practices. In this regard, I think they can probably be credited with having made a valuable contribution. The problem, though, is that for a specific data set, P(D | H_{A}, I) is not the same as 1 − β, and cannot be determined. The correct procedure, of course, requires resolving H_{A} into a set of quantifiable hypotheses and calculating the appropriate sampling distributions for each of them.
Posterior probability for H_{0} following a statistically significant result, with α set at 0.05, plotted vs the prior probability that H_{0} is false.
We can see
readily from the above plot that the posterior probability varies hugely, even for a fixed α, and
that α alone (or the pvalue) is next to useless for predicting it. As the
prior probability gets smaller, the posterior error probability associated with
a positive finding approaches unity. Looking closely at a prior of 0.01, which is generous for many experiments (especially, for example, large-scale
genomics studies, where the data and their analysis are now cheap enough to
permit almost every conceivable relationship to be tested, regardless of whether or not it is suggested by other known facts), we can see that for a
low-power test, with β = 0.8, P(H_{0} | D_{f}, I) is over 96%.
So we crank up the number of samples, improve our instruments, do everything we
can to reduce the experimental noise, until, miracle of miracles, we have
reduced the false negative rate to almost zero. What is P(H_{0} | D_{f}, I)
now? Still 83%. Bugger.
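Both numbers follow directly from Bayes' theorem, as in equation (1); a minimal check in Python, using the same α = 0.05 and prior of 0.01 for H_{A}:

```python
def p_h0_given_finding(alpha, beta, prior_ha):
    """Posterior probability of H0, given a statistically significant
    result: alpha * P(H0) / (alpha * P(H0) + (1 - beta) * P(HA))."""
    p_h0 = 1.0 - prior_ha
    return alpha * p_h0 / (alpha * p_h0 + (1.0 - beta) * prior_ha)

# alpha = 0.05, prior probability of a real effect = 0.01:
print(round(p_h0_given_finding(0.05, beta=0.8, prior_ha=0.01), 3))  # 0.961
print(round(p_h0_given_finding(0.05, beta=0.0, prior_ha=0.01), 3))  # 0.832
```

Driving β to zero shifts the posterior only from about 96% to about 83%: with a low prior, α dominates the damage.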
This is not to
say that the experiments are not worth doing. Science evidently makes
tremendous advances despite these difficulties, and the technology that follows
from the science is the most obvious proof of this. (Besides, any substantial
success in reducing β will also permit an associated reduction of α.) What it
does mean, however, is that the standard ways of reporting ‘findings,’ with
alphas and p-values, are desperately inadequate. They fail to represent the
information that a data set contains and to convey what should be our rational
degree of belief in a hypothesis, given the empirical evidence available.
Evaluation of this information content and elucidation of these rational
degrees of belief (probabilities) should be the goal of every scientist, and
the communication of these things should be viewed as a privilege to take
delight in.
Update (19-5-2012)
Previously I stated that:
'Colhoun et al.^{4} found that 95% of reported findings concerning the genetic causes of disease are subsequently found to be false.'
This was based on a statement by Wacholder et al.^{3} that:
'Colhoun et al. estimated the fraction of false-positive findings in studies of association between a genetic variant and a disease to be at least .95.'

I have subsequently checked this paper by Colhoun et al., and could not find this estimate. I have adjusted the text accordingly, adding new references that support my original point. I apologize for the error. My only excuse is that, locked as it was behind a paywall, I was unable to access this paper for fact checking before making a special trip to my local university library. I still recommend the Colhoun et al. paper for their discussion of the unsatisfactory nature of evidence evaluation in their field.
[1] J.P. Ioannidis et al. ‘Replication validity of genetic association studies,’ Nature Genetics 2001, 29 (p. 306)
[2] J.N. Hirschorn et al. ‘A comprehensive review of genetic association studies,’ Genetics in Medicine 2002, 4 (p. 45)
[3] S. Wacholder, et al. ‘Assessing the probability that a
positive report is false: an approach for molecular
epidemiology studies,’ Journal of the National Cancer Institute, Vol. 96, No. 6
(p. 434), March 17, 2004.
[4] H.M. Colhoun et al. ‘Problems of reporting genetic associations with complex outcomes,’ Lancet 2003 361:865–72