## Saturday, November 16, 2013

### The Acid Test of Indifference

In recent posts, I've looked at the interpretation of the Shannon entropy, and the justification for the maximum entropy principle in inference under uncertainty. In the latter case, we looked at how mathematical investigation of the entropy function can help with establishing prior probability distributions from first principles.

There are some prior distributions, however, that we know automatically, without having to give the slightest thought to entropy. If the maximum entropy principle is really going to work, the first thing it has got to be able to do is to reproduce those distributions that we can deduce already, using other methods.

## Friday, November 1, 2013

### Monkeys and Multiplicity

Monkeys love to make a mess. Monkeys like to throw stones. Give a monkey a bucket of small pebbles, and before too long, those pebbles will be scattered indiscriminately in all directions. These are true facts about monkeys, facts we can exploit for the construction of a random number generator.

Set up a room full of empty buckets. Add one bucket full of pebbles and one mischievous monkey. Once the pebbles have been scattered, the number of little stones in each bucket is a random variable. We're going to use this random number generator for an unusual purpose, though. In fact, we could call it a 'calculus of probability,' because we're going to use this exotic apparatus for figuring out probability distributions from first principles¹.
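The scheme is easy to sketch in code. Here is an illustrative simulation (my own construction, not from the original post): each pebble lands in a uniformly random bucket, and the multiplicity W = M!/(n1!·n2!·…) counts how many distinct ways a given set of occupation numbers can arise.

```python
import math
import random

def log_multiplicity(counts):
    # log W = log(M! / (n1! * n2! * ...)): the number of distinct
    # ways the pebbles could have landed to give these occupations.
    m = sum(counts)
    return math.lgamma(m + 1) - sum(math.lgamma(n + 1) for n in counts)

def scatter(pebbles, buckets, rng):
    # One round of monkey business: each pebble lands in a
    # uniformly random bucket.
    counts = [0] * buckets
    for _ in range(pebbles):
        counts[rng.randrange(buckets)] += 1
    return counts

rng = random.Random(0)
counts = scatter(1000, 10, rng)
print(counts)

# An even split can arise in vastly more ways than a lopsided one,
# which is why the observed counts cluster near uniformity.
print(log_multiplicity([100] * 10) > log_multiplicity([910] + [10] * 9))  # True
```

The occupation numbers hover near 100 per bucket because that is where the multiplicity is overwhelmingly concentrated, which is the link to probability distributions that the post goes on to exploit.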

## Saturday, October 26, 2013

### Entropy Games

In 1948, Claude Shannon, an electrical engineer working at Bell Labs, was interested in the problem of communicating messages along physical channels, such as telephone wires. He was particularly interested in issues like how many bits of data are needed to communicate a message, how much redundancy is appropriate when the channel is noisy, and how much a message can be safely compressed.

In that year, Shannon figured out¹ that he could mathematically specify the minimum number of bits required to convey any message. You see, every message, every proposition, in fact, whether actively digitized or not, can be expressed as some sequence of answers to yes/no questions, and every string of binary digits is exactly that: a sequence of answers to yes/no questions. So if you know the minimum number of bits required to send a message, you know everything you need to know about the amount of information it contains.
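Shannon's measure can be stated in one line. The sketch below (my own illustration of the standard formula, H = −Σ p·log₂ p) computes the minimum average number of bits per symbol for a given distribution:

```python
import math

def shannon_entropy(probs):
    # Minimum average number of bits per symbol needed to convey
    # messages drawn from this distribution: H = -sum(p * log2(p)).
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin needs a full bit per toss...
print(shannon_entropy([0.5, 0.5]))    # 1.0
# ...while a heavily biased coin can be compressed well below 1 bit.
print(shannon_entropy([0.99, 0.01]))  # ≈ 0.0808
```

The second number is the sense in which a predictable source carries less information: most of its yes/no questions are nearly answered in advance.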

## Friday, October 18, 2013

### Entropy of Kangaroos

All this discussion of scientific method, the deep roots of probability theory, mathematics, and morality is well and good, but what about kangaroos? As I'm sure most of my more philosophically sophisticated readers appreciate, kangaroos play a necessarily central and vital role in any valid epistemology. To celebrate this fact, I'd like to consider a mathematical calculation that first appeared in the image-analysis literature, just coming up to 30 years ago¹. I'll paraphrase the original problem in my own words:

We all know that two thirds of kangaroos are right handed, and that one third of kangaroos drink beer (the remaining two thirds preferring whisky). These are true facts. What is the probability that a randomly encountered kangaroo is a left-handed beer drinker? Find a unique answer.
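To anticipate where the entropy machinery leads: with only the two marginals imposed, every admissible joint distribution can be written with a single free cell, z = P(left-handed & beer drinker), and maximizing the entropy over z picks out the independent assignment, 1/3 × 1/3 = 1/9. A small sketch (the parametrization is my own):

```python
import math

def entropy(ps):
    # Shannon entropy of a discrete distribution.
    return -sum(p * math.log(p) for p in ps if p > 0)

def joint(z):
    # 2x2 joint over handedness x drink with the stated marginals
    # (1/3 left-handed, 1/3 beer); z = P(left-handed & beer) is the
    # single remaining degree of freedom.
    return [z, 1/3 - z, 1/3 - z, 1/3 + z]

# Grid search over the admissible range 0 < z < 1/3 for the
# entropy-maximizing value of the free cell.
best = max((i / 10**5 * (1 / 3) for i in range(1, 10**5)),
           key=lambda z: entropy(joint(z)))
print(best)  # ≈ 1/9: maximum entropy picks the independent assignment
```

Any other value of z would amount to asserting a correlation between handedness and drinking habits that the stated facts give us no license to assume.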

## Friday, October 11, 2013

### No Such Thing as a Probability for a Probability

In the previous post, I discussed a problem of parameter estimation, in which the parameter of interest is a frequency: the relative frequency with which some data-generating process produces observations of some given type. In the example I chose (mathematically equivalent to Laplace's sunrise problem), we assumed a frequency that is fixed in the long term, and we assumed logical independence between successive observations. As a result, the frequency with which the process produces X, if known, has the same numerical value as the probability that any particular event will be an X. Many authors covering this problem exploit this correspondence, and describe the sought-after parameter directly as a probability. This seems to me to be confusing, unnecessary, and incorrect.

We perform parameter estimation by calculating probability distributions, but if the parameter we are after is itself a probability, then we have the following weird riddle to solve: What is a probability for a probability? What could this mean?

A probability is a rational account of one's state of knowledge, contingent upon some model. Subject to the constraints of that model (e.g. the necessary assumption that probability theory is correct), there is no wiggle room with regard to a probability - its associated distribution, if such existed, would be a two-valued function, being everywhere either on or off, and being on in exactly one location. What I have described, however, is not a probability distribution, as the probability at a discrete location in a continuous hypothesis space has no meaning. This opens up a few potential philosophical avenues, but in any case, this 'distribution' is clearly not the one the problem was about, so we don't need to pursue them.

In fact, we never need to discuss the probability for a probability. Where a probability is obtained as the expectation of some other nuisance parameter, that parameter will always be a frequency. To begin to appreciate the generality of this, suppose I'm fitting a mathematical function, y = f(x), with model parameters, θ, to some set of observed data pairs, (x, y). None of the θi can be a probability, since each (x, y) pair is a real observation of some actual physical process - each parameter is chosen to describe some aspect of the physical nature of the system under scrutiny.

Suppose we ask a question concerning the truth of a proposition, Q: "If x is 250, y(x) is in the interval, a = [a1, a2]."

We proceed first to calculate the multi-dimensional posterior distribution over θ-space. Then we evaluate at each point in θ-space the probability distribution for the frequency with which y(250) ∈ [a1, a2]. If y(x) is deterministic, this frequency is simply 1 or 0 at each point in θ-space. Regardless of whether or not y is deterministic, the product of this function with the distribution, P(θ), gives the probability distribution over (f, θ), and the integral over this product is the final probability for Q. We never needed a probability distribution over probability space, only over f and θ space, and since every inverse problem in probability theory can be expressed as an exercise in parameter estimation, we have highly compelling reasons to say that this will always hold.
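The recipe, a posterior over θ followed by an indicator (for a deterministic model) integrated against it, can be sketched for a toy linear fit. Everything here (the data pairs, the noise level, the interval, the grid) is invented for illustration:

```python
import math

# Invented data pairs and an assumed Gaussian noise level.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
sigma = 0.5

def log_likelihood(t0, t1):
    # Deterministic model y = t0 + t1 * x, plus Gaussian noise on y.
    return sum(-((y - (t0 + t1 * x)) ** 2) / (2 * sigma ** 2)
               for x, y in data)

# Posterior over a grid in theta-space (flat prior; normalized below).
grid = [(t0 / 100, t1 / 100)
        for t0 in range(-200, 201) for t1 in range(100, 301)]
weights = [math.exp(log_likelihood(t0, t1)) for t0, t1 in grid]
total = sum(weights)

# P(Q), with Q: "if x is 250, y(x) is in [a1, a2]".  The model being
# deterministic, the frequency at each theta is an indicator, and the
# final probability is its expectation under the posterior.
a1, a2 = 400.0, 600.0
p_q = sum(w for (t0, t1), w in zip(grid, weights)
          if a1 <= t0 + t1 * 250 <= a2) / total
print(p_q)
```

No distribution over probability space appears anywhere: the only distributions are over θ and over frequency, just as the argument requires.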

It might seem as though multi-level, hierarchical modeling presents a counter example to this. In the hierarchical case, the function y(x) (or some function higher still up the ladder) becomes itself one of several possibilities in some top-level hypothesis space. We may, for example, suspect that our data pairs could be fitted by either a linear function, or a quadratic, in which case our job is to find out which is more suitable. In this case, the probability that y(250) is in some particular range depends on which fitting function is correct, which is itself expressible as a probability distribution, and we seem to be back to having a probability for a probability.

But every multi-level model can be expressed as a simple parameter estimation problem. For a fitting function, yA(x), we might have parameters θA = {θA1, θA2, ....}, and for another function, yB(x), parameters θB = {θB1, θB2, ....}. The entire problem is thus mathematically indistinguishable from a single parameter estimation problem with θ = {θA1, θA2, ...., θB1, θB2, ...., θN}, where θN is an additional hypothesis specifying the name of the true fitting function. By the above argument, none of the θ's here can be a probability. (What does θB1 mean in model A? It is irrelevant: for a given point in the sub-space, θA, the probability is uniform over θB.)

Often, though, it is conceptually advantageous to use the language of multi-level modeling. In fact, this is exactly what happened previously, when we studied various incarnations of the sunrise problem. Here is how we coped:

We had a parameter (see previous post), which we called A, denoting the truth value of some binary proposition. That parameter was itself determined by a frequency, f, for which we devised a means to calculate a probability distribution. When we needed to know the probability that a system with internal frequency, f, would produce 9 events of type X in a row, we made use of the logical independence of subsequent events to say that P(X) is numerically the same as f (the Bernoulli urn rule). Thus, we were able to make use of the laws of probability (the product rule in this case) to calculate P(9 in a row | this f is temporarily assumed correct) = f⁹. Under the assumptions of the model, therefore, for any assumed f, the value f⁹ is the frequency with which this physical process produces 9 X's out of 9 samples, and our result was again an expectation over frequency space (though this time a different frequency). We actually made 2 translations: from frequency to probability and then from probability back to frequency, before calculating the final probability. It may seem unnecessarily cumbersome, but by doing this, we avoid the nonsense of a probability for a probability.
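The frequency-to-probability translation is easy to check by simulation: under independence, a process with fixed internal frequency f produces 9 X's out of 9 samples with frequency f⁹. The value f = 0.8 below is just an illustrative choice:

```python
import random

rng = random.Random(1)
f = 0.8  # assumed internal frequency of the process

# For a fixed f, independence means P(9 X's in a row) = f**9:
# simulate many blocks of 9 draws and count the all-X blocks.
trials = 200_000
hits = sum(all(rng.random() < f for _ in range(9))
           for _ in range(trials))
print(abs(hits / trials - f ** 9) < 0.01)  # True
```

The product rule does the whole job once f is assumed; the remaining work in the post is the expectation of f⁹ over the posterior for f.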

(There are at least 2 reasons why I think avoiding such nonsense is important. Firstly, when we teach, we should avoid our students harboring the justified suspicion that we are telling them nonsense. The student does not have to be fully conscious that any nonsense was transmitted, for the teaching process to be badly undermined. Secondly, when we do actual work with probability calculus, there may be occasions when we solve problems of an exotic nature, where arming ourselves with normally harmless nonsense could lead to a severe failure of the calculation, perhaps even seeming to produce an instance where the entire theory implodes.)

What if nature is telling us that we shouldn't impose the assumption of logical independence? No big deal, we just need to add a few more gears to the machine. For example, we might introduce some high-order autoregression model to predict how an event depends on those that came before it. Such a model will have a set of n + 1 coefficients, but for each point in the space of those coefficients, we will be able to form the desired frequency distribution. We can then proceed to solve the problem: with what frequency does this system produce an X, given that the previous n events were thing1, thing2, .... The frequency of interest will typically be different to the global frequency for the system (if such exists), but the final probability will always be an expectation of a frequency.
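A first-order version of this idea is easy to sketch: let the frequency of an X depend on the previous event (the two transition values below are invented for illustration). The conditional frequency then differs from the global frequency, exactly as described:

```python
import random

rng = random.Random(2)

# Illustrative transition frequencies: the chance of an X depends
# on whether the previous event was an X.
p_x_after_x, p_x_after_o = 0.9, 0.2

seq = [1]
for _ in range(100_000):
    p = p_x_after_x if seq[-1] else p_x_after_o
    seq.append(1 if rng.random() < p else 0)

# The conditional frequency of X after an X differs from the global
# frequency of X, so the relevant expectation uses the former.
after_x = [b for a, b in zip(seq, seq[1:]) if a == 1]
print(abs(sum(after_x) / len(after_x) - 0.9) < 0.01)  # True
print(abs(sum(seq) / len(seq) - 0.9) > 0.05)          # True: they differ
```

The final probability is still an expectation of a frequency; it is just a different frequency, conditioned on the recent history, as the text says.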

The same kind of argument applies if subsequent events are independent, but f varies with time in some other way. There is no level of complexity that changes the overall thesis.

It might look like we have strayed dangerously close to the dreaded frequency interpretation of probability, but really we haven't. As I pointed out in the linked-to glossary article, every probability can be considered an expected frequency, but owing to the theory-ladenness of the procedure that arrives at those expected frequencies, whenever we reach the designated top level of our calculation, we are prevented from identifying probability with actual frequency. To make this identification is to claim to be omniscient. It is thus incorrect to talk, as some authors do, of physical probabilities, as opposed to epistemic probabilities.

## Saturday, October 5, 2013

### Error Bars for Binary Parameters

Propositions about real phenomena are either true or false. For some logical proposition, e.g. "there is milk in the fridge", let A be the binary parameter denoting its truth value. Now, truth values are not in the habit of marching themselves up to us and announcing their identity. In fact, for propositions about specific things in the real world, there is normally no way whatsoever to gain direct access to these truth values, and we must make do with inferences drawn from our raw experiences. We need a system, therefore, to assess the reliability of our inferences, and that system is probability theory. When we do parameter estimation, a convenient way to summarize the results of the probability calculations is the error bar, and it would seem to be necessary to have some corresponding tool to capture our degree of confidence when we estimate a binary parameter, such as A. But what could this error bar possibly look like? The hypothesis space consists of only two discrete points, and there isn't enough room to convey the required information.

Let me pose a different question: how easy is it to change your mind? One of the important functions of probability theory is to quantify evidence in terms of how easy it would be for future evidence to change our minds. Suppose I stand at the side of a not-too-busy road, and wonder in which direction the next car to pass me will be travelling. Let A now represent the proposition that any particular observed vehicle is travelling to the left. Suppose that, upon my arrival at the scene, I'm in a position of extreme ignorance about the patterns of traffic on the road, and that my ignorance is best represented (for symmetry reasons) by indifference, and my resulting probability estimate for A is 50%.

Suppose that after a large number of observations in this situation, I find that almost equal numbers of vehicles have been going right as have been going left. This results in a probability assignment for A that is again 50%. Here's the curious thing, though: in my initial state of indifference, only a small number of observations would have been sufficient for me to form a strong opinion that the frequency with which A is true, fA, is close to either 0 or 1. But now, having made a large number of observations, I have accumulated substantial evidence that fA is in fact close to 0.5, and it would take a comparably large number of observations to convince me otherwise. The appropriate response to possible future evidence has changed considerably, but I used the same number, 50%, to summarize my state of information. How can this be?

In fact, the solution is quite automatic. In order to calculate P(A), it is first necessary to assign a probability distribution over frequency space, P(fA). I did this in one of my earliest blog posts, in which I solved a thinly disguised version of Laplace's sunrise problem. Let's treat this traffic problem in the same way. My starting position in the traffic problem, indifference, meant that my information about the relative frequency with which an observed vehicle travels to the left was best encoded with a prior probability distribution that is the same value at all points within the hypothesis space. Let's assume also that we start with the conviction (from whatever source) that the frequency, fA, is constant in the long run and that consecutive events are independent. Laplace's solution (this is, yet again, identical to the sunrise problem he solved just over 200 years ago) provides a neat expression for P(A), known as the rule of succession (p is the probability that the next event is of type X, n is the number of observed occurrences of type-X events, and N is the total number of observed events):
 p = (n + 1) / (N + 2)
(1)
but his method follows the same route I took when predicting a person's behaviour from past observations: at each possible frequency (between 0 and 1) calculate P(fA) from Bayes' theorem, using the binomial distribution to calculate the likelihood function. The proposition A can be resolved into a set of mutually exclusive and exhaustive propositions about the frequency, fA, giving P(A) = P(A[f1 + f2 + f3 + ....]), so that the product rule, applied directly after the extended sum rule, means that the final assignment of P(A) consists of integrating over the product fA×P(fA), which we recognize as obtaining the expectation, ⟨fA⟩.
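Both routes to the rule of succession can be verified numerically. The sketch below computes the closed-form rule and, separately, the expectation of f under the posterior, with a crude Riemann sum standing in for the integral:

```python
def rule_of_succession(n, N):
    # Laplace's rule: probability the next event is type X after
    # observing n X's in N trials, starting from a uniform prior.
    return (n + 1) / (N + 2)

def expected_frequency(n, N, steps=100_000):
    # The same number, the long way round: the expectation of f under
    # the posterior P(f) proportional to f**n * (1 - f)**(N - n).
    num = den = 0.0
    for i in range(1, steps):
        f = i / steps
        w = f ** n * (1 - f) ** (N - n)
        num += f * w
        den += w
    return num / den

print(rule_of_succession(9, 9))                        # 10/11 ≈ 0.9091
print(abs(expected_frequency(9, 9) - 10 / 11) < 1e-3)  # True
```

The agreement is exact in the limit of a fine grid, since the expectation of a Beta(n+1, N−n+1) distribution is (n+1)/(N+2).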

The figure below depicts the evolution of the distribution, P(fA  | DI), for the first N observations, for several N. The data all come from a single sequence of binary uniform random variables, and the procedure follows equation (4), from my earlier article. We started, at N = 0, from indifference, and the distribution was flat. Gradually, as more and more data was added, a peak emerged, and got steadily sharper and sharper:

(The numbers on the y-axis are larger than 1, but that's OK because they are probability densities - once the curve is integrated, which involves multiplying each value by a differential element, df, the result is exactly 1.) The probability distribution, P(fA  | DI), is therefore the answer to our initial question: P(fA  | DI) contains all the information we have about the robustness of P(A) against new evidence, and we get our error bar by somehow characterizing the width of P(fA  | DI).
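The sharpening of P(fA | DI) can be reproduced directly: with a uniform prior and n 'left' observations out of N, the posterior is the beta density below (the normalization is written with log-gamma for numerical safety; the evaluation point f = 0.5 and the sample sizes are illustrative):

```python
import math

def posterior_density(f, n, N):
    # Beta posterior density for the frequency f after n of N 'left'
    # observations, from a uniform prior.  Valid for 0 < f < 1.
    log_norm = (math.lgamma(N + 2)
                - math.lgamma(n + 1) - math.lgamma(N - n + 1))
    return math.exp(log_norm + n * math.log(f) + (N - n) * math.log(1 - f))

# At the peak (f = 0.5 for evenly split data) the density keeps
# growing as data accumulate: the distribution gets sharper, and its
# values exceed 1, which is fine for a probability density.
for N in (0, 10, 100, 1000):
    print(N, posterior_density(0.5, N // 2, N))
```

The width of this distribution, shrinking roughly as 1/√N, is exactly the error bar for the binary parameter that the post set out to find.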

Now, an important principle of probability theory requires that the order with which we incorporate different elements of the data, D, does not affect the final posterior supplied by Bayes' theorem. For D = {d1, d2, d3, ...}, we could work the initial prior over to a posterior using only d1, then, using this posterior as the new prior, repeat for d2, and so on through the list. We could do the same thing, only taking the d's in any order we choose. We could bundle them into sub-units, or we could process the whole damn lot in a single batch. The final probability assignment must be the same in each case. Violation of this principle would invalidate our theory (assuming there is no causal path, e.g. if I'm observing my own mental state, from knowledge of some of the d's to subsequent observed d's).

For example, each curve on the graph above shows the result from a single application of Bayes' theorem, though I could just as well have processed each individual observation separately, producing the same result. This works because the prior distribution is changing with each new bit of data added, gradually recording the combined effect of all the evidence. Each di becomes subsumed into the background information, I, before the next one is treated.

But we might have the feeling that something peculiar happens if we try to carry this principle over to the calculation of P(A | DI). What is the result of observing 9 consecutive cars travelling to the left? It depends what has happened before, obviously. Suppose D1 is now the result of 1 million observations, consisting of exactly 500,000 vehicles moving in each direction. The posterior assignment is almost exactly 50%. Now I see D2, those 9 cars travelling to the left - what is the outcome? The new prior is 50%, the same as it was before the first observation.

What the hell is going on here? How do we account for the fact that these 9 vehicles have a much weaker effect on our rational belief now, than they would have done if they had arrived right at the beginning of the experiment? The outcome of Bayes' theorem is proportional to prior times likelihood: P(A | I)×P(D | AI). Looking at 2 very different situations, 9 observations after 1 million, and 9 observations after zero, the prior is the same, the proposition, A, is the same, and D is the same. The rule of succession with n = N = 9 gives the same result in each case. It seems like we have a problem. We might solve the problem by recognizing that the correct answer comes by first getting P(fA  | DI) then finding its expectation, but how did we recognize this? Is it possible that we rationally reached out to something external to probability theory to figure out that direct calculation of P(A | DI) would not work? Could it be that probability theory is not the complete description of rationality? (Whatever that means.)

Of course, such flights of fancy aren't necessary. The direct calculation of P(A | DI) works perfectly fine, as long as we follow the procedure correctly. Let's define 2 new propositions,

L = A = "the next vehicle to pass will be travelling to the left,"
R = "the next vehicle to pass will be travelling to the right."

With D1 and D2 as before:

D1  = "500,000 out of 1 million vehicles were travelling to the left"
D2  = "Additional to D1, 9 out of 9 vehicles were travelling to the left"

Background information is given by

I1 = "prior distribution over f is uniform, f is constant in the long run,
and subsequent events are independent"

From this we have the first posterior,

 P(L | D1I1) = 0.5
(2)

Now comes the crucial step, we must fully incorporate the information in D1

 I2 = I1D1
(3)

Now, after obtaining D2, the posterior for L becomes
 P(L | D2I2) = P(L | I2)P(D2 | LI2) / [ P(L | I2)P(D2 | LI2) + P(R | I2)P(D2 | RI2) ]
(4)

When we pose and solve a problem that's explicitly about the frequency, f, of the data-generating process, we often don't pay much heed to the updating of I in equation (3), because it is mathematically irrelevant to the likelihood, P(D2 | fD1I1). Assuming a particular value for the frequency renders all the information in D1 powerless to influence this number. But if we are being strict, we must make this substitution, as I is necessarily defined as all the information we have relevant to the problem, apart from the current batch of data (D2, in this case).

The priors in equation (4) are equal, so they cancel out. The likelihood is not hard to calculate, remember what it means: the probability to see 9 out of 9 travelling to the left, given that 500,000 out of 1,000,000 were travelling to the left, previously, and given that the next one will be travelling to the left. That is, what is the probability to have 9 out of 9 travelling to the left, given that in total n = 500,001 out of N = 1,000,001 travel to the left. We can use the same procedure as before to calculate the probability distribution over the possible frequencies, P(f | LI2). For any given frequency, the assumption of independence in I1 means that the only information we have about the probability for any given vehicle's direction is this frequency, and so the probability and the frequency have the same numerical value. This means that for any assumed frequency, the probability to have 9 in a row going to the left is f⁹, from the product rule. But since we have a probability distribution over a range of frequencies, we take the expectation by integrating over the product P(f)×f⁹.

We can do that integration numerically, and we get a small number: 0.00195321. The counterpart of the likelihood, the one conditioned on R rather than L, is obtained by an analogous process. It produces another small, but very similar number: 0.00195318. From these numbers, the ratio in equation (4) gives 0.5000045, which does not radically disagree with the 0.5000005 we already had. (For comparison, if N = n = 9 was the complete data set, the result would be P(L) = 0.9091, as you can easily confirm.) Thus, when we do the calculation properly, a sample of only 9 makes almost no difference after a sample of 1 million, and peace can be restored in the cosmos.

Using the same procedure, we can confirm also that combining D1 and D2 into a single data set, with N = 1,000,009 and n = 500,009, gives precisely the same outcome for P(L | DI), 0.5000045, exactly as it must.
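All of these numbers can be reproduced without numerical integration, using the closed-form moments of the beta distribution, E[f^m] = ∏(a+k)/(a+b+k) for k = 0..m−1 (a short sketch in plain Python):

```python
def expect_f_power(a, b, m):
    # E[f**m] under a Beta(a, b) distribution: the product of the
    # first m moment ratios.
    result = 1.0
    for k in range(m):
        result *= (a + k) / (a + b + k)
    return result

# Conditioning on L makes the data 500,001 of 1,000,001 to the left,
# giving a Beta(500002, 500001) posterior over f; conditioning on R
# instead gives Beta(500001, 500002).
like_L = expect_f_power(500_002, 500_001, 9)
like_R = expect_f_power(500_001, 500_002, 9)
print(round(like_L, 8))  # 0.00195321
print(round(like_R, 8))  # 0.00195318

posterior_L = like_L / (like_L + like_R)  # equal priors cancel
print(round(posterior_L, 7))  # 0.5000045

# The single-batch route agrees: the rule of succession applied to the
# combined data, n = 500,009 of N = 1,000,009, gives the same number.
print(abs(posterior_L - 500_010 / 1_000_011) < 1e-12)  # True
```

The last line is the order-independence principle made concrete: sequential and single-batch updating land on the same posterior.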

## Friday, September 13, 2013

### Is Rationality Desirable?

Seriously, why all this fuss about rationality and science, and all that? Can we be just as happy, or even more so, being irrational, as by being rational? Are there aspects of our lives where rationality doesn't help? Might rationality actually be a danger?

Think for a moment about what it means to desire something.

To desire something trivially entails desiring an efficient means to attain it. To desire X is to expect my life to be better if I add X to my possessions. To desire X, and not to avail of a known opportunity to increase my probability to add X to my possessions, therefore, is either (1) to do something counter to my desires, or (2) to desire my life not to get better. Number (2) is strictly impossible – a better life for me is, by definition, one in which more of my desires are fulfilled. Number (1) is incoherent – there can be no motivation for anyone to do anything against their own interests. Behaviour mode (1) is not impossible, but it can only be the result of a malfunction.

Let’s consider some complicating circumstances to check the robustness of this.
1. Suppose I desire a cigarette. Not to smoke a cigarette, however, is clearly in my interests. There is no contradiction here. Besides (hypothetically) wanting to smoke something, I also have other goals, such as a long healthy life, which are of greater importance to me. To desire a cigarette is to be aware of part of my mind that mistakenly thinks this will make my life better, even though in expectation, it will not. This is not really an example of desiring what I do not desire, because a few puffs of nicotine is not my highest desire – when all desires that can be compared on the same dimension are accounted for, the net outcome is what counts. Neither is it an example of acting against my desires if I turn down the offer of a smoke, for the same reason.

2. Suppose I desire to reach the top of a mountain, but I refuse to take the cable car that conveniently departs every 30 minutes, preferring instead to scale the steep and difficult cliffs by hand and foot. Simplistically, this looks like genuinely desiring to not avail of an efficient means to attain my desires, but in reality, it is clearly the case that reaching the summit is only part of the goal, another part being the pleasure derived from the challenging method of getting there.

Despite complications arising from the inner structure of our desires, therefore, for me to knowingly refuse to adopt behaviour that would increase my probability to fulfill my desires is undeniably undesirable. Now, behavior that we know increases our chances to get what we desire has certain general features. For example, it requires an ability to accumulate reliable information about the world. It is not satisfactory to take a wild guess at the best course of action, and just hope that it works. This might work, but it will not work reliably. My rational expectation to achieve my goal is no better than if I do nothing. Reliability begins to enter the picture when I can make informed guesses. I must be able to make reliable predictions about what will happen as a result of my actions, and to make these predictions, I need a model of reality with some fidelity. Not just fidelity, but known fidelity - to increase the probability to achieve my goals, I need a strategy that I have good reasons to trust.

It happens that there is a procedure capable of supplying the kinds of reliable information and models of reality that enable the kinds of predictions we desire to make, in the pursuit of our desires. Furthermore, we all know what it is. It is called scientific method. Remember the reliability criterion? This is what makes science scientific. The gold standard for assessing the reliability of a proposition about the real world is probability theory – a kind of reasoning from empirical experience. Thus the ability of science to say anything worthwhile about the structure of reality comes from its application of probability theory or any of several approximations that are demonstrably good in certain special cases. If there is something that is better than today’s science, then better is the result of a favorable outcome under probabilistic analysis (since 'better' implies 'reliably better'), thus, whatever it is, it is tomorrow’s science.

So, if I desire a thing, then I desire a means to maximize my expectation to get it, so I desire a means to make reliable predictions of the outcomes of my actions, meaning that I desire a model of the world in which I can justifiably invest a high level of belief, thus I desire to employ scientific method, the set of procedures best qualified to identify reliable propositions about reality. Therefore, rationality is desirable. Full stop.

We cannot expect to be as happy by being irrational as by being rational. We might be lucky, but by definition, we cannot rely on luck, and our desires entail also desiring reliable strategies.

Items (A) to (D), below, detail some subtleties related to these conclusions.

(A) Where’s the fun in that?

Seriously? Being rational is always desirable? Seems like an awfully dry, humorless existence, always having to consult a set of equations before deciding what to do!

What this objection amounts to is another example of item (2), from above, where the climber chooses to take the difficult route to the top of the mountain. What is really meant by a dry existence is something like elimination of pleasant surprises, spontaneity, and ad-hoc creativity, and that these things are actually part of what we value.

Of course, there are also unpleasant surprises possible, and we do value minimizing those. The capacity to increase the frequency of pleasant surprises, while not dangerously exposing ourselves to trouble is something that, of course, is best delivered through being rational. Being in a contained way irrational may be one of our goals, but as always, the best way to achieve this is by being rational about it. (I won’t have much opportunity to continue my pursuit of irrationality tomorrow, if I die recklessly today.)

(B) Sophistication effect

To be rational (and thus make maximal use of scientific method, as required by a coherent pursuit of our desires) means to make study of likely failure modes of human reasoning (if you are human). This reduces the probability of committing fallacies of reasoning yourself, thus increasing the probability that your model of reality is correct. But there is a recognized failure mode of human reasoning that actually results from increased awareness of failure modes of reasoning. It goes like this: knowing many of the mechanisms by which seemingly intelligent people can be misled by their own flawed heuristic reasoning methods makes it easy for me to hypothesize reasons to ignore good evidence, when it supports a proposition that I don’t like – “Oh sure, he says he has seen 20 cases of X and no cases of Y, but that’s probably a confirmation bias.”

Does this undermine my argument? Not at all. This is not really a danger of rationality. If anything, it is a danger of education (though one that, I confidently predict, a rational analysis will reveal to be insufficient to argue for reduced education). What has happened in the above example is, of course, itself a form of flawed reasoning: it is reasoning based on what I desire to be true, and thus isn't rational. It may be a pursuit of rationality that led me to reason in this way, but this is only because my quest has been (hopefully temporarily) derailed. Thus my desire to be rational (entailed trivially by my possession of desire for anything) makes it often desirable for me to have the support of like-minded rational people, capable of pointing out the error, when even the honest quest for reliable information leads me into a trap of fallacious inference.

(C) Where does it stop?

The assessment of probability is open-ended. If there is anything about probability theory that sucks, this is it, but no matter how brilliant the minds that come to work on this problem, no way around it can ever be found, in principle. It is just something we have to live with - pretending it's not there won't make it go away. What it means, though, is that no probability can be divorced from the model within which it is calculated. There is always a possibility that my hypothesis space does not contain a true hypothesis. For example, I can use probability theory to determine the most likely coefficients, A and B, in a linear model used to fit some data, but investigation of the linear model will say nothing about other possible fitting functions. I can repeat a similar analysis using, say, a three-parameter quadratic fit, and then decide which fitting model is the most likely using Ockham’s razor, but then what about some third candidate? Or what if the Gaussian noise model I used in my assessment of the fits is wrong? What if I suspect that some of the measurements in my data set are flawed? Perhaps the whole experiment was just a dream. These things can all be checked in essentially the same way as all the previously considered possibilities (using probability theory), but it is quite clear that the process can continue indefinitely.
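The linear-versus-quadratic decision can be made concrete. The quantity Ockham's razor operates on is the marginal likelihood: the likelihood averaged over the prior volume, so that an extra parameter dilutes a model's evidence unless the data demand it. Everything below (data, noise level, prior ranges, grid) is invented for illustration:

```python
import itertools
import math

# Invented, nearly linear data and an assumed noise level.
data = [(x, 1.0 + 2.0 * x + 0.1 * (-1) ** x) for x in range(6)]
sigma = 1.0

def log_like(params):
    # Polynomial model: y = params[0] + params[1]*x + params[2]*x**2 + ...
    return sum(-((y - sum(p * x ** i for i, p in enumerate(params))) ** 2)
               / (2 * sigma ** 2) for x, y in data)

def evidence(n_params, lo=-5.0, hi=5.0, steps=40):
    # Marginal likelihood with a uniform prior: the grid average of
    # the likelihood.  More parameters mean a larger volume to
    # average over, which penalizes needless complexity.
    h = (hi - lo) / steps
    axis = [lo + (i + 0.5) * h for i in range(steps)]
    values = [math.exp(log_like(params))
              for params in itertools.product(*[axis] * n_params)]
    return sum(values) / len(values)

e_linear, e_quadratic = evidence(2), evidence(3)
print(e_linear > e_quadratic)  # the simpler adequate model wins
```

The razor emerges automatically from the averaging; no extra penalty term has to be bolted on. The open-endedness remains, of course: nothing here tests the Gaussian noise assumption or a third candidate model.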

Rationality is thus a slippery concept: how much does it take to be rational? Since the underlying procedure of rationality, the calculation of probabilities, can always be improved by adding another level, won't it go on forever, precluding the possibility of ever reaching a decision?

To answer this, let us note that to execute a calculation capable of deciding how to achieve maximal happiness and prosperity for all of humanity and all other life on Earth is not a rational thing to do if the calculation is so costly that its completion results in the immediate extinction of all humanity and all other life on Earth.

Rationality is necessarily a reflexive process, both (as described above) in that it requires analysis of the potential failure modes of the particular hardware/software combination being utilized (awareness of cognitive biases), and in that it must try to monitor its own cost. Recall that rationality owes its ultimate justification to the fulfillment of desires. These desires necessarily supersede the desire to be rational itself. An algorithm designed to do nothing other than be rational would do literally nothing - so without a higher goal above it, rationality is literally nothing.

Thus, if the cost of the chosen rational procedure is expected to prevent the necessarily higher-level desire being fulfilled, then rationality dictates that the calculation be stopped (or better, not started). Furthermore, the (necessary) desire to employ a procedure that doesn't diminish the likelihood of achieving the highest goals entails a procedure capable of assessing and flagging when such a failure is likely.

(D) Going with your gut feeling

On a related issue, concerning again the contingency of a rational calculation (the lack of any guarantee that the hypothesis space actually contains a true hypothesis) and its potential difficulty: do we need to worry that computational cost and, ultimately, the possibility of being wrong in the end will make rationality uncompetitive with our innate capabilities of judgment? Only in a very limited sense.

Yes, we have superbly adapted computational organs, with efficiencies far exceeding any artificial hardware that we can so far devise, and capable of solving problems vastly more difficult than any rigorous probability-crunching machine that we can now build. And yes, it probably is rational under many circumstances to favor the rough-and-ready output of somebody's bias-ridden squishy brain over the hassle of a near-impossible but oh-so-rigorous calculation. But under what circumstances? Either, as noted, when the cost of the calculation prohibits the attainment of the ultimate goal, or when rationally evaluated empirical evidence indicates that it is probably safe to do so.

Human brain function is at least partially rational, after all. Our brains are adapted for and (I am highly justified in believing) quite successful at making self-serving judgments, a success that, as noted, is founded upon an ability to form a reliable impression of the workings of our environment. And, as also noted, the degree of rigor called for in any rational calculation is determined by the costs of the possible calculations, the costs of not doing the calculations, and the amount we expect to gain from them.
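That trade-off can be written down as a toy expected-value calculation. The function and all the numbers below are invented for illustration, not taken from the post: the expected net gain of running a rigorous calculation is the probability it changes our decision, times the payoff of that change, minus the cost of the calculation itself.

```python
def worth_computing(p_changes_decision, gain_if_changed, cost):
    """Expected net gain of running the rigorous calculation rather
    than going with the gut. Positive means the rigor pays for itself."""
    return p_changes_decision * gain_if_changed - cost

# A cheap analysis with a fair chance of flipping an important decision:
print(worth_computing(0.2, 100.0, 5.0))    # positive -> do the calculation

# An expensive analysis that is unlikely to change anything:
print(worth_computing(0.05, 100.0, 20.0))  # negative -> trust the heuristic
```

On this (deliberately crude) accounting, the heuristic wins whenever the calculation is costly relative to its chance of actually improving the decision, which is the circumstance the paragraph above describes.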

This is not to downplay the importance of the scientific method. Let me emphasize: a reliable estimate of when it is acceptable to rely on heuristics, rather than full-blown analysis, can only come from a rational procedure. The list of known cognitive biases that interfere with sound reasoning is unfortunately rather extensive, and presumably still growing. The science informs us that, rather often, our innate judgment is significantly less successful than a rational procedure would be.