
Gigerenzer versus Kahneman and Tversky: The 1996 face-off

Through the late 1980s and early 1990s, Gerd Gigerenzer and friends wrote a series of articles critiquing Daniel Kahneman and Amos Tversky’s work on heuristics and biases. They hit hard. As Michael Lewis wrote in The Undoing Project:

Gigerenzer had taken the same angle of attack as most of their other critics. But in Danny and Amos’s view he’d ignored the usual rules of intellectual warfare, distorting their work to make them sound even more fatalistic about their fellow man than they were. He also downplayed or ignored most of their evidence, and all of their strongest evidence. He did what critics sometimes do: He described the object of his scorn as he wished it to be rather than as it was. Then he debunked his description. … “Amos says we absolutely must do something about Gigerenzer,” recalled Danny. … Amos didn’t merely want to counter Gigerenzer; he wanted to destroy him. (“Amos couldn’t mention Gigerenzer’s name without using the word ‘sleazeball,’ ” said UCLA professor Craig Fox, Amos’s former student.) Danny, being Danny, looked for the good in Gigerenzer’s writings. He found this harder than usual to do.

Kahneman and Tversky’s response to Gigerenzer’s work was published in 1996 in Psychological Review. It was one of the blunter responses you will read in academic debates, as the following passages indicate. From the first substantive section of the article:

It is not uncommon in academic debates that a critic’s description of the opponent’s ideas and findings involves some loss of fidelity. This is a fact of life that targets of criticism should learn to expect, even if they do not enjoy it. In some exceptional cases, however, the fidelity of the presentation is so low that readers may be misled about the real issues under discussion. In our view, Gigerenzer’s critique of the heuristics and biases program is one of these cases.

And the close:

As this review has shown, Gigerenzer’s critique employs a highly unusual strategy. First, it attributes to us assumptions that we never made … Then it attempts to refute our alleged position by data that either replicate our prior work … or confirm our theoretical expectations … These findings are presented as devastating arguments against a position that, of course, we did not hold. Evidence that contradicts Gigerenzer’s conclusion … is not acknowledged and discussed, as is customary; it is simply ignored. Although some polemic license is expected, there is a striking mismatch between the rhetoric and the record in this case.

Below are my notes, put together on a 16-hour flight, on the claims and counterclaims across Gigerenzer’s articles, the Kahneman and Tversky response in Psychological Review, and Gigerenzer’s rejoinder in the same issue. This represents my attempt to get my head around the debate and to understand the degree to which the heat is justified, not to give a final judgment (although I do show my leanings). I don’t cover work published after the 1996 articles, although that might be for another day.

I will use Gigerenzer’s or Kahneman and Tversky’s own words to make their arguments where I can. The core articles I refer to are:

Gigerenzer (1991) How to Make Cognitive Illusions Disappear: Beyond “Heuristics and Biases” (pdf)

Gigerenzer (1993) The bounded rationality of probabilistic mental models (pdf)

Kahneman and Tversky (1996) On the Reality of Cognitive Illusions (pdf)

Gigerenzer (1996) On Narrow Norms and Vague Heuristics: A Reply to Kahneman and Tversky (1996) (pdf)

Kahneman and Tversky (1996) Postscript (at the end of their 1996 paper)

Gigerenzer (1996) Postscript (at the end of his 1996 paper)

I recommend reading those articles, along with Kahneman and Tversky’s classic Science article (pdf), as background. (And note that the debate below and Gigerenzer’s critique relate to only two of the 12 “biases” covered in that paper.)

I touch on four of Gigerenzer’s arguments (using most of my word count on the first), although there are numerous other fronts:

  • Argument 1: Does the use of frequentist rather than probabilistic representations make many of the so-called biases disappear? Despite appearances, Kahneman, Tversky and Gigerenzer largely agree on the answer to this question. However, it was largely Gigerenzer’s work that brought this to my attention, so there was clearly some value (for me) to Gigerenzer’s focus.
  • Argument 2: Can you attribute probabilities to single events? Gigerenzer says no. Here there is a fundamental disagreement. On whether this point is fatal to their work, I largely side with Kahneman and Tversky.
  • Argument 3: Are Kahneman and Tversky’s norms content blind? For particular examples, yes. Generally? No.
  • Argument 4: Should more effort be expended in understanding the underlying cognitive processes or mental models behind these various findings? This is where Gigerenzer’s argument is strongest, and I agree that many of Kahneman and Tversky’s proposed heuristics have weaknesses that need examination.

Putting these four together, I have sympathy for Gigerenzer’s way of thinking and ultimate program of work, but I am much less sympathetic to his desire to pull down Kahneman and Tversky’s findings on the way.

Now into the details.

Argument 1: Does the use of frequentist rather than probabilistic representations make many of the so-called biases disappear?

Gigerenzer argues that many biases involving probabilistic decision-making can be “made to disappear” by framing the problems in terms of frequencies rather than probabilities. The back-and-forth on this point centres on three major biases: overconfidence, the conjunction fallacy and base-rate neglect. I’ll take each in turn.

Overconfidence

A typical question from the overconfidence literature reads as follows:

Which city has more inhabitants?

(a) Hyderabad, (b) Islamabad

How confident are you that your answer is correct?

50% 60% 70% 80% 90% 100%

After answering many questions of this form, the usual finding is that when people are 100% confident that they have the correct answer, they might be correct only 80% of the time. When 80% confident, they might get only 65% correct. This discrepancy is often called “overconfidence”. [I’ve written elsewhere about the need to disambiguate different forms of overconfidence.]
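As a rough sketch of how that comparison is typically made (my illustration only; the confidence levels and answers below are invented, not data from any of the studies discussed):

```python
# Illustrative calibration check: group answers by stated confidence and
# compare the stated confidence with the actual proportion correct.
# The responses below are invented for demonstration.
from collections import defaultdict

responses = [  # (stated confidence, answer was correct?)
    (1.0, True), (1.0, True), (1.0, True), (1.0, True), (1.0, False),
    (0.8, True), (0.8, True), (0.8, False), (0.8, False),
]

by_confidence = defaultdict(list)
for confidence, correct in responses:
    by_confidence[confidence].append(correct)

for confidence in sorted(by_confidence):
    outcomes = by_confidence[confidence]
    hit_rate = sum(outcomes) / len(outcomes)
    # "Overconfidence" is the gap when stated confidence exceeds the hit rate.
    print(f"stated {confidence:.0%} -> correct {hit_rate:.0%}")
```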

There are numerous explanations for this overconfidence, such as confirmation bias, although in Gigerenzer’s view this is “a robust fact waiting for a theory”.

But what if we take a different approach to this problem? Gigerenzer (1991) writes:

Assume that the mind is a frequentist. Like a frequentist, the mind should be able to distinguish between single-event confidences and relative frequencies in the long run.

This view has testable consequences. Ask people for their estimated relative frequencies of correct answers and compare them with true relative frequencies of correct answers, instead of comparing the latter frequencies with confidences.

He tested this idea as follows:

Subjects answered several hundred questions of the Islamabad-Hyderabad type … and in addition, estimated their relative frequencies of their correct answers. …

After a set of 50 general knowledge questions, we asked the same subjects, “How many of these 50 questions do you think you got right?”. Comparing their estimated frequencies with actual frequencies of correct answers made “overconfidence” disappear. …

The general point is (i) a discrepancy between probabilities of single events (confidences) and long-run frequencies need not be framed as an “error” and called “overconfidence bias”, and (ii) judgments need not be “explained” by a flawed mental program at a deeper level, such as “confirmation bias”.

Kahneman and Tversky agree:

May (1987, 1988) was the first to report that whereas average confidence for single items generally exceeds the percentage of correct responses, people’s estimates of the percentage (or frequency) of items that they have answered correctly is generally lower than the actual number. … Subsequent studies … have reported a similar pattern although the degree of underconfidence varied substantially across domains.

Gigerenzer portrays the discrepancy between individual and aggregate assessments as incompatible with our theoretical position, but he is wrong. On the contrary, we drew a distinction between two modes of judgment under uncertainty, which we labeled the inside and the outside views … In the outside view (or frequentistic approach) the case at hand is treated as an instance of a broader class of similar cases, for which the frequencies of outcomes are known or can be estimated. In the inside view (or single-case approach) predictions are based on specific scenarios and impressions of the particular case. We proposed that people tend to favor the inside view and as a result underweight relevant statistical data. …

The preceding discussion should make it clear that, contrary to Gigerenzer’s repeated claims, we have neither ignored nor blurred the distinction between judgments of single and of repeated events. We proposed long ago that the two tasks induce different perspectives, which are likely to yield different estimates, and different levels of accuracy (Kahneman and Tversky, 1979). As far as we can see, Gigerenzer’s position on this issue is not different from ours, although his writings create the opposite impression.

So we leave this point with a degree of agreement.

Conjunction fallacy

The most famous illustration of the conjunction fallacy is the “Linda problem”. Subjects are shown the following vignette:

Linda is 31 years old, single, outspoken and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in antinuclear demonstrations.

They are then asked which of the following two alternatives is more probable (either as just those two options, as part of a longer list of options, or across different experimental subjects):

Linda is a bank teller
Linda is a bank teller and is active in the feminist movement

In the original Tversky and Kahneman experiment, when shown only those two options, 85% of subjects chose the second. Tversky and Kahneman argued this was an error, as the probability of a conjunction of two events can never be greater than the probability of either of its constituents.
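Stated formally (my notation, not from the original paper), with T for “Linda is a bank teller” and F for “Linda is active in the feminist movement”, the conjunction rule is simply:

\[ P(T \wedge F) \le \min\{P(T),\, P(F)\} \]

so the conjunction can never be more probable than the “bank teller” option on its own.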

Once again, Gigerenzer reframes the problem for the frequentist mind (quoting from the 1996 article):

There are 100 persons who fit the description above (i.e. Linda’s). How many of them are:

(a) bank tellers
(b) bank tellers and active in the feminist movement.

As Gigerenzer states:

If the problem is phrased in this (or a similar) frequentist way, then the “conjunction fallacy” largely disappears.

The postulated representativeness heuristic cannot account for this dramatic effect.

Gigerenzer’s 1993 article expands on this latter point:

If the mind solves the problem using a representative heuristic, changes in representation should not matter, because they do not change the degree of similarity. … Subjects therefore should still exhibit the conjunction fallacy.

Kahneman and Tversky’s response starts with the note that their first demonstration of the conjunction fallacy involved judgments of frequency. They asked subjects:

to estimate the number of “seven-letter words of the form ‘_ _ _ _ _ n _’ in 4 pages of text.” Later in the same questionnaire, those subjects estimated the number of “seven-letter words of the form ‘_ _ _ _ i n g’ in 4 pages of text.” Because it is easier to think of words ending with “ing” than to think of words with “n” in the next-to-last position, availability suggests that the former will be judged more numerous than the latter, in violation of the conjunction rule. Indeed, the median estimate for words ending with “ing” was nearly three times higher than for words with “n” in the next-to-the-last position. This finding is a counter-example to Gigerenzer’s often repeated claim that conjunction errors disappear in judgments of frequency, but we have found no mention of it in his writings.

Here Gigerenzer stretches his defence of human consistency a step too far:

[T]he effect depends crucially on presenting the two alternatives to a participant at different times, that is, with a number (unspecified in their reports) of other tasks between the alternatives. This does not seem to be a violation of internal consistency, which I take to be the point of the conjunction fallacy.

Kahneman and Tversky also point out that they had studied the effect of frequencies in other contexts:

We therefore turned to the study of cues that may encourage extensional reasoning and developed the hypothesis that the detection of inclusion could be facilitated by asking subjects to estimate frequencies. To test this hypothesis, we described a health survey of 100 adult men and asked subjects, “How many of the 100 participants have had one or more heart attacks?” and “How many of the 100 participants both are over 55 years old and have had one or more heart attacks?” The incidence of conjunction errors in this problem was only 25%, compared to 65% when the subjects were asked to estimate percentages rather than frequencies. Reversing the order of the questions further reduced the incidence to 11%.

Kahneman and Tversky go on to state:

Gigerenzer has essentially ignored our discovery of the effect of frequency and our analysis of extensional cues. As primary evidence for the “disappearance” of the conjunction fallacy in judgments of frequency, he prefers to cite a subsequent study by Fiedler (1988), who replicated both our procedure and our findings, using the bank-teller problem. … In view of our prior experimental results and theoretical discussion, we wonder who alleged that the conjunction fallacy is stable under this particular manipulation.

Gigerenzer concedes, but then turns to Kahneman and Tversky’s lack of focus on this result:

It is correct that they demonstrated the effect on conjunction violations first (but not for overconfidence bias and the base-rate fallacy). Their accusation, however, is out of place, as are most others in their reply. I referenced their demonstration in every one of the articles they cited … It might be added that Tversky and Kahneman (1983) themselves paid little attention to this result, which was not mentioned once in some four pages of discussion.

A debate about who was first and how much focus each gave to the findings is not substantive, but Kahneman and Tversky (1996) do not leave this problem here. While frequency representations can reduce error when there is the possibility of direct comparison (the same subject sees and provides frequencies for both alternatives), they have less effect in between-subjects experimental designs; that is, where one set of subjects sees one of the options and another set of subjects sees the other:

Linda is in her early thirties. She is single, outspoken, and very bright. As a student she majored in philosophy and was deeply concerned with issues of discrimination and social justice.

Suppose there are 1,000 women who fit this description. How many of them are

(a) high school teachers?

(b) bank tellers? or

(c) bank tellers and active feminists?

One group of Stanford students (N = 36) answered the above three questions. A second group (N = 33) answered only questions (a) and (b), and a third group (N = 31) answered only questions (a) and (c). Subjects were provided with a response scale consisting of 11 categories in approximately logarithmic spacing. As expected, a majority (64%) of the subjects who had the opportunity to compare (b) and (c) satisfied the conjunction rule. In the between-subjects comparison, however, the estimates for feminist bank tellers (median category: “more than 50”) were significantly higher than the estimates for bank tellers … Contrary to Gigerenzer’s position, the results demonstrate a violation of the conjunction rule in a frequency formulation. These findings are consistent with the hypothesis that subjects use representativeness to estimate outcome frequencies and edit their responses to obey class inclusion in the presence of strong extensional cues.

Gigerenzer in part concedes, and in part battles on:

Hence, Kahneman and Tversky (1996) believe that the appropriate reply is to show that frequency judgments can also fail. There is no doubt about the latter …

[T]he between subjects version of the Linda problem is not a violation of internal consistency, because the effect depends on not presenting the two alternatives to the same subject.

It is right not to describe this as a violation of internal consistency, but as evidence of representativeness affecting judgement, even with frequentist representations, it makes a good case. It is also difficult to argue that the subjects are making a good judgement. Kahneman and Tversky write:

Gigerenzer appears to deny the relevance of the between-subjects design on the ground that no individual subject can be said to have committed an error. In our view, this is hardly more reasonable than the claim that a randomized between-subject design cannot demonstrate that one drug is more effective than another because no individual subject has experienced the effects of both drugs.

Kahneman and Tversky write further in the postscript, possibly conceding on language but not on their substantive point:

This formula will not do. Whether or not violations of the conjunction rule in the between-subjects versions of the Linda and “ing” problems are considered errors, they require explanation. These violations were predicted from representativeness and availability, respectively, and were observed in both frequency and probability judgments. Gigerenzer ignores this evidence for our account and offers no alternative.

I’m with Kahneman and Tversky here.

Base-rate neglect

Base-rate neglect (or the base-rate fallacy) describes situations where a known base rate of an event or characteristic in a reference population is under-weighted, with undue focus given to specific information on the case at hand. An example is as follows:

If a test to detect a disease whose prevalence is 1/1000 has a false positive rate of 5%, what is the chance that a person found to have a positive result actually has the disease, assuming you know nothing about the person’s symptoms or signs?

The typical result is that around half of the people asked will guess a probability of 95% (even among medical professionals), with less than a quarter giving the correct answer of 2%. The test result, which can be a false positive, is weighted too heavily relative to the base rate of one in a thousand.
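As a quick check of that 2% figure (my own working, assuming, as the vignette implies, that the test always detects the disease when it is present):

```python
# Bayes' rule applied to the disease-test vignette above.
# Assumes a perfectly sensitive test, i.e. P(positive | disease) = 1.
prevalence = 1 / 1000        # base rate of the disease
false_positive_rate = 0.05   # P(positive | no disease)
sensitivity = 1.0            # P(positive | disease), assumed

p_positive = prevalence * sensitivity + (1 - prevalence) * false_positive_rate
p_disease_given_positive = (prevalence * sensitivity) / p_positive
print(round(p_disease_given_positive, 3))  # 0.02, i.e. roughly 2%
```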

Gigerenzer (1991) once again responds with the potential of a frequentist representation to eliminate the bias, drawing on work by Cosmides and Tooby (1990) [The 1990 paper was an unpublished conference paper, but this work was later published here (pdf)]:

One out of 1000 Americans has disease X. A test has been developed to detect when a person has disease X. Every time the test is given to a person who has the disease, the test comes out positive. But sometimes the test also comes out positive when it is given to a person who is completely healthy. Specifically, out of every 1000 people who are perfectly healthy, 50 of them test positive for the disease.

Imagine that we have assembled a random sample of 1000 Americans. They were selected by a lottery. Those who conducted the lottery had no information about the health status of any of these people. How many people who test positive for the disease will actually have the disease? — out of —.

The result:

If the question was rephrased in a frequentist way, as shown above, then the Bayesian answer of 0.02 – that is, the answer “one out of 50 (or 51)” – was given by 76% of the subjects. The “base-rate fallacy” disappeared.
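The “50 (or 51)” reflects whether the one person who actually has the disease is counted among the positive tests. A minimal head-count version of the same arithmetic (my illustration, not from the paper):

```python
# The frequency framing as a simple head count.
population = 1000
true_positives = 1      # the one person with disease X; the test always flags them
false_positives = 50    # healthy people who nonetheless test positive

total_positives = true_positives + false_positives   # 51 positive tests in all
print(true_positives, "out of", total_positives)     # 1 out of 51, about 0.02
```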

Kahneman and Tversky (1996) do not respond to this particular example, beyond a footnote:

Cosmides and Tooby (1996) have shown that a frequentistic formulation also helps subjects solve a base-rate problem that is quite difficult when framed in terms of percentages or probabilities. Their result is readily explained in terms of extensional cues to set inclusion. These authors, however, prefer the speculative interpretation that evolution has favored reasoning with frequencies but not with percentages.

It seems we have agreement on the effect, although a differing interpretation.

Kahneman and Tversky, however, more directly attack the idea that people are natural frequentists:

He [Gigerenzer] offers a hypothetical example in which a physician in a nonliterate society learns quickly and accurately the posterior probability of a disease given the presence or absence of a symptom. … However, Gigerenzer’s speculation about what a nonliterate physician might learn from experience is not supported by existing evidence. Subjects in an experiment reported by Gluck and Bower (1988) learned to diagnose whether a patient has a rare (25%) or a common (75%) disease. For 250 trials the subjects guessed the patient’s disease on the basis of a pattern of four binary symptoms, with immediate feedback. Following this learning phase, the subjects estimated the relative frequency of the rare disease, given each of the four symptoms separately.

If the mind is “a frequency monitoring device,” as argued by Gigerenzer …, we should expect subjects to be reasonably accurate in their assessments of the relative frequencies of the diseases, given each symptom. Contrary to this naive frequentist prediction, subjects’ judgments of the relative frequency of the two diseases were determined entirely by the diagnosticity of the symptom, with no regard for the base-rate frequencies of the diseases. … Contrary to Gigerenzer’s unqualified claim, the replacement of subjective probability judgments by estimates of relative frequency and the introduction of sequential random sampling do not provide a panacea against base-rate neglect.

Gigerenzer (1996) responds:

Concerning base-rate neglect, Kahneman and Tversky … created the impression that there is little evidence that certain types of frequency formats improve Bayesian reasoning. They do not mention that there is considerable evidence (e.g., Gigerenzer & Hoffrage, 1995) and back their disclaimer principally with a disease-classification study by Gluck and Bower (1988), which they summarized thus: “subjects’ judgments of the relative frequency . . . were determined entirely by the diagnosticity of the symptom, with no regard for the base-rate frequencies of the diseases” … To set the record straight, Gluck and Bower said their results were consistent with the idea that “base-rate information is not ignored, only underused” (p. 235). Furthermore, their study was replicated and elaborated on by Shanks (1991), who concluded that “we have no conclusive evidence for the claim . . . that systematic base-rate neglect occurs in this type of situation” (p. 153). Adding up studies in which base-rate neglect appears or disappears will lead us nowhere.

Gigerenzer is right that Kahneman and Tversky were overly strong in their description of the findings of the Gluck and Bower study, but Gigerenzer’s conclusion seems close to that of Kahneman and Tversky. As Kahneman and Tversky wrote:

[I]t is evident that subjects sometimes use explicitly mentioned base-rate information to a much greater extent than they did in our original engineer-lawyer study [another demonstration of base-rate neglect], though generally less than required by Bayes’ rule.

Argument 2: Can you attribute probabilities to single events?

While I leave the question of frequency representations with a degree of agreement, Gigerenzer has a deeper critique of Kahneman and Tversky’s findings. From his 1993 article:

Is the conjunction fallacy a violation of probability theory? Has a person who chooses T&F violated probability theory? The answer is no, if the person is a frequentist such as Richard von Mises or Jerzy Neyman; yes, if he or she is a subjectivist such as Bruno de Finetti; and open otherwise.

The mathematician Richard von Mises, one of the founders of the frequency interpretation, used the following example to make his point:

We can say nothing about the probability of death of an individual even if we know his condition of life and health in detail. The phrase ‘probability of death’, when it refers to a single person, has no meaning at all for us. This is one of the most important consequences of our definition of probability.

(von Mises, 1957/1928: 11)

In this frequentist view, one cannot speak of a probability unless a reference class has been defined. … Since a person is always a member of many reference classes, no unique relative frequency can be assigned to a single person. … Thus, for a strict frequentist, the laws of probability are about frequencies and not about single events such as whether Linda is a bank teller. There, in this view, no judgement about single events can violate probability theory.

… Seen from the Bayesian point of view, the conjunction fallacy is an error.

Thus, choosing T&F in the Linda problem is not a reasoning error. What has been labelled the ‘conjunction fallacy’ here does not violate the laws of probability. It only looks so from one interpretation of probability.

He writes in his 1991 article somewhat more strongly (here talking in the context of overconfidence):

For a frequentist like the mathematician Richard von Mises, the term “probability”, when it refers to a single event, “has no meaning at all for us” … Probability is about frequencies, not single events. To compare the two means comparing apples with oranges.

Even the major opponents of the frequentists – subjectivists such as Bruno de Finetti – would not generally think of a discrepancy between confidence and relative frequency as a “bias”, albeit for different reasons. For a subjectivist, probability is about single events, but rationality is identified with the internal consistency of subjective probabilities. As de Finetti emphasized, “however an individual evaluates the probability of a particular event, no experience can prove him right, or wrong; nor, in general, could any conceivable criterion give any objective sense to the distinction one would like to draw, here, between right and wrong” …

Kahneman and Tversky address this argument across a few of the biases under debate. First, on conjunction errors:

Whether or not it is meaningful to assign a definite numerical value to the probability of survival of a specific individual, we submit (a) that this individual is less likely to die within a week than to die within a year and (b) that most people regard the preceding statement as true—not as meaningless—and treat its negation as an error or a fallacy.

In response, Gigerenzer makes an interesting point that someone asked that question might make a different inference:

One can easily create a context, such as a patient already on the verge of dying, that would cause a sensible person to answer that this patient is more likely to die within a week (inferring that the question is next week versus the rest of the year, because the question makes little sense otherwise). In the same fashion, the Linda problem creates a context (the description of Linda) that makes it perfectly valid not to conform to the conjunction rule.

I think Gigerenzer is right that if you treat the problem as content-blind you might miss the inference the subjects are drawing from the question (more on content-blind norms below). But conversely, Kahneman and Tversky’s general point appears sound.

Kahneman and Tversky also address this frequentist argument in relation to over-confidence:

Proper use of the probability scale is important because this scale is commonly used for communication. A patient who is informed by his surgeon that she is 99% confident in his complete recovery may be justifiably upset to learn that when the surgeon expresses that level of confidence, she is actually correct only 75% of the time. Furthermore, we suggest that both surgeon and patient are likely to agree that such a calibration failure is undesirable, rather than dismiss the discrepancy between confidence and accuracy on the ground that “to compare the two means comparing apples and oranges”

Gigerenzer’s response here is amusing:

Kahneman and Tversky argued that the reluctance of statisticians to make probability theory the norm of all single events “is not generally shared by the public” (p. 585). If this was meant to shift the burden of justification for their norms from the normative theory of probability to the intuitions of ordinary people, it is exceedingly puzzling. How can people’s intuitions be called upon to substitute for the standards of statisticians, in order to prove that people’s intuitions systematically violate the normative theory of probability?

Kahneman and Tversky did not come back on this particular argument, but several points could be made in their favour. First, and as noted above, there can still be errors under frequentist representations. Even if we discard the results with judgments of probability for single events, there is still a strong case for the use of heuristics leading to the various biases.

Second, if a surgeon states they are confident that someone has a 99% probability of complete recovery when they are right only 75% of the time, they are making one of two errors. Either they are making a probability estimate of a single event, which has no meaning at all according to Gigerenzer and von Mises, or they are poorly calibrated according to Kahneman and Tversky.

Third, whatever the philosophically or statistically correct position, we have a practical problem. We have judgements being made and communicated, with subsequent decisions based on those communications. To the extent there are weaknesses in that chain, we will have sub-optimal outcomes.

Putting this together, I feel this argument leaves us at a philosophical impasse, but Kahneman and Tversky’s angle provides scope for practical application and better outcomes. (Look at the training for the Good Judgment Project and associated improvements in forecasting that resulted).

Argument 3: Are Kahneman and Tversky’s norms content blind?

An interesting question about the norms against which Kahneman and Tversky assess the experimental subjects’ heuristics and biases is whether the norms are blind to the content of the problem. Gigerenzer (1996) writes:

[O]n Kahneman and Tversky’s (1996) view of sound reasoning, the content of the Linda problem is irrelevant; one does not even need to read the description of Linda. All that counts are the terms probable and and, which the conjunction rule interprets in terms of mathematical probability and logical AND, respectively. In contrast, I believe that sound reasoning begins by investigating the content of a problem to infer what terms such as probable mean. The meaning of probable is not reducible to the conjunction rule … For instance, the Oxford English Dictionary … lists “plausible,” “having an appearance of truth,” and “that may in view of present evidence be reasonably expected to happen,” among others. … Similarly, the meaning of and in natural language rarely matches that of logical AND. The phrase T&F can be understood as the conditional “If Linda is a bank teller, then she is active in the feminist movement.” Note that this interpretation would not concern and therefore could not violate the conjunction rule.

This is a case where I believe Gigerenzer makes an interesting point on the specific example but is wrong on the broader question. As a start, in discussing the initial results for their 1983 paper, Kahneman and Tversky asked whether people were interpreting the language in different ways, such as whether they took “Linda is a bank teller” to mean “Linda is a bank teller and not active in the feminist movement.” They considered the content of their problem and ran different experimental specifications to attempt to understand what was occurring.

But as Kahneman and Tversky state in their postscript, critiquing the Linda problem on this point – and only the within-subjects experimental design at that – is a narrow view of their work. The point of the Linda problem is to test whether the representativeness of the description changes the assessment. As they write in their 1996 paper:

Perhaps the most serious misrepresentation of our position concerns the characterization of judgmental heuristics as “independent of context and content” … and insensitive to problem representation … Gigerenzer also charges that our research “has consistently neglected Feynman’s (1967) insight that mathematically equivalent information formats need not be psychologically equivalent” … Nothing could be further from the truth: The recognition that different framings of the same problem of decision or judgment can give rise to different mental processes has been a hallmark of our approach in both domains.

The peculiar notion of heuristics as insensitive to problem representation was presumably introduced by Gigerenzer because it could be discredited, for example, by demonstrations that some problems are difficult in one representation (probability), but easier in another (frequency). However, the assumption that heuristics are independent of content, task, and representation is alien to our position, as is the idea that different representations of a problem will be approached in the same way.

This is a point where you need to look across the full set of experimental findings, rather than critiquing them one by one. Other experiments have people violating the conjunction rule while betting on sequences generated by a die, where there were no such confusions to be had about the content.

Much of the issue is also one of focus. Kahneman and Tversky have certainly investigated the question of how representation changes the approach to a problem. However, it is set out in a different way than Gigerenzer might have liked.

Argument 4: Should more effort be expended in understanding the underlying cognitive processes or mental models behind these various findings?

We now come to an important point: what is the cognitive process behind all of these results? Gigerenzer (1996) writes:

Kahneman and Tversky (1996) reported various results to play down what they believe is at stake, the effect of frequency. In no case was there an attempt to figure out the cognitive processes involved. …

Progress can be made only when we can design precise models that predict when base rates are used, when not, and why

I can see why Kahneman and Tversky focus on the claims regarding frequency representations when Gigerenzer makes such strong statements about making biases “disappear”. The statement that in no case have they attempted to figure out the cognitive processes involved is also overly strong, as a case could be made that the heuristics are those processes.

However, Gigerenzer believes Kahneman and Tversky’s heuristics are too vague for this purpose. Gigerenzer (1996) writes:

The heuristics in the heuristics-and-biases program are too vague to count as explanations. … The reluctance to specify precise and falsifiable process models, to clarify the antecedent conditions that elicit various heuristics, and to work out the relationship between heuristics have been repeatedly pointed out … The two major surrogates for modeling cognitive processes have been (a) one-word-labels such as representativeness that seem to be traded as explanations and (b) explanation by redescription. Redescription, for instance, is extensively used in Kahneman and Tversky’s (1996) reply. … Why does a frequency representation cause more correct answers? Because “the correct answer is made transparent” (p. 586). Why is that? Because of “a salient cue that makes the correct answer obvious” (p. 586), or because it “sometimes makes available strong extensional cues” (p. 589). Researchers are no closer to understanding which cues are more “salient” than others, much less the underlying process that makes them so.

The reader may now understand why Kahneman and Tversky (1996) and I construe this debate at different levels. Kahneman and Tversky centered on norms and were anxious to prove that judgment often deviates from those norms. I am concerned with understanding the processes and do not believe that counting studies in which people do or do not conform to norms leads to much. If one knows the process, one can design any number of studies wherein people will or will not do well.

This passage by Gigerenzer captures the state of the debate well. However, Kahneman and Tversky are relaxed about the lack of full specification, and sceptical that process models are the approach to provide that detail. They write in the 1996 postscript:

Gigerenzer rejects our approach for not fully specifying the conditions under which different heuristics control judgment. Much good psychology would fail this criterion. The Gestalt rules of similarity and good continuation, for example, are valuable although they do not specify grouping for every display. We make a similar claim for judgmental heuristics.

Gigerenzer legislates process models as the primary way to advance psychology. Such legislation is unwise. It is useful to remember that the qualitative principles of Gestalt psychology long outlived premature attempts at modeling. It is also unwise to dismiss 25 years of empirical research, as Gigerenzer does in his conclusion. We believe that progress is more likely to come by building on the notions of representativeness, availability, and anchoring than by denying their reality.

To me, this is the most interesting point of the debate. I have personally struggled to grasp the precise operation of many of Kahneman and Tversky’s heuristics and how their operation would change across various domains. But are more precisely specified models the way forward? Which are best at explaining the available data? We have now had over 20 years of work since this debate to see if this is an unwise or fruitful pursuit. But that’s a question for another day.

Barry Schwartz’s The Paradox of Choice: Why More Is Less

I typically find the argument that increased choice in the modern world is “tyrannising” us to be less than compelling. On this blog, I have approvingly quoted Jim Manzi’s warning against extrapolating the results of an experiment on two Saturdays in a particular store – the famous jam experiment – into “grandiose claims about the benefits of choice to society.” I recently excerpted a section from Bob Sugden’s excellent The Community of Advantage: A Behavioural Economist’s Defence of the Market on the idea that choice restriction “appeals to culturally conservative or snobbish attitudes of condescension towards some of the preferences to which markets cater.”

Despite this, I liked a lot of Barry Schwartz’s The Paradox of Choice: Why More Is Less. I still disagree with some of Schwartz’s recommendations, with his view that the “free market” undermines our well-being, and with his argument that areas such as “education, meaningful work, social relations, medical care” should not be addressed through markets. I believe he shows a degree of condescension toward other people’s preferences. However, I found that I agreed with much of Schwartz’s diagnosis of the problem, even if that doesn’t always extend to recommending the same treatment.

Schwartz’s basic argument is that increased choice can negatively affect our wellbeing. It can damage the quality of our decisions. We often regret our decisions when we see the trade-offs involved in our choice, with those trade-offs often multiplying with increased choice. We adapt to the consequences of our choices, meaning that the high costs of search may not be recovered.

The result is that we are not satisfied with our choices. Schwartz argues that once our basic needs are met, much of what we are trying to achieve is satisfaction. So if the new car, phone or brand of salad dressing doesn’t deliver satisfaction, are we worse off?

The power of Schwartz’s argument varies with the domain. When he deals with shopping, it is easy to see that the choices would be overwhelming to someone who wanted to examine all of the options (do we need all 175 salad dressings that are on display?). People report that they are enjoying shopping less, despite shopping more. But it is hard to feel that a decline in our enjoyment of shopping or the confusion we face looking at a sea of salad dressings is a serious problem.

Schwartz spends little time examining the benefits of increased consumer choice for individuals whose preferences are met, or the effect of the accompanying competition on price and quality. Schwartz has another book in which he tackles the problems with markets, so, not having read it, I can’t say he doesn’t have a case. But that case is absent from The Paradox of Choice.

In fairness to Schwartz, he does state that it is a big jump to extrapolate the increased complexity of shopping into claims that too much choice can “tyrannise”. Schwartz even notes that we do OK with many consumer choices. We implicitly execute strategies such as picking the same product each time.

Schwartz’s argument is more compelling when we move beyond consumer goods into important high-stakes decisions such as those about our health, retirement or work. A poor choice there can have large effects on both outcomes and satisfaction. These choices are of a scale that genuinely challenges our wellbeing.

The experimental evidence that we struggle with high-stakes choices is more persuasive evidence of a problem than experiments involving people having difficulty choosing jam. For instance, when confronted with a multitude of retirement plans, people tend to simply split between them rather than consider the merits or appropriate allocation. Tweak the options presented to them and you can markedly change the allocations. When faced with too many choices, they may simply not choose.

Schwartz’s argument about our failures when choosing draws heavily from the heuristics and biases literature, and a relatively solid part of the literature at that: impatience and inter-temporal inconsistency, anchoring and adjustment, availability, framing and so on. But in some ways, this isn’t the heart of Schwartz’s case. People are susceptible to error even when there are few choices, which is the typical scenario in the experiments in which these findings are made. And much of Schwartz’s case would hold even if we were not susceptible to these biases.

Rather, much of the problem that Schwartz identifies comes when we approach choices as maximisers instead of satisficers. Maximisation is the path to disappointment in a world of massive choice, as you will almost certainly not select the best option. Maximisers may not even make a choice as they are not comfortable with compromises and will tend to want to keep looking.

Schwartz and some colleagues created a maximisation scale, where survey respondents rate themselves against statements such as “I never settle for second best.” Those who rated high on the maximisation scale were less happy with life, less optimistic, more depressed and scored higher on regret. Why this correlation? Schwartz believes there is a causal role and that learning how to satisfice could increase happiness.

What makes this result interesting is that maximisers make better decisions when assessed objectively. Is objective or subjective success more important? Schwartz considers that once we have met our basic needs, what matters most is how we feel. Subjective satisfaction is the most important criterion.

I am not convinced that the story of satisfaction from particular choices flows into overall satisfaction. Take a particular decision and satisfice, and perhaps satisfaction for that particular decision is higher. Satisfice for every life decision, and what does your record of accomplishment look like? What is your assessment of satisfaction then? At the end of the book, Schwartz does suggest that we need to “choose when to choose”, and leave maximisation for the important decisions, so it seems he feels maximisation is important on some questions.

I also wonder about the second-order effects. If everyone satisficed to achieve higher personal satisfaction, what would we lose? How much do we benefit from the refusal of maximisers such as Steve Jobs or Elon Musk to settle for second best? Would a more satisfied world have fewer of the amazing accomplishments that give us so much value? Even if there were a personal gain to taking the foot off the pedal, would this involve a broader cost?

An interesting thread relating to maximisation concerns opportunity costs. Any economist will tell you that opportunity cost – the opportunity you forgo by choosing an option – is the benchmark against which options should be assessed. But Schwartz argues that assessing opportunity costs has costs in itself. Being forced to make decisions with trade-offs makes people unhappy, and considering the opportunity costs makes those trade-offs salient.

The experimental evidence on considering trade-offs is interesting. For instance, in one experiment a group of doctors was given a case history and a choice between sending the patient to a specialist or trying one other medication first. 75% chose the medication. Give the same choice to another group of doctors, but with the addition of a second medication option, and this time only 50% chose medication. Choosing the specialist is a way of avoiding a decision between the two medications. When there are trade-offs, all options can begin to look unappealing.

Another problem lurking for maximisers is regret, as the only way to avoid regret is to make the best possible decision. People will often avoid decisions if they could cause regret, or they aim for the regret minimising decision (which might be considered a form of satisficing).

There are some problems that arise even without the maximisation mindset. One is that expectations may increase with choice. Higher expectations create a higher benchmark to achieve satisfaction, and Schwartz argues that these expectations may lead to an inability to cope rather than more control. High expectations create the burden of meeting them. For example, job options are endless. You can live anywhere in the world. The nature of your relationships – such as decisions about marriage – has a flexibility far above that of our recent past. For many, this creates expectations that are unlikely to be met. Schwartz does note the benefits of these options, but the presence of a psychological cost means the consequences are not purely positive.

Then there is unanticipated adaptation. People tend to predict bigger hypothetical changes in their satisfaction than those reported by people who actually experienced the events. Schwartz draws on the often misinterpreted paper that compares the happiness of lottery winners with para- and quadriplegics. He notes that the long-term difference in happiness between the two groups is smaller than you would expect (although I am not sure what you would expect on a 5-point scale). The problem with unanticipated adaptation is that the cost of search does not get balanced by the subjective benefit that the chooser was anticipating.

So what should we do? Schwartz offers eleven steps to reduce the burden of choosing. Possibly the most important is the need to choose when to choose. Often it is not that any particular choice is problematic (although some experiments suggest it can be). Rather, it is the cumulative effect that is most damaging. Schwartz suggests picking the decisions that you want to invest effort in. Choosing when to choose allows adequate time and attention when we really want to choose. I personally do this: a wardrobe full of identical work shirts (although this involved a material initial search cost), a regular lunch spot, and many other routines.

Schwartz also argues that we should consider the opportunity costs of considering opportunity costs. Being aware of all the possible trade-offs, particularly when no option can dominate on all dimensions, is a recipe for disappointment. Schwartz suggests being a satisficer and only considering other options when you need to.

The final recommendation I will note is the need to anticipate adaptation. I personally find this a useful tool. Whenever I am making a new purchase I tend to recall a paragraph in Geoffrey Miller’s Spent, which often changes my view on a purchase:

You anticipate the minor mall adventure: the hunt for the right retail environment playing cohort-appropriate nostalgic pop, the perky submissiveness of sales staff, the quest for the virgin product, the self-restraint you show in resisting frivolous upgrades and accessories, the universe’s warm hug of validation when the debit card machine says “Approved,” and the masterly fulfillment of getting it home, turned on, and doing one’s bidding. The problem is, you’ve experienced all this hundreds of times before with other products, and millions of other people will experience it with the same product. The retail adventure seems unique in prospect but generic in retrospect. In a week, it won’t be worth talking about.

Miller’s point in that paragraph was about the signalling benefits of consumerism, but I find a similar mindset useful when thinking about the adaptation that will occur.

Gary Klein’s Sources of Power: How People Make Decisions

Summary: An important book describing how many experts make decisions, but with a lingering question mark about how good these decisions actually are.

—-

Gary Klein’s Sources of Power: How People Make Decisions is somewhat of a classic, with the version I read being a 20th anniversary edition issued by MIT Press. Klein’s work on expert decision making has reached a broad audience through Malcolm Gladwell’s Blink, and Klein’s adversarial collaboration with Daniel Kahneman (pdf) has given his work additional academic credibility.

However, despite the growing application of behavioural science in public policy and the private sphere, I have rarely seen Klein’s work applied in practice to improve decision making. The rare occasions where I see it referenced typically involve an argument that the conditions for the development of expertise do not exist in a particular domain.

This lack of application partly reflects the target of Klein’s research. Sources of Power is an exploration of what Klein calls naturalistic decision making. Rather than studying novices performing artificial tasks in the laboratory, naturalistic decision making involves the study of experienced decision makers performing realistic tasks. Klein’s aim is to document the strengths and capabilities of decision makers in natural environments with high stakes, such as lost lives or millions of dollars down the drain. These environments often involve uncertainty or missing information, and the goals may be unclear. Klein’s focus is therefore in the field and on the decisions of people such as firefighters, nurses, pilots and military personnel. They are typically people who have had many years of experience. They are “experts”.

The exploration of naturalistic decision making contrasts with the heuristics and biases program, which typically focuses on the limitations of decision makers and is the staple fodder of applied behavioural scientists. Using experimental findings from the heuristics and biases program to tweak decision environments and measure the response across many decision makers (typically through a randomised controlled trial) is more tractable than exploring, modifying and measuring the effect of interventions to improve the rare, high-stakes decisions of experts in environments where the goal itself might not even be clear.

Is Klein’s work “science”?

The evidence that shapes Sources of Power was typically obtained through case interviews with decision makers and by observing those decision makers in action. There are no experiments. The interviews are coded for analysis to attempt to find patterns in the approaches of the decision makers.

Klein is cognisant of the limitations of this approach. He notes that he gives detailed descriptions of each study so that we can judge the weaknesses of his approach ourselves. This brings his approach closer to what he considers to be a scientific piece of research. Klein writes:

What are the criteria for doing a scientific piece of research? Simply, that the data are collected so that others can repeat the study and that the inquiry depends on evidence and data rather than argument. For work such as ours, replication means that others could collect data the way we have and could also analyze and code the results as we have done.

The primary “weakness” of his approach is the reliance on observational data, not experiments. As Klein suggests, there are plenty of other sciences that have this feature. His approach is closer to anthropology than psychology. But obviously, an approach constrained to the laboratory has its own limitations:

Both the laboratory methods and the field studies have to contend with shortcomings in their research programs. People who study naturalistic decision making must worry about their inability to control many of the conditions in their research. People who use well-controlled laboratory paradigms must worry about whether their findings generalize outside the laboratory.

Klein has a faith in stories (the subject of one of the chapters) serving as natural experiments linking a network of causes to their effects. It is a fair point that stories can be used to communicate subtle points of expertise, but using them to reliably identify cause-effect relationships seems a step too far.

Recognition-primed decision making

Klein’s “sources of power” for decision-making by experts are intuition, mental simulation, metaphor and storytelling. This is in contrast to what might be considered a more typical decision-making toolkit (the one you are more likely to be taught) of logical thinking, probabilistic analysis and statistics.

Klein’s workhorse model integrating these sources of power is recognition-primed decision making. This is a two-stage process, involving an intuitive recognition of what response is required, followed by mental simulation of the response to see if it will work. Metaphors and storytelling are mental simulation tools. The recognition-primed model involves a blend of intuition and analysis, so it is not just sourced from gut feelings.

From the perspective of the decision maker, someone using this model might not consider that they are making a decision. They are not generating options and then evaluating them to determine the best choice.

Instead, they would see their situation as a prototype for which they know the typical course of action right away. As their experience allows them to generate a reasonable response in the first instance, they do not need to think of others. They simply evaluate the first option and, if suitable, execute. A decision was made, in that alternative courses of action were available and could have been chosen. But there was no explicit examination across options.

Klein calls this process singular evaluation, as opposed to comparative evaluation. Singular evaluation may involve moving through multiple options, but each is considered on its own merits sequentially until a suitable option is found, with the search stopping at that point.

The result of this process is “satisficing”, a term coming from Herbert Simon. These experts do not optimise. They pick the first option that works.
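A minimal sketch of the contrast (my illustration, not Klein’s model): singular evaluation checks options one at a time against a “good enough” test and stops at the first pass, whereas comparative evaluation scores every option and picks the best.

```python
# Illustrative contrast between singular evaluation (satisficing) and
# comparative evaluation (maximising). Options and scores are invented.

def satisfice(options, is_good_enough):
    """Consider options one at a time; return the first that works."""
    for option in options:
        if is_good_enough(option):
            return option
    return None  # no workable option found

def maximise(options, score):
    """Score every option and return the best."""
    return max(options, key=score)

plans = [
    {"name": "plan A", "score": 7},
    {"name": "plan B", "score": 9},
]

print(satisfice(plans, lambda p: p["score"] >= 6))  # plan A: first workable option
print(maximise(plans, lambda p: p["score"]))        # plan B: best overall
```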

Klein’s examination of various experts found that the recognition-primed decision model was the dominant mode of decision making, despite his initial expectation of comparative evaluation. For instance, fireground commanders used recognition-primed decision making for around 80% of the decisions that Klein’s team examined. Klein also points to similar evidence of decision making by chess grandmasters, who spend little time comparing the strengths and weaknesses of one move to another. Most of their time involves simulating the consequences and rejecting moves.

Mental simulation

Mental simulation involves the expert imagining the situation and transforming it until they can picture it in a different way from how it started. Mental simulations are typically not overly elaborate, and generally rely on just a few factors (rarely more than three). The expert runs the simulation and assesses: can it pass an internal evaluation? Mental simulations can sometimes be wrong, but Klein considers them to be fairly accurate.

Klein’s examples of mental simulation were not always convincing. For example, he describes an economist who mentally simulated what the Polish economy would do following interventions to reduce inflation. It is hard to take seriously single examples of such mental simulation hitting the mark when I am aware of so many backfires in this space. And how would expertise in such economic simulations develop? (More on developing expertise below.)

One strength of simulations is that they can be used where traditional decision analytic strategies do not apply. You can use simulations (or stories) if you cannot otherwise remember every piece of information. Klein points to evidence that this is how juries absorb evidence.

One direct use of simulation is the premortem strategy: imagine that, at some point in the future, your plan has failed, and work out why. You can also use simulation to build decision scenarios.

Novices versus experts

Expertise has many advantages. Klein notes experts can see the world differently, have more procedures to apply, notice problems more quickly, generate richer mental simulations and have more analogies to draw on. Experts can see things that novices can’t. They can see anomalies, violations of expectancies, the big picture, how things work, additional opportunities and improvisations, future events, small differences, and their own limitations.

Interestingly, while experts tend not to carefully deliberate about the merits of different courses of action, novices need to compare different approaches. Novices are effectively thinking through the problem from scratch. The rational choice method helps us when we lack the expertise to assess a situation.

Another contrast is where effort is expended. Experts spend most of their effort on situation assessment, which is what yields the answer. Novices spend more time determining the course of action.

One interesting thread concerned what happens when time pressure is put on chess players. Time constraints barely degraded the performance of masters, but destroyed that of novices. The masters often came up with their best move first, so they had little need for extra time to test many options.

Developing good decision making

Given the differences between novices and experts, how should novices develop good decision making? Klein suggests this should not be done through training in formal methods of analysis. In fact, this could get in the way of developing expertise. There is also no need to teach the recognition-primed model as it is descriptive: it shows what good decision makers already do. We shouldn’t teach people to think like experts.

Rather, we should teach people to learn like experts. They should engage in deliberate practice, obtain feedback that is accurate and timely, and enrich learning by reviewing prior experience and examining mistakes. The intuition that drives recognition grows out of experience.

Recognition versus analytical methods

Klein argues that recognition strategies are not a substitute for analytical methods, but an improvement. Analytical methods are the fallback for those without experience.

Klein sees a range of environments where recognition strategies will be the superior option. These include time pressure, an experienced decision maker in the domain, dynamic conditions (where analytic effort can be rendered useless if conditions shift), and ill-defined goals (which make it hard to develop evaluation criteria). Comparative evaluation is more useful where people have to justify their choice, where it is required for conflict resolution, where you are trying to optimise (as opposed to finding a merely workable option), and where the decision is computationally complex (e.g. an investment portfolio).

Given all this, it is hard to use a rigorous analytical approach in many natural settings. Rational, linear approaches run into problems when the goal is shifting or ill-defined.

Diagnosing poor decisions

I previously posted some of Klein’s views on the heuristics and biases approach to assessing decision quality. Needless to say, Klein is sceptical that poor decisions are largely due to faulty reasoning. More effort should be expended in finding the sources of poor decisions, rather than blaming the operator.

Klein describes a review of a sample of 25 decisions with poor outcomes (from the 600 he had available) to assess what went wrong. Sixteen of the poor outcomes were due to lack of experience, such as someone not realising that the construction of the building on fire was problematic. The second most common issue was lack of information. The third most common involved noticing problems during mental simulation but explaining them away – possibly involving bias.

Conditions for expertise

The conditions for developing the expertise needed for effective recognition-primed decision making are examined in depth in Klein’s article with Daniel Kahneman, Conditions for Intuitive Expertise: A Failure to Disagree (pdf). However, Klein does examine this area to some degree in Sources of Power.

Klein notes that it is one thing to gain experience, and another to turn it into expertise. It is often difficult to see cause-and-effect relationships, there is typically a delay between the two, and it is difficult to disentangle luck and skill. Drawing on work by Jim Shanteau, Klein also notes that expertise is hard to develop when the domain is dynamic, when we need to predict human behaviour, when there is less chance for feedback, when there is not enough repetition to get a sense of typicality, or when there are few trials. Funnily enough, this description aligns somewhat with many naturalistic decision-making environments.

Despite these barriers, Klein believes that it is possible to develop expertise in some areas, such as fighting fires, caring for hospitalised infants or flying planes. Less convincingly (given some research in the area), he also references the fine discrimination of wine tasters (e.g.).

Possibly my biggest criticism of Klein’s book relates to this final point, as he provides little evidence for the conversion of experience into expertise beyond the observation that in many of these domains novices are completely lost. Is the best benchmark a comparison with a novice who has no idea, or is it better to look at, say, a simple algorithm, statistical rule, or someone with basic training?

A review of 2018 and some thoughts on 2019

As a record largely for myself, below are some notes in review of 2018 and a few thoughts about 2019.

Writing: I started 2018 intending to post to this blog at least once a week, which I did. I set this objective as I had several long stretches in 2017 where I dropped the writing habit.

I write posts in batches and schedule in advance, so the weekly target did not require a weekly focus. However, at times I wrote shorter posts that I might not have otherwise written to make sure there was a sufficient pipeline. Traffic for the blog was similar to the previous year, with around 100,000 visitors, although unlike previous years there was no runaway post with tens of thousands of views. Three of the 10 most popular posts were written during the year.

In 2019, I am relaxing my intention to post on the blog every week (although that will largely still happen). I will prioritise writing on what I want to think about, rather than achieving a consistent flow of posts.

I wrote three articles for Behavioral Scientist during the year, and I plan to increase my output for external forums such as Behavioral Scientist in 2019. My major rationale for blogging is that I think (and learn) about issues better when I write for public consumption, and forums outside the blog amplify that learning experience.

I also had a paper published in Evolution & Human Behavior (largely written in 2017). For the immediate future, I plan to stop writing academic articles unless I come up with a cracker of an idea. Having another academic publication provides little career value, and the costs of the academic publication process outweigh the limited benefit that comes from the generally limited readership.

For some time I have had two book ideas that I would love to attack, but I made no progress on them in 2018. One traces back to my earlier interest and writings on the link between economics and evolutionary biology. The other is an attempt to distil the good from the bad in behavioural economics – a sceptical take, if you like. Given what else is on my plate (particularly a new job), I’d need a strong external stimulus to progress these in 2019, but I wouldn’t rule out dabbling with one.

Reading: I read 79 books in 2018 (47 non-fiction, 32 fiction). I read fewer books than in a typical year, largely due to having three children aged four and under. My non-fiction selection was less deliberate than I would have liked and included fewer classics than I planned. In 2019 I plan to read more classics and more books that directly relate to what I am doing or want to learn, and to pick up fewer books on a whim.

I’m not sure how many academic articles I read, but I read at least part of an article most days.

Focus: I felt the first half of 2018 was more focused and productive than the second. For various reasons, I got sucked into a few news cycles late in the year, with almost zero benefit. I continued to use most of the productivity hacks described in my post on how I focus (and live) – one of the most popular posts this year – and I continue to struggle with the distraction of email.

I am meditating less than when I wrote that post (then daily), but still do meditate a couple of times a week for 10 to 20 minutes when I am able to get out for a walk at lunch. I use 10% Happier for this. I find meditation most useful as a way to refocus, as opposed to silencing or controlling the voices in my head.

Health: I continue to eat well (three parts Paleo, one part early agriculturalist), excepting the Christmas break where I relax all rules (I like to think of myself as a Hadza tribesman discovering a bunch of bee hives, although it’s more a case of me simply eating like the typical Australian for a week or two).

I surf at least once most weeks. My gym attendance waxed and waned with various injuries (wrist, back), so my strength and fitness are below my average level of the last five years, although not by a great amount.

With all of the chaps generally sleeping through the night, I had the best year of sleep I have had in three years.

Work: I lined up a new role to start in late January this year. For almost three years I have been building the data science capability in my organisation, and have recruited a team that is largely technically stronger than me and can survive (thrive) without me. I’m shifting back into a behavioural science role (although similarly building a capability), which is closer to my interests and skillset. I’m also looking forward to shifting back into the private sector.

I plan to use the change in work environment to reset some work habits, including batching email and entering each day with a better plan on how I will tackle the most important (as opposed to the most urgent) tasks.

Life: Otherwise, I had few major life events. I bought my first house (settlement coming early this year). It meets the major goals of being a five-minute walk from a surfable beach, next to a good school, and sufficient to cater to our needs for at least the next ten years.

Another event that had a large effect on me was an attempt to save a drowning swimmer while surfing at my local beach (some news on it here and here). It reinforced something that I broadly knew about myself – that I feel calm and focused in a crisis, but typically dwell heavily on it in the aftermath. My attention was shot for a couple of weeks after. It was also somewhat of a learning experience of how difficult a water rescue is and how different CPR is on a human compared to a training dummy. My thinking about this day has brought a lot of focus onto what I want to do this year.

Carol Dweck’s Mindset: Changing the Way You Think to Fulfil Your Potential

I did not find Carol Dweck’s Mindset: Changing the Way You Think to Fulfil Your Potential to be a compelling translation of academic work into a popular book. To all the interesting debates concerning growth mindset – such as Scott Alexander’s series of growth mindset posts (1, 2, 3 and 4), the recent meta-analysis (with Carol Dweck response), and replication of the effect – the book adds little material that might influence your views. If you want to better understand the case for (or against) growth mindset and its link with ability or performance, skip the book, follow the above links and go to the academic literature.

As a result, I will limit my comments on the book to a few narrow points, and add a dash of personal reflection.

In the second post in his series, Alexander describes two positions on growth mindset. The first is the “bloody obvious position”:

[I]nnate ability might matter, but that even the most innate abilityed person needs effort to fulfill her potential. If someone were to believe that success were 100% due to fixed innate ability and had nothing to do with practice, then they wouldn’t bother practicing, and they would fall behind. Even if their innate ability kept them from falling behind morons, at the very least they would fall behind their equally innate abilityed peers who did practice.

Dweck and Alexander (and I) believe this position.

Then there is the controversial position:

The more important you believe innate ability to be compared to effort, the more likely you are to stop trying, to avoid challenges, to lie and cheat, to hate learning, and to be obsessed with how you appear before others. …

To distinguish the two, Alexander writes:

In the Bloody Obvious Position, someone can believe success is 90% innate ability and 10% effort. They might also be an Olympian who realizes that at her level, pretty much everyone is at a innate ability ceiling, and a 10% difference is the difference between a gold medal and a last-place finish. So she practices very hard and does just as well as anyone else.

According to the Controversial Position, this athlete will still do worse than someone who believes success is 80% ability and 20% effort, who will in turn do worse than someone who believes success is 70% ability and 30% effort, all the way down to the person who believes success is 0% ability and 100% effort, who will do best of all and take the gold medal.

The bloody obvious and controversial positions are often conflated in popular articles, and in Dweck’s book the conflation shifts up another gear. The book is interspersed with stories about people expending some effort to improve or win, with almost no information about what they believe about growth and ability. That they are expending effort is almost taken to be evidence of the growth mindset. At best, the stories are evidence for the bloody obvious position.

But Dweck’s strong statements about growth mindset throughout the book make it clear that she holds the controversial position. Here are some snippets from the introduction:

Believing that your qualities are carved in stone—the fixed mindset—creates an urgency to prove yourself over and over.

[I]t’s startling to see the degree to which people with the fixed mindset do not believe in effort.

It’s not just that some people happen to recognize the value of challenging themselves and the importance of effort. Our research has shown that this comes directly from the growth mindset.

Although Dweck marshals her stories from business and sport to support these “controversial position” claims, it doesn’t work absent evidence about what these people believed. Add in the survivorship bias in the examples at hand, plus the halo effect in assessing whether the successful people have a “growth mindset”, and there is little compelling evidence that these people held a growth or fixed mindset (as per the “controversial position”) or that the mindset in turn caused the outcomes.

To find evidence in support of Dweck’s statements you need to turn to the academic work, but Dweck covers her research in little detail. From the limited descriptions in the book, it was often hard to know what the experiment involved and how much weight to give it. The book pointed me to interesting papers, but absent that visit to the academic literature I felt lost.

One point that becomes clear through the book is that Dweck sees people with growth mindsets as having a host of other positive traits. At times it feels like growth mindset is being expanded to encompass all positive behaviours. These include:

  • Embracing challenges and persisting after setbacks
  • Seeking and learning from criticism
  • Understanding the need to invest effort to develop expertise
  • Seeking forgiveness rather than revenge against those who have done them wrong, such as when they are bullied
  • Being happy whatever their outcomes (be it their own, their team’s or their child’s)
  • Coaching with compassion and consideration, rather than fear and intimidation
  • More accurate estimation of performance and ability
  • Accurately weighting positive and negative information (compared to the extreme reactions of the fixed mindset people)

I need to better understand the literature on how growth mindset correlates with (or causes) these kinds of behaviours, but a lot is being put into the growth mindset basket.

In some of Dweck’s examples of people without a growth mindset, there is a certain boldness. John McEnroe is a recurring example, despite his seven singles and ten doubles grand slam titles. On McEnroe’s note that part of 1982 did not go as well as expected because little things kept him off his game (he still ended the year number one), Dweck asks, “Always a victim of outside forces. Why didn’t he take charge and learn how to perform well in spite of them?” McEnroe later recorded the best single-season record in the open era (82-3 in 1984), ending the year at number one for the fourth straight time. McEnroe feels he did not fulfil his potential as he often folded when the going got tough, but would he really have been more successful with a “growth mindset”?

Similarly, Mike Tyson is labelled as someone who “reached the top, but … didn’t stay there”, despite having the third longest unified championship reign in heavyweight history with eight consecutive defences. Tyson obviously had some behavioural issues, but would he have been the same fighter if he didn’t believe in his ability?

——

On a personal note: Dweck’s picture of someone with a “fixed mindset” is a good description of me. Through primary and high school I was always the “smartest” kid in (my small rural, then regional) school, despite investing close to zero effort outside of the classroom. I spent the evenings before my university entrance exams shooting a basketball.

My results gave me the pick of Australian universities and scholarships, but I then dropped out of my first two attempts at university, and followed that by dropping out of Duntroon (Australia’s army officer training establishment, our equivalent to West Point). For the universities, lecture attendance alone was not enough. I was simply too lazy and immature to make it through Duntroon. (Maybe I lacked “grit”.)

After working in a chicken factory to fund a return to university (not recommended), I finally obtained a law degree, although I did so with a basic philosophy of doing just enough to pass.

Through this stretch, I displayed a lot of Dweck’s archetypical “fixed mindset” behaviours. I loved demonstrating how smart I was in domains where I was sure I would do well, and hated the risk of being shown up as otherwise in any domain where I wasn’t. (My choice of law was somewhat strange in this regard, as my strength is my quantitative ability. I chose law largely because this is what “smart” kids do.) I dealt with failure poorly.

It took five years after graduation before I finally realised that I needed to invest some effort to get anywhere – which happened to be a different direction to where I had previously been heading. I have spent most of my time since then investing in my intellectual capital. I am more than willing to try and fail. I am always looking for new learning opportunities. I am happy asking “dumb” questions. I want to prove myself wrong.

Do I now have a “growth mindset”? I don’t believe that anyone can achieve anything. IQ is malleable, but only at the margins, and we have a very poor understanding of how to change it. But I have a strong belief that effort pays off, and that absent effort natural abilities can be wasted. I hold the bloody obvious position but not the controversial position. If I were able to blot from my mind the evidence for, say, the genetic contribution to intelligence, could I do even better?

Despite finding limited value in the book from an intellectual standpoint, I can see its appeal. It was a reminder of the bloody obvious position. It highlighted that many of the so-called growth mindset traits or behaviours can be valuable (whether or not they are accompanied by a growth mindset). There was something in there that suggested I should try a bit harder. Maybe that makes it a useful book after all.

Books I read in 2018

The best books I read in 2018 – generally released in other years – are below. Where I have reviewed a book, the link leads to that review.

  • Nick Bostrom, Superintelligence: Paths, Dangers, Strategies (2014) – Changed my mind, and gave me a framework for thinking about the problem that I didn’t have before.
  • Annie Duke, Thinking in Bets: Making Smarter Decisions When You Don’t Have All the Facts (2018) – While I have many small quibbles with the content, and it could easily have been a long-form article, I  liked the overarching approach and framing.
  • Gary Klein, Sources of Power: How People Make Decisions (1998) – Rightfully considered a classic in decision-making. Review coming soon.
  • Michael Lewis’s The Undoing Project (2016) – Despite focusing on Kahneman and Tversky’s relationship, it is also one of the better introductions to their work.
  • Robert Sapolsky’s Behave: The Biology of Humans at Our Best and Worst (2017) – A wonderful examination of the “causes” of our actions. Sapolsky zooms out from the almost immediate activity in our brain, to the actions of our hormones over seconds to hours, through our developmental influences, out to our evolutionary past. Review also coming soon.
  • Robert Sapolsky’s Why Zebras Don’t Get Ulcers (3rd ed, 2004) – Great writing and interesting science.
  • Fred Schwed, Where Are the Customer’s Yachts? (1955) – Timeless commentary on the value delivered by the financial services sector
  • Robert Sugden, The Community of Advantage: A Behavioural Economist’s Defence of the Market (2018) – The most compelling critique of the practical application of behavioural economics that I have read.
  • Joseph Conrad, Lord Jim – I love Conrad. Nostromo is possibly my favourite book.
  • Daphne Du Maurier, Rebecca
  • Henry James, Turn of the Screw

Below is the full list of books that I read in 2018 (with links where reviewed and starred if a reread). Relative to previous years, I read (and reread) fewer books in total, less non-fiction, more fiction. That was largely a consequence of regularly reading my youngest to sleep.

My non-fiction reading through 2018 was less deliberate than I would have liked. There are fewer timeless pieces in the list than usual, with many of the choices based on whim or the particular piece of work I was doing at the time.

Non-Fiction

Fiction

  • Christopher Buckley, Thank You For Smoking*
  • Edgar Rice Burroughs, The Return of Tarzan
  • Edgar Rice Burroughs, Tarzan of the Apes
  • Ray Bradbury, Fahrenheit 451*
  • Joseph Conrad, Lord Jim
  • James Fenimore Cooper, The Last of the Mohicans
  • Charles Dickens, Bleak House
  • Charles Dickens, A Christmas Carol
  • Charles Dickens, A Tale of Two Cities
  • Charles Dickens, Hard Times
  • Wilkie Collins, The Moonstone
  • Arthur Conan Doyle, A Study in Scarlet*
  • Arthur Conan Doyle, The Hound of the Baskervilles*
  • Arthur Conan Doyle, The Lost World
  • Daphne Du Maurier, Rebecca
  • George Eliot, Middlemarch
  • Gustave Flaubert, Madame Bovary
  • Henry James, Turn of the Screw
  • James Joyce, The Dubliners
  • Andrew Lang, The Arabian Nights
  • George Bernard Shaw, Pygmalion
  • Upton Sinclair, The Jungle
  • Robert Louis Stevenson, Kidnapped
  • Bram Stoker, Dracula*
  • JRR Tolkien, The Hobbit*
  • Mark Twain, The Adventures of Huckleberry Finn
  • Mark Twain, The Adventures of Tom Sawyer*
  • Jules Verne, Journey to the Center of the Earth
  • Andy Weir, The Martian
  • Oscar Wilde, The Picture of Dorian Gray
  • PG Wodehouse, Carry on, Jeeves
  • PG Wodehouse, Meet Mr Mulliner

Gary Klein on confirmation bias in heuristics and biases research, and explaining everything

Confirmation bias

In Sources of Power: How People Make Decisions (review coming soon), Gary Klein writes:

Kahneman, Slovic, and Tversky (1982) present a range of studies showing that decision makers use a variety of heuristics, simple procedures that usually produce an answer but are not foolproof. … The research strategy was not to demonstrate how poorly we make judgments but to use these findings to uncover the cognitive processes underlying judgments of likelihood.

Lola Lopes (1991) has shown that the original studies did not demonstrate biases, in the common use of the term. For example, Kahneman and Tversky (1973) used questions such as this: “Consider the letter R. Is R more likely to appear in the first position of a word or the third position of a word?” The example taps into our heuristic of availability. We have an easier time recalling words that begin with R than words with R in the third position. Most people answer that R is more likely to occur in the first position. This is incorrect. It shows how we rely on availability.

Lopes points out that examples such as the one using the letter R were carefully chosen. Of the twenty possible consonants, twelve are more common in the first position. Kahneman and Tversky (1973) used the eight that are more common in the third position. They used stimuli only where the availability heuristic would result in a wrong answer. … [I have posted some extracts of Lopes’s article here.]

There is an irony here. One of the primary “biases” is confirmation bias—the search for information that confirms your hypothesis even though you would learn more by searching for evidence that might disconfirm it. The confirmation bias has been shown in many laboratory studies (and has not been found in a number of studies conducted in natural settings). Yet one of the most common strategies of scientific research is to derive a prediction from a favorite theory and test it to show that it is accurate, thereby strengthening the reputation of that theory. Scientists search for confirmation all the time, even though philosophers of science, such as Karl Popper (1959), have urged scientists to try instead to disconfirm their favorite theories. Researchers working in the heuristics and biases paradigm condemn this sort of bias in their subjects, even as those same researchers perform more laboratory studies confirming their theories.

On explaining everything

On 3 July 1988, a missile fired from the USS Vincennes destroyed a commercial Iran Air flight climbing out over the Persian Gulf, killing all on board. The crew of the Vincennes had incorrectly identified the aircraft as an attacking F-14.

Klein writes:

The Fogarty report, the official U.S. Navy analysis of the incident, concluded that “stress, task fixation, an unconscious distortion of data may have played a major role in this incident. [Crew members] became convinced that track 4131 was an Iranian F-14 after receiving the … report of a momentary Mode II. After this report of the Mode II, [a crew member] appear[ed] to have distorted data flow in an unconscious attempt to make available evidence fit a preconceived scenario (‘Scenario fulfillment’).” This explanation seems to fit in with the idea that mental simulation can lead you down a garden path to where you try to explain away inconvenient data. Nevertheless, trained crew members are not supposed to distort unambiguous data. According to the Fogarty report, the crew members were not trying to explain away the data, as in a de minimus explanation. They were flat out distorting the numbers. This conclusion does not feel right.

The conclusion of the Fogarty report was echoed by some members of a five-person panel of leading decision researchers, who were invited to review the evidence and report to a congressional subcommittee. Two members of the panel specifically attributed the mistake to faulty decision making. One described how the mistake seemed to be a clear case of expectancy bias, in which a person sees what he is expecting to see, even when it departs from the actual stimulus. He cited a study by Bruner and Postman (1949) in which subjects were shown brief flashes of playing cards and asked to identify each. When cards such as the Jack of Diamonds were printed in black, subjects would still identify it as the Jack of Diamonds without noticing the distortion. The researcher concluded that the mistake about altitude seemed to match these data; subjects cannot be trusted to make accurate identifications because their expectancies get in the way.

I have talked with this decision researcher, who explained how the whole Vincennes incident showed a Combat Information Center riddled with decision biases. That is not how I understand the incident. My reading of the Fogarty report shows a team of men struggling with an unexpected battle, trying to guess whether an F-14 is coming over to blow them out of the water, waiting until the very last moment for fear of making a mistake, hoping the pilot will heed the radio warnings, accepting the risk to their lives in order to buy some more time.

To consider this alleged expectancy bias more carefully, imagine what would have happened if the Vincennes had not fired and in fact had been attacked by an F-14. The Fogarty report stated that in the Persian Gulf, from June 2, 1988, to July 2, 1988, the U.S. Middle East Forces had issued 150 challenges to aircraft. Of these, it was determined that 83 percent were issued to Iranian military aircraft and only 1.3 percent to aircraft that turned out to be commercial. So we can infer that if a challenge is issued in the gulf, the odds are that the airplane is Iranian military. If we continue with our scenario, that the Vincennes had not fired and had been attacked by an F-14, the decision researchers would have still claimed that it was a clear case of bias, except this time the bias would have been to ignore the base rates, to ignore the expectancies. No one can win. If you act on expectancies and you are wrong, you are guilty of expectancy bias. If you ignore expectancies and are wrong, you are guilty of ignoring base rates and expectancies. This means that the decision bias approach explains too much (Klein, 1989). If an appeal to decision bias can explain everything after the fact, no matter what has happened, then there is no credible explanation.

I’m not sure the right base rate is the proportion of aircraft challenged, but it is still an interesting point.
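For what it’s worth, a quick back-of-the-envelope calculation on the numbers quoted from the Fogarty report shows the expectancies Klein has in mind:

```python
# Quick arithmetic on the base rates quoted from the Fogarty report: 150
# challenges issued in the month before the incident, 83 percent to Iranian
# military aircraft and 1.3 percent to aircraft that turned out to be commercial.
challenges = 150
military = 0.83 * challenges      # roughly 125 challenges to Iranian military aircraft
commercial = 0.013 * challenges   # roughly 2 challenges to commercial aircraft
print(round(military), round(commercial), round(military / commercial))
# A challenged aircraft was on the order of 60 times more likely to be military
# than commercial - the expectancy Klein argues the crew was acting on.
```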

In contrast to less-is-more claims, ignoring information is rarely, if ever, optimal

From the abstract of an interesting paper Heuristics as Bayesian inference under extreme priors by Paula Parpart and colleagues:

Simple heuristics are often regarded as tractable decision strategies because they ignore a great deal of information in the input data. One puzzle is why heuristics can outperform full-information models, such as linear regression, which make full use of the available information. These “less-is-more” effects, in which a relatively simpler model outperforms a more complex model, are prevalent throughout cognitive science, and are frequently argued to demonstrate an inherent advantage of simplifying computation or ignoring information. In contrast, we show at the computational level (where algorithmic restrictions are set aside) that it is never optimal to discard information. Through a formal Bayesian analysis, we prove that popular heuristics, such as tallying and take-the-best, are formally equivalent to Bayesian inference under the limit of infinitely strong priors. Varying the strength of the prior yields a continuum of Bayesian models with the heuristics at one end and ordinary regression at the other. Critically, intermediate models perform better across all our simulations, suggesting that down-weighting information with the appropriate prior is preferable to entirely ignoring it. Rather than because of their simplicity, our analyses suggest heuristics perform well because they implement strong priors that approximate the actual structure of the environment.

The following excerpts from the paper (minus references) help give more context to this argument. First, what is meant by a simple heuristic as opposed to a full-information model?

Many real-world prediction problems involve binary classification based on available information, such as predicting whether Germany or England will win a soccer match based on the teams’ statistics. A relatively simple decision procedure would use a rule to combine available information (i.e., cues), such as the teams’ league position, the result of the last game between Germany and England, which team has scored more goals recently, and which team is home versus away. One such decision procedure, the tallying heuristic, simply checks which team is better on each cue and chooses the team that has more cues in its favor, ignoring any possible differences among cues in magnitude or predictive value. … Another algorithm, take-the-best (TTB), would base the decision on the best single cue that differentiates the two options. TTB works by ranking the cues according to their cue validity (i.e., predictive value), then sequentially proceeding from the most valid to least valid until a cue is found that favors one team over the other. Thus TTB terminates at the first discriminative cue, discarding all remaining cues.

In contrast to these heuristic algorithms, a full-information model such as linear regression would make use of all the cues, their magnitudes, their predictive values, and observed covariation among them. For example, league position and number of goals scored are highly correlated, and this correlation influences the weights obtained from a regression model.
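To give a feel for the algorithms being compared, here is a rough sketch of tallying and take-the-best. It is my own illustration, not code from the paper, and the cue values and validities for the soccer example are invented.

```python
# A rough sketch of the tallying and take-the-best (TTB) heuristics for a binary
# choice between two options described by binary cues (1 means the cue favours
# that option). Cue values and validities are invented for illustration.

def tallying(cues_a, cues_b):
    """Choose the option favoured by more cues, ignoring weights and covariation."""
    score_a, score_b = sum(cues_a), sum(cues_b)
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"

def take_the_best(cues_a, cues_b, validities):
    """Check cues from most to least valid; decide on the first that discriminates."""
    order = sorted(range(len(validities)), key=lambda i: validities[i], reverse=True)
    for i in order:
        if cues_a[i] != cues_b[i]:
            return "A" if cues_a[i] > cues_b[i] else "B"
    return "tie"  # no cue discriminates

# Hypothetical cues: league position, last head-to-head, recent goals, home advantage.
germany = [1, 1, 0, 0]   # option A
england = [1, 0, 1, 1]   # option B
validities = [0.9, 0.8, 0.7, 0.6]

print(tallying(germany, england))                    # "B": England has more cues in its favour
print(take_the_best(germany, england, validities))   # "A": decided by the last head-to-head cue
```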

So why might less be more?

Heuristics have a long history of study in cognitive science, where they are often viewed as more psychologically plausible than full-information models, because ignoring data makes the calculation easier and thus may be more compatible with inherent cognitive limitations. This view suggests that heuristics should underperform full-information models, with the loss in performance compensated by reduced computational cost. This prediction is challenged by observations of less-is-more effects, wherein heuristics sometimes outperform full-information models, such as linear regression, in real-world prediction tasks. These findings have been used to argue that ignoring information can actually improve performance, even in the absence of processing limitations. … Gigerenzer and Brighton (2009) conclude, “A less-is-more effect … means that minds would not gain anything from relying on complex strategies, even if direct costs and opportunity costs were zero”.

Less-is-more arguments also arise in other domains of cognitive science, such as in claims that learning is more successful when processing capacity is (at least initially) restricted.

The current explanation for less-is-more effects in the heuristics literature is based on the bias-variance dilemma. … From a statistical perspective, every model, including heuristics, has an inductive bias, which makes it best-suited to certain learning problems. A model’s bias and the training data are responsible for what the model learns. In addition to differing in bias, models can also differ in how sensitive they are to sampling variability in the training data, which is reflected in the variance of the model’s parameters after training (i.e., across different training samples).

A core tool in machine learning and psychology for evaluating the performance of learning models, cross-validation, assesses how well a model can apply what it has learned from past experiences (i.e., the training data) to novel test cases. From a psychological standpoint, a model’s cross-validation performance can be understood as its ability to generalize from past experience to guide future behavior. How well a model classifies test cases in cross-validation is jointly determined by its bias and variance. Higher flexibility can in fact hurt performance because it makes the model more sensitive to the idiosyncrasies of the training sample. This phenomenon, commonly referred to as overfitting, is characterized by high performance on experienced cases from the training sample but poor performance on novel test items. …

Bias and variance tend to trade off with one another such that models with low bias suffer from high variance and vice versa. With small training samples, more flexible (i.e., less biased) models will overfit and can be bested by simpler (i.e., more biased) models such as heuristics. As the size of the training sample increases, variance becomes less influential and the advantage shifts to the complex models.
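The bias-variance argument lends itself to a quick simulation. The sketch below is my own construction, not the paper’s: it compares ordinary least squares with a fixed unit-weight model, in the spirit of tallying, on an invented environment. With tiny training samples the rigid model tends to generalise better out of sample, and the advantage shifts back to regression as the sample grows.

```python
# A small simulation in the spirit of the bias-variance argument, comparing
# ordinary least squares (a full-information model) against fixed unit weights
# (a tallying-like model) on an invented environment. The comparison is made
# out of sample, via the correlation between predictions and outcomes.

import numpy as np

rng = np.random.default_rng(0)
n_cues, n_test, noise_sd = 5, 1000, 2.0
true_weights = rng.uniform(0.5, 1.5, n_cues)   # every cue points in the same direction

def simulate(n):
    X = rng.normal(size=(n, n_cues))
    y = X @ true_weights + rng.normal(scale=noise_sd, size=n)
    return X, y

def test_correlation(weights, X_test, y_test):
    return np.corrcoef(X_test @ weights, y_test)[0, 1]

X_test, y_test = simulate(n_test)
for n_train in (10, 20, 200):
    ols_scores, unit_scores = [], []
    for _ in range(200):
        X_train, y_train = simulate(n_train)
        ols_w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)  # fits weights; overfits when n is small
        unit_w = np.ones(n_cues)                                   # ignores magnitudes and covariation
        ols_scores.append(test_correlation(ols_w, X_test, y_test))
        unit_scores.append(test_correlation(unit_w, X_test, y_test))
    print(f"n_train={n_train}: OLS {np.mean(ols_scores):.3f}, unit weights {np.mean(unit_scores):.3f}")
```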

So what is an alternative explanation for the performance of heuristics?

The Bayesian framework offers a different perspective on the bias-variance dilemma. Provided a Bayesian model is correctly specified, it always integrates new data optimally, striking the perfect balance between prior and data. Thus using more information can only improve performance. From the Bayesian standpoint, a less-is-more effect can arise only if a model uses the data incorrectly, for example by weighting it too heavily relative to prior knowledge (e.g., with ordinary linear regression, where there effectively is no prior). In that case, the data might indeed increase estimation variance to the point that ignoring some of the information could improve performance. However, that can never be the best solution. One can always obtain superior predictive performance by using all of the information but tempering it with the appropriate prior.

Heuristics may work well in practice because they correspond to infinitely strong priors that make them oblivious to aspects of the training data, but they will usually be outperformed by a prior of finite strength that leaves room for learning from experience. That is, the strong form of less-is-more, that one can do better with heuristics by throwing out information rather than using it, is false. The optimal solution always uses all relevant information, but it combines that information with the appropriate prior. In contrast, no amount of data can overcome the heuristics’ inductive biases.
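The continuum idea can be illustrated with a toy Bayesian regression whose prior is centred on equal weights. This is a simplified sketch of my own, not the paper’s construction (which works with priors over cue covariation), and the numbers are made up: a prior strength of zero reproduces ordinary least squares, an effectively infinite prior strength reproduces the equal-weight (tallying-like) model, and intermediate strengths sit in between.

```python
# A toy illustration of a continuum of priors: Bayesian linear regression with
# a Gaussian prior centred on equal weights. Prior strength 0 gives ordinary
# least squares; an effectively infinite prior strength gives the equal-weight
# (tallying-like) model; intermediate strengths sit in between. This is a
# simplified sketch, not the construction used in the paper.

import numpy as np

rng = np.random.default_rng(1)
n, n_cues, noise_sd = 30, 4, 1.0
true_w = np.array([1.8, 1.2, 0.7, 0.3])
X = rng.normal(size=(n, n_cues))
y = X @ true_w + rng.normal(scale=noise_sd, size=n)

prior_mean = np.ones(n_cues)  # the heuristic end of the continuum: all cues equal

def posterior_mean(X, y, prior_mean, prior_strength):
    """Posterior mean of the weights for a prior N(prior_mean, I / prior_strength),
    with unit noise variance; prior_strength 0 is the flat-prior (OLS) limit."""
    A = X.T @ X + prior_strength * np.eye(X.shape[1])
    b = X.T @ y + prior_strength * prior_mean
    return np.linalg.solve(A, b)

for strength in (0.0, 1.0, 10.0, 1e6):
    w = posterior_mean(X, y, prior_mean, strength)
    print(f"prior strength {strength:g}: weights {np.round(w, 2)}")
# strength 0 reproduces OLS; 1e6 is, for practical purposes, the equal-weight model.
```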

So why have heuristics proven so useful? According to this Bayesian argument, it is not due to a “computational advantage of simplicity per se, but rather to the fact that simpler models can approximate strong priors that are well-suited to the true structure of the environment.”

An interesting question from this work is whether our minds use heuristics as a good approximation of complex models, or whether heuristics are good approximations of more complex processes that the mind uses. The authors write:

Although the current contribution is formal in nature, it nevertheless has implications for psychology. In the psychological literature, heuristics have been repeatedly pitted against full-information algorithms that differentially weight the available information or are sensitive to covariation among cues. The current work indicates that the best-performing model will usually lie between the extremes of ordinary linear regression and fast-and-frugal heuristics, i.e., at a prior of intermediate strength. Between these extremes lie a host of models with different sensitivity to cue-outcome correlations in the environment.

One question for future research is whether heuristics give an accurate characterization of psychological processing, or whether actual psychological processing is more akin to these more complex intermediate models. On the one hand, it could be that implementing the intermediate models is computationally intractable, and thus the brain uses heuristics because they efficiently approximate these more optimal models. This case would coincide with the view from the heuristics-and-biases tradition of heuristics as a tradeoff of accuracy for efficiency. On the other hand, it could be that the brain has tractable means for implementing the intermediate models (i.e., for using all available information but down-weighting it appropriately). This case would be congruent with the view from ecological rationality where the brain’s inferential mechanisms are adapted to the statistical structure of the environment. However, this possibility suggests a reinterpretation of the empirical evidence used to support heuristics: heuristics might fit behavioral data well only because they closely mimic a more sophisticated strategy used by the mind.

There have been various recent approaches looking at the compatibility between psychologically plausible processes and probabilistic models of cognition. These investigations are interlinked with our own, and while most of that work has focused on finding algorithms that approximate Bayesian models, we have taken the opposite approach. This contribution reiterates the importance of applying fundamental machine learning concepts to psychological findings. In doing so, we provide a formal understanding of why heuristics can outperform full-information models by placing all models in a common probabilistic inference framework, where heuristics correspond to extreme priors that will usually be outperformed by intermediate models that use all available information.

The (open access) paper contains a lot more detail – and the maths – and I recommend reading it.

My latest in Behavioral Scientist: Simple heuristics that make algorithms smart

My latest contribution at Behavioral Scientist is up. Here’s an excerpt:

Modern discussions of whether humans will be replaced by algorithms typically frame the problem as a choice between humans on one hand or complex statistical and machine learning models on the other. For problems such as image recognition, this is probably the right frame. Yet much of the past success of algorithms relative to human judgment points us to a third option: the mechanical application of simple models and heuristics.

Simple models appear more powerful when removed from the minds of the human and implemented in a consistent way. The chain of evidence that simple heuristics are powerful tools, that humans use these heuristics, and that these heuristics can make us smart does not bring us to a point where these humans are outperforming simple heuristics or models consistently applied by an algorithm.

Humans are inextricably entwined in developing these algorithms, and in many cases provide the expert knowledge of what cues should be used. But when it comes to execution, taking the outputs of the model gives us a better outcome.

You can read the full article here.

A problem in the world or a problem in the model

In reviewing Michael Lewis’s The Undoing Project, John Kay writes:

Since Paul Samuelson’s Foundations of Economic Analysis, published in 1947, mainstream economics has focused on an axiomatic approach to rational behaviour. The overriding requirement is for consistency of choice: if A is chosen when B is available, B will never be selected when A is available. If choices are consistent in this sense, their outcomes can be described as the result of optimisation in the light of a well-defined preference ordering.

In an impressive feat of marketing, economists appropriated the term “rationality” to describe conformity with these axioms. Such consistency is not, however, the everyday meaning of rationality; it is not rational, though it is consistent, to maintain the belief that there are fairies at the bottom of the garden in spite of all evidence to the contrary. …

… In the 1970s, however, Kahneman and Tversky began research that documented extensive inconsistency with those rational choice axioms.

What they did, as is common practice in experimental psychology, was to set puzzles to small groups of students. The students often came up with what the economics of rational choice would describe as the “wrong” answer. These failures of the predictions of the theory clearly demand an explanation. But Lewis—like many others who have written about behavioural economics—does not progress far beyond compiling a list of these so-called “irrationalities.”

This taxonomic approach fails to address crucial issues. Is rational choice theory intended to be positive—a description of how people do in fact behave—or normative—a recommendation as to how they should behave? Since few people would wish to be labelled irrational, the appropriation of the term “rationality” conflates these perspectives from the outset. Do the observations of allegedly persistent irrationality represent a wide-ranging attack on the quality of human decision-making—or a critique of the economist’s concept of rationality? The normal assumption of economists is the former; the failure of observation to correspond with theory identifies a problem in the world, not a problem in the model. Kahneman and Tversky broadly subscribe to that position; their claim is that people—persistently—make stupid mistakes.

I have seen many presentations with an opening line of “economists assume we are rational”, quickly followed by conclusions about poor human decision-making, the two being conflated. More often than not, it’s better to ignore economics as a starting point and simply examine the evidence for poor decision making. That evidence is, of course, much richer – and more debatable – than a simple refutation of the basic axioms of economics.

One of those debates concerns the Linda problem. Kay continues:

Take, for example, the famous “Linda Problem.” As Kahneman frames it: “Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which of the following is more likely? ‘Linda is a bank teller,’ ‘Linda is a bank teller and is active in the feminist movement.’”

The common answer is that the second alternative—that Linda is more likely to be a feminist bank teller than a bank teller—is plainly wrong, because the rules of probability state that a compound probability of two events cannot exceed the probability of either single event. But to the horror of Kahneman and his colleagues, many people continue to assert that the second description is the more likely even after their “error” is pointed out.

But it does not require knowledge of the philosopher Paul Grice’s maxims of conversation—although perhaps it helps—to understand what is going on here. The meaning of discourse depends not just on the words and phrases used, but on their context. The description that begins with Linda’s biography and ends with “Linda is a bank teller” is not, without more information, a satisfactory account. Faced with such a narrative in real life, one would seek further explanation to resolve the apparent incongruity and, absent of such explanation, be reluctant to believe, far less act on, the information presented.

Kahneman and Tversky recognised that we prefer to tell stories than to think in terms of probability. But this should not be assumed to represent a cognitive failure. Storytelling is how we make sense of a complex world of which we often know, and understand, little.

So we should be wary in our interpretation of the findings of behavioural economists. The environment in which these experiments are conducted is highly artificial. A well-defined problem with an identifiable “right” answer is framed in a manner specifically designed to elucidate the “irrationality” of behaviour that the experimenter triumphantly identifies. This is a very different exercise from one which demonstrates that people make persistently bad decisions in real-world situations, where the issues are typically imperfectly defined and where it is often not clear even after the event what the best course of action would have been.
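For reference, the probability rule being invoked is the conjunction rule: the probability of two events occurring together can never exceed the probability of either alone. A toy count with made-up numbers shows why feminist bank tellers cannot be more common than bank tellers:

```python
# The conjunction rule: for any two events, P(A and B) <= P(A), because the
# people who are both bank tellers and feminists are a subset of bank tellers.
# The counts below are made up purely to illustrate the subset relationship.
people = 1000
bank_tellers = 20
feminist_bank_tellers = 15   # necessarily no more than bank_tellers

p_teller = bank_tellers / people
p_feminist_teller = feminist_bank_tellers / people
assert p_feminist_teller <= p_teller   # the conjunction can never be more probable
print(p_teller, p_feminist_teller)     # 0.02 versus 0.015
```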

Kay also touches on the more general criticisms:

Lewis’s uncritical adulation of Kahneman and Tversky gives no credit to either of the main strands of criticism of their work. Many mainstream economists would acknowledge that people do sometimes behave irrationally, but contend that even if such irrationalities are common in the basements of psychology labs, they are sufficiently unimportant in practice to matter for the purposes of economic analysis. At worst, a few tweaks to the standard theory can restore its validity.

From another perspective, it may be argued that persistent irrationalities are perhaps not irrational at all. We cope with an uncertain world, not by attempting to describe it with models whose parameters and relevance we do not know, but by employing practical rules and procedures which seem to work well enough most of the time. The most effective writer in this camp has been the German evolutionary psychologist Gerd Gigerenzer, and the title of one of his books, Simple Heuristics That Make Us Smart, conveys the flavour of his argument. The discovery that these practical rules fail in some stylised experiments tells us little, if anything, about the overall utility of Gigerenzer’s “fast and frugal” rules of behaviour.

Perhaps it is significant that I have heard some mainstream economists dismiss the work of Kahneman in terms not very different from those in which Kahneman reportedly dismisses the work of Gigerenzer. An economic mainstream has come into being in which rational choice modelling has become an ideology rather than an empirical claim about the best ways of explaining the world, and those who dissent are considered not just wrong, but ignorant or malign. An outcome in which people shout at each other from inside their own self-referential communities is not conducive to constructive discourse.