Gary Klein on confirmation bias in heuristics and biases research, and explaining everything

Confirmation bias

In Sources of Power: How People Make Decisions (review coming soon), Gary Klein writes:

Kahneman, Slovic, and Tversky (1982) present a range of studies showing that decision makers use a variety of heuristics, simple procedures that usually produce an answer but are not foolproof. … The research strategy was not to demonstrate how poorly we make judgments but to use these findings to uncover the cognitive processes underlying judgments of likelihood.

Lola Lopes (1991) has shown that the original studies did not demonstrate biases, in the common use of the term. For example, Kahneman and Tversky (1973) used questions such as this: “Consider the letter R. Is R more likely to appear in the first position of a word or the third position of a word?” The example taps into our heuristic of availability. We have an easier time recalling words that begin with R than words with R in the third position. Most people answer that R is more likely to occur in the first position. This is incorrect. It shows how we rely on availability.

Lopes points out that examples such as the one using the letter R were carefully chosen. Of the twenty possible consonants, twelve are more common in the first position. Kahneman and Tversky (1973) used the eight that are more common in the third position. They used stimuli only where the availability heuristic would result in a wrong answer. … [I have posted some extracts of Lopes’s article here.]

There is an irony here. One of the primary “biases” is confirmation bias—the search for information that confirms your hypothesis even though you would learn more by searching for evidence that might disconfirm it. The confirmation bias has been shown in many laboratory studies (and has not been found in a number of studies conducted in natural settings). Yet one of the most common strategies of scientific research is to derive a prediction from a favorite theory and test it to show that it is accurate, thereby strengthening the reputation of that theory. Scientists search for confirmation all the time, even though philosophers of science, such as Karl Popper (1959), have urged scientists to try instead to disconfirm their favorite theories. Researchers working in the heuristics and biases paradigm condemn this sort of bias in their subjects, even as those same researchers perform more laboratory studies confirming their theories.

On explaining everything

On 3 July 1988, a missile fired from the USS Vincennes destroyed a commercial Iran Air flight taking off over the Persian Gulf, killing all on board. The crew of the Vincennes had incorrectly identified the aircraft as an attacking F-14.

Klein writes:

The Fogarty report, the official U.S. Navy analysis of the incident, concluded that “stress, task fixation, an unconscious distortion of data may have played a major role in this incident. [Crew members] became convinced that track 4131 was an Iranian F-14 after receiving the … report of a momentary Mode II. After this report of the Mode II, [a crew member] appear[ed] to have distorted data flow in an unconscious attempt to make available evidence fit a preconceived scenario (‘Scenario fulfillment’).” This explanation seems to fit in with the idea that mental simulation can lead you down a garden path to where you try to explain away inconvenient data. Nevertheless, trained crew members are not supposed to distort unambiguous data. According to the Fogarty report, the crew members were not trying to explain away the data, as in a de minimus explanation. They were flat out distorting the numbers. This conclusion does not feel right.

The conclusion of the Fogarty report was echoed by some members of a five-person panel of leading decision researchers, who were invited to review the evidence and report to a congressional subcommittee. Two members of the panel specifically attributed the mistake to faulty decision making. One described how the mistake seemed to be a clear case of expectancy bias, in which a person sees what he is expecting to see, even when it departs from the actual stimulus. He cited a study by Bruner and Postman (1949) in which subjects were shown brief flashes of playing cards and asked to identify each. When cards such as the Jack of Diamonds were printed in black, subjects would still identify it as the Jack of Diamonds without noticing the distortion. The researcher concluded that the mistake about altitude seemed to match these data; subjects cannot be trusted to make accurate identifications because their expectancies get in the way.

I have talked with this decision researcher, who explained how the whole Vincennes incident showed a Combat Information Center riddled with decision biases. That is not how I understand the incident. My reading of the Fogarty report shows a team of men struggling with an unexpected battle, trying to guess whether an F-14 is coming over to blow them out of the water, waiting until the very last moment for fear of making a mistake, hoping the pilot will heed the radio warnings, accepting the risk to their lives in order to buy some more time.

To consider this alleged expectancy bias more carefully, imagine what would have happened if the Vincennes had not fired and in fact had been attacked by an F-14. The Fogarty report stated that in the Persian Gulf, from June 2, 1988, to July 2, 1988, the U.S. Middle East Forces had issued 150 challenges to aircraft. Of these, it was determined that 83 percent were issued to Iranian military aircraft and only 1.3 percent to aircraft that turned out to be commercial. So we can infer that if a challenge is issued in the gulf, the odds are that the airplane is Iranian military. If we continue with our scenario, that the Vincennes had not fired and had been attacked by an F-14, the decision researchers would have still claimed that it was a clear case of bias, except this time the bias would have been to ignore the base rates, to ignore the expectancies. No one can win. If you act on expectancies and you are wrong, you are guilty of expectancy bias. If you ignore expectancies and are wrong, you are guilty of ignoring base rates and expectancies. This means that the decision bias approach explains too much (Klein, 1989). If an appeal to decision bias can explain everything after the fact, no matter what has happened, then there is no credible explanation.

I’m not sure the right base rate is the proportion of aircraft challenged, but it is still an interesting point.
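For what it is worth, the arithmetic behind Klein's inference is easy to spell out. A minimal sketch using only the Fogarty report figures he quotes (the remaining ~16 percent of challenges presumably went to other categories; the percentages are treated here as the relevant base rates, which, as just noted, is itself debatable):

```python
# Base-rate arithmetic from the Fogarty report figures Klein quotes:
# 150 challenges issued in the Gulf (June 2 - July 2, 1988), 83% to
# Iranian military aircraft, 1.3% to aircraft that turned out to be
# commercial.
challenges = 150
p_military = 0.83
p_commercial = 0.013

military = p_military * challenges      # ~124 of the 150 challenges
commercial = p_commercial * challenges  # ~2 of the 150 challenges

# Implied odds that a challenged aircraft is Iranian military rather
# than commercial: roughly 64 to 1.
print(round(p_military / p_commercial))
```

On these numbers, a crew that challenges an aircraft and gets no satisfactory reply is acting against very long odds if it assumes the aircraft is commercial.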

In contrast to less-is-more claims, ignoring information is rarely, if ever, optimal

From the abstract of an interesting paper Heuristics as Bayesian inference under extreme priors by Paula Parpart and colleagues:

Simple heuristics are often regarded as tractable decision strategies because they ignore a great deal of information in the input data. One puzzle is why heuristics can outperform full-information models, such as linear regression, which make full use of the available information. These “less-is-more” effects, in which a relatively simpler model outperforms a more complex model, are prevalent throughout cognitive science, and are frequently argued to demonstrate an inherent advantage of simplifying computation or ignoring information. In contrast, we show at the computational level (where algorithmic restrictions are set aside) that it is never optimal to discard information. Through a formal Bayesian analysis, we prove that popular heuristics, such as tallying and take-the-best, are formally equivalent to Bayesian inference under the limit of infinitely strong priors. Varying the strength of the prior yields a continuum of Bayesian models with the heuristics at one end and ordinary regression at the other. Critically, intermediate models perform better across all our simulations, suggesting that down-weighting information with the appropriate prior is preferable to entirely ignoring it. Rather than because of their simplicity, our analyses suggest heuristics perform well because they implement strong priors that approximate the actual structure of the environment.

The following excerpts from the paper (minus references) help give more context to this argument. First, what is meant by a simple heuristic as opposed to a full-information model?

Many real-world prediction problems involve binary classification based on available information, such as predicting whether Germany or England will win a soccer match based on the teams’ statistics. A relatively simple decision procedure would use a rule to combine available information (i.e., cues), such as the teams’ league position, the result of the last game between Germany and England, which team has scored more goals recently, and which team is home versus away. One such decision procedure, the tallying heuristic, simply checks which team is better on each cue and chooses the team that has more cues in its favor, ignoring any possible differences among cues in magnitude or predictive value. … Another algorithm, take-the-best (TTB), would base the decision on the best single cue that differentiates the two options. TTB works by ranking the cues according to their cue validity (i.e., predictive value), then sequentially proceeding from the most valid to least valid until a cue is found that favors one team over the other. Thus TTB terminates at the first discriminative cue, discarding all remaining cues.

In contrast to these heuristic algorithms, a full-information model such as linear regression would make use of all the cues, their magnitudes, their predictive values, and observed covariation among them. For example, league position and number of goals scored are highly correlated, and this correlation influences the weights obtained from a regression model.
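To make the contrast concrete, the two heuristics can be sketched in a few lines. The cue coding and values below are hypothetical, not taken from the paper: each cue is +1 if it favours team A, −1 if it favours team B, and 0 if it does not discriminate.

```python
def tallying(cues):
    """Choose the option favoured by more cues, ignoring any differences
    among cues in magnitude or predictive value."""
    total = sum(cues)
    return "A" if total > 0 else "B" if total < 0 else "tie"

def take_the_best(cues_by_validity):
    """Decide on the single best cue that discriminates.

    `cues_by_validity` must already be sorted from most to least valid;
    TTB stops at the first discriminating cue and discards the rest.
    """
    for cue in cues_by_validity:
        if cue != 0:
            return "A" if cue > 0 else "B"
    return "tie"

# Hypothetical cues, ordered by validity: league position, last
# head-to-head result, recent goals, home advantage.
cues = [-1, +1, +1, +1]

print(tallying(cues))        # three of four cues favour A -> "A"
print(take_the_best(cues))   # the single most valid cue favours B -> "B"
```

Note how the same cues can yield different answers: tallying counts all four cues, while take-the-best commits to the most valid one and never looks at the other three.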

So why might less be more?

Heuristics have a long history of study in cognitive science, where they are often viewed as more psychologically plausible than full-information models, because ignoring data makes the calculation easier and thus may be more compatible with inherent cognitive limitations. This view suggests that heuristics should underperform full-information models, with the loss in performance compensated by reduced computational cost. This prediction is challenged by observations of less-is-more effects, wherein heuristics sometimes outperform full-information models, such as linear regression, in real-world prediction tasks. These findings have been used to argue that ignoring information can actually improve performance, even in the absence of processing limitations. … Gigerenzer and Brighton (2009) conclude, “A less-is-more effect … means that minds would not gain anything from relying on complex strategies, even if direct costs and opportunity costs were zero”.

Less-is-more arguments also arise in other domains of cognitive science, such as in claims that learning is more successful when processing capacity is (at least initially) restricted.

The current explanation for less-is-more effects in the heuristics literature is based on the bias-variance dilemma. … From a statistical perspective, every model, including heuristics, has an inductive bias, which makes it best-suited to certain learning problems. A model’s bias and the training data are responsible for what the model learns. In addition to differing in bias, models can also differ in how sensitive they are to sampling variability in the training data, which is reflected in the variance of the model’s parameters after training (i.e., across different training samples).

A core tool in machine learning and psychology for evaluating the performance of learning models, cross-validation, assesses how well a model can apply what it has learned from past experiences (i.e., the training data) to novel test cases. From a psychological standpoint, a model’s cross-validation performance can be understood as its ability to generalize from past experience to guide future behavior. How well a model classifies test cases in cross-validation is jointly determined by its bias and variance. Higher flexibility can in fact hurt performance because it makes the model more sensitive to the idiosyncrasies of the training sample. This phenomenon, commonly referred to as overfitting, is characterized by high performance on experienced cases from the training sample but poor performance on novel test items. …

Bias and variance tend to trade off with one another such that models with low bias suffer from high variance and vice versa. With small training samples, more flexible (i.e., less biased) models will overfit and can be bested by simpler (i.e., more biased) models such as heuristics. As the size of the training sample increases, variance becomes less influential and the advantage shifts to the complex models.
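This trade-off shows up readily in a small simulation. The environment below (cue weights, noise level, sample sizes) is an invented illustration, not one of the paper's experiments: a tallying-style unit-weight model (high bias, low variance) is cross-validated against ordinary least squares (low bias, high variance) on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, 0.8, 0.6, 0.4, 0.2])  # hypothetical cue weights
noise = 2.0

def sample(n):
    X = rng.normal(size=(n, len(true_w)))
    return X, X @ true_w + rng.normal(scale=noise, size=n)

def fit_ols(X, y):
    """Full-information model: ordinary least squares weights."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fit_tally(X, y):
    """Tallying-style model: unit weights, learning only the sign of
    each cue-outcome correlation."""
    return np.array([np.sign(np.corrcoef(X[:, j], y)[0, 1])
                     for j in range(X.shape[1])])

X_test, y_test = sample(5000)

def mean_test_mse(fit, n, reps=200):
    """Average held-out error over many training samples of size n."""
    errs = []
    for _ in range(reps):
        X, y = sample(n)
        errs.append(np.mean((X_test @ fit(X, y) - y_test) ** 2))
    return float(np.mean(errs))

for n in (8, 500):
    print(n, mean_test_mse(fit_ols, n), mean_test_mse(fit_tally, n))
# With tiny training samples (n = 8) the biased tallying model typically
# wins; with plentiful data (n = 500) the advantage shifts to least squares.
```

The pattern matches the passage above: with little data, least squares overfits the idiosyncrasies of the sample and the "dumber" equal-weights model generalizes better; with enough data, variance shrinks and the flexible model pulls ahead.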

So what is an alternative explanation to the performance of heuristics?

The Bayesian framework offers a different perspective on the bias-variance dilemma. Provided a Bayesian model is correctly specified, it always integrates new data optimally, striking the perfect balance between prior and data. Thus using more information can only improve performance. From the Bayesian standpoint, a less-is-more effect can arise only if a model uses the data incorrectly, for example by weighting it too heavily relative to prior knowledge (e.g., with ordinary linear regression, where there effectively is no prior). In that case, the data might indeed increase estimation variance to the point that ignoring some of the information could improve performance. However, that can never be the best solution. One can always obtain superior predictive performance by using all of the information but tempering it with the appropriate prior.

Heuristics may work well in practice because they correspond to infinitely strong priors that make them oblivious to aspects of the training data, but they will usually be outperformed by a prior of finite strength that leaves room for learning from experience. That is, the strong form of less-is-more, that one can do better with heuristics by throwing out information rather than using it, is false. The optimal solution always uses all relevant information, but it combines that information with the appropriate prior. In contrast, no amount of data can overcome the heuristics’ inductive biases.
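One way to see this continuum is a penalized regression whose prior mean is the tallying solution (unit weights with learned signs). This is an illustrative reconstruction of the idea, not the paper's exact model: with prior strength lam = 0 it reduces to ordinary least squares, as lam grows it shrinks toward the heuristic, and in the limit it ignores the data entirely.

```python
import numpy as np

def shrunk_weights(X, y, lam):
    """Posterior-mean-style estimate shrinking toward unit weights.

    Minimizes ||y - Xw||^2 + lam * ||w - s||^2, where s is the tallying
    prior mean (unit weights with signs taken from the cue-outcome
    covariance). lam = 0 gives ordinary least squares; lam -> infinity
    gives the tallying solution itself.
    """
    p = X.shape[1]
    s = np.sign(X.T @ y)  # prior mean: unit weights with learned signs
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y + lam * s)

# Toy data (invented for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, 0.5, 0.25]) + rng.normal(scale=1.0, size=20)

for lam in (0.0, 10.0, 1e6):
    print(lam, np.round(shrunk_weights(X, y, lam), 3))
# lam = 0 reproduces least squares; lam = 1e6 is (numerically) the
# unit-weight tallying solution; intermediate lam lies between the two.
```

The intermediate settings are the models the paper argues usually perform best: the data are down-weighted by the prior, not discarded.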

So why have heuristics proven to be so useful? According to this Bayesian argument, it is not due to a “computational advantage of simplicity per se, but rather to the fact that simpler models can approximate strong priors that are well-suited to the true structure of the environment.”

An interesting question from this work is whether our minds use heuristics as a good approximation of complex models, or whether heuristics are good approximations of more complex processes that the mind uses. The authors write:

Although the current contribution is formal in nature, it nevertheless has implications for psychology. In the psychological literature, heuristics have been repeatedly pitted against full-information algorithms that differentially weight the available information or are sensitive to covariation among cues. The current work indicates that the best-performing model will usually lie between the extremes of ordinary linear regression and fast-and-frugal heuristics, i.e., at a prior of intermediate strength. Between these extremes lie a host of models with different sensitivity to cue-outcome correlations in the environment.

One question for future research is whether heuristics give an accurate characterization of psychological processing, or whether actual psychological processing is more akin to these more complex intermediate models. On the one hand, it could be that implementing the intermediate models is computationally intractable, and thus the brain uses heuristics because they efficiently approximate these more optimal models. This case would coincide with the view from the heuristics-and-biases tradition of heuristics as a tradeoff of accuracy for efficiency. On the other hand, it could be that the brain has tractable means for implementing the intermediate models (i.e., for using all available information but down-weighting it appropriately). This case would be congruent with the view from ecological rationality where the brain’s inferential mechanisms are adapted to the statistical structure of the environment. However, this possibility suggests a reinterpretation of the empirical evidence used to support heuristics: heuristics might fit behavioral data well only because they closely mimic a more sophisticated strategy used by the mind.

There have been various recent approaches looking at the compatibility between psychologically plausible processes and probabilistic models of cognition. These investigations are interlinked with our own, and while most of that work has focused on finding algorithms that approximate Bayesian models, we have taken the opposite approach. This contribution reiterates the importance of applying fundamental machine learning concepts to psychological findings. In doing so, we provide a formal understanding of why heuristics can outperform full-information models by placing all models in a common probabilistic inference framework, where heuristics correspond to extreme priors that will usually be outperformed by intermediate models that use all available information.

The (open access) paper contains a lot more detail – and the maths – and I recommend reading it.

My latest in Behavioral Scientist: Simple heuristics that make algorithms smart

My latest contribution at Behavioral Scientist is up. Here’s an excerpt:

Modern discussions of whether humans will be replaced by algorithms typically frame the problem as a choice between humans on one hand or complex statistical and machine learning models on the other. For problems such as image recognition, this is probably the right frame. Yet much of the past success of algorithms relative to human judgment points us to a third option: the mechanical application of simple models and heuristics.

Simple models appear more powerful when removed from the minds of the human and implemented in a consistent way. The chain of evidence that simple heuristics are powerful tools, that humans use these heuristics, and that these heuristics can make us smart does not bring us to a point where these humans are outperforming simple heuristics or models consistently applied by an algorithm.

Humans are inextricably entwined in developing these algorithms, and in many cases provide the expert knowledge of what cues should be used. But when it comes to execution, taking the outputs of the model gives us a better outcome.

You can read the full article here.

A problem in the world or a problem in the model

In reviewing Michael Lewis’s The Undoing Project, John Kay writes:

Since Paul Samuelson’s Foundations of Economic Analysis, published in 1947, mainstream economics has focused on an axiomatic approach to rational behaviour. The overriding requirement is for consistency of choice: if A is chosen when B is available, B will never be selected when A is available. If choices are consistent in this sense, their outcomes can be described as the result of optimisation in the light of a well-defined preference ordering.

In an impressive feat of marketing, economists appropriated the term “rationality” to describe conformity with these axioms. Such consistency is not, however, the everyday meaning of rationality; it is not rational, though it is consistent, to maintain the belief that there are fairies at the bottom of the garden in spite of all evidence to the contrary. …

… In the 1970s, however, Kahneman and Tversky began research that documented extensive inconsistency with those rational choice axioms.

What they did, as is common practice in experimental psychology, was to set puzzles to small groups of students. The students often came up with what the economics of rational choice would describe as the “wrong” answer. These failures of the predictions of the theory clearly demand an explanation. But Lewis—like many others who have written about behavioural economics—does not progress far beyond compiling a list of these so-called “irrationalities.”

This taxonomic approach fails to address crucial issues. Is rational choice theory intended to be positive—a description of how people do in fact behave—or normative—a recommendation as to how they should behave? Since few people would wish to be labelled irrational, the appropriation of the term “rationality” conflates these perspectives from the outset. Do the observations of allegedly persistent irrationality represent a wide-ranging attack on the quality of human decision-making—or a critique of the economist’s concept of rationality? The normal assumption of economists is the former; the failure of observation to correspond with theory identifies a problem in the world, not a problem in the model. Kahneman and Tversky broadly subscribe to that position; their claim is that people—persistently—make stupid mistakes.

I have seen many presentations with an opening line of “economists assume we are rational”, quickly followed by conclusions about poor human decision-making, the two being conflated. More often than not, it’s better to ignore economics as a starting point and simply examine the evidence for poor decision-making. That evidence is, of course, much richer – and more debatable – than a simple refutation of the basic axioms of economics.

One of those debates concerns the Linda problem. Kay continues:

Take, for example, the famous “Linda Problem.” As Kahneman frames it: “Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which of the following is more likely? ‘Linda is a bank teller,’ ‘Linda is a bank teller and is active in the feminist movement.’”

The common answer—that the second alternative is the more likely, i.e. that Linda is more likely to be a feminist bank teller than a bank teller—is plainly wrong, because the rules of probability state that a compound probability of two events cannot exceed the probability of either single event. But to the horror of Kahneman and his colleagues, many people continue to assert that the second description is the more likely even after their “error” is pointed out.
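The conjunction rule at issue is easy to verify directly: feminist bank tellers are a subset of bank tellers, so whatever counts one assumes (the numbers below are invented for illustration), the conjunction can never come out as the more probable.

```python
# Toy population illustrating the conjunction rule behind the Linda
# problem: for any events A and B, P(A and B) <= P(A). All counts here
# are hypothetical.
population = 1000
bank_tellers = 50
feminist_bank_tellers = 40   # necessarily a subset of the bank tellers

p_teller = bank_tellers / population
p_feminist_teller = feminist_bank_tellers / population

# However representative the description of Linda, the conjunction
# cannot be the more probable of the two alternatives.
assert p_feminist_teller <= p_teller
print(p_teller, p_feminist_teller)  # -> 0.05 0.04
```

Kay's point, of course, is not that the rule is wrong but that the conversational framing of the question invites an answer the rule was never meant to grade.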

But it does not require knowledge of the philosopher Paul Grice’s maxims of conversation—although perhaps it helps—to understand what is going on here. The meaning of discourse depends not just on the words and phrases used, but on their context. The description that begins with Linda’s biography and ends with “Linda is a bank teller” is not, without more information, a satisfactory account. Faced with such a narrative in real life, one would seek further explanation to resolve the apparent incongruity and, absent of such explanation, be reluctant to believe, far less act on, the information presented.

Kahneman and Tversky recognised that we prefer to tell stories than to think in terms of probability. But this should not be assumed to represent a cognitive failure. Storytelling is how we make sense of a complex world of which we often know, and understand, little.

So we should be wary in our interpretation of the findings of behavioural economists. The environment in which these experiments are conducted is highly artificial. A well-defined problem with an identifiable “right” answer is framed in a manner specifically designed to elucidate the “irrationality” of behaviour that the experimenter triumphantly identifies. This is a very different exercise from one which demonstrates that people make persistently bad decisions in real-world situations, where the issues are typically imperfectly defined and where it is often not clear even after the event what the best course of action would have been.

Kay also touches on the more general criticisms:

Lewis’s uncritical adulation of Kahneman and Tversky gives no credit to either of the main strands of criticism of their work. Many mainstream economists would acknowledge that people do sometimes behave irrationally, but contend that even if such irrationalities are common in the basements of psychology labs, they are sufficiently unimportant in practice to matter for the purposes of economic analysis. At worst, a few tweaks to the standard theory can restore its validity.

From another perspective, it may be argued that persistent irrationalities are perhaps not irrational at all. We cope with an uncertain world, not by attempting to describe it with models whose parameters and relevance we do not know, but by employing practical rules and procedures which seem to work well enough most of the time. The most effective writer in this camp has been the German evolutionary psychologist Gerd Gigerenzer, and the title of one of his books, Simple Heuristics That Make Us Smart, conveys the flavour of his argument. The discovery that these practical rules fail in some stylised experiments tells us little, if anything, about the overall utility of Gigerenzer’s “fast and frugal” rules of behaviour.

Perhaps it is significant that I have heard some mainstream economists dismiss the work of Kahneman in terms not very different from those in which Kahneman reportedly dismisses the work of Gigerenzer. An economic mainstream has come into being in which rational choice modelling has become an ideology rather than an empirical claim about the best ways of explaining the world, and those who dissent are considered not just wrong, but ignorant or malign. An outcome in which people shout at each other from inside their own self-referential communities is not conducive to constructive discourse.

The Rhetoric of Irrationality

From the opening of Lola Lopes’s 1991 article The Rhetoric of Irrationality (pdf) on the heuristics and biases literature:

Not long ago, Newsweek ran a feature article describing how researchers at a major midwestern business school are exploring the process of choice in hopes of helping business executives and business students improve their ‘often rudimentary decision-making skills’

[T]he researchers have, in the author’s words, ‘sadly’ concluded that ‘most people’ are ‘woefully muddled information processors who stumble along ill-chosen shortcuts to reach bad conclusions’. Poor ‘saps’ and ‘suckers’ that we are, a list of our typical decision flaws would be so lengthy as to ‘demoralize’ Solomon.

This is a powerful message, sweeping in its generality and heavy in its social and political implications. It is also a strange message, for it concerns something that we might suppose could not be meaningfully studied in the laboratory, that being the fundamental adequacy or inadequacy of people’s capacity to choose and plan wisely in everyday life. Nonetheless, the message did originate in the laboratory, in studies that have no greater claim to relevance than hundreds of others that are published yearly in scholarly journals. My goal in this article is to trace how this message of irrationality has been selected out of the literature and how it has been changed and amplified in passing through the logical and expository layers that exist between experimental conception and popularization.

Below are some of the more interesting passages. First:

Prior to 1970 or so, most researchers in judgment and decision-making believed that people are pretty good decision-makers. In fact, the most frequently cited summary paper of that era was titled ‘Man as an intuitive statistician’ (Peterson & Beach, 1967). Since then, however, opinion has taken a decided turn for the worse, though the decline was not in any sense demanded by experimental results. Subjects did not suddenly become any less adept at experimental tasks nor did experimentalists begin to grade their performance against a tougher standard. Instead, researchers began selectively to emphasize some results at the expense of others.

The Science article [Kahneman and Tversky’s 1974 article (pdf)] is the primary conduit through which the laboratory results made their way out of psychology and into other branches of the social sciences. … About 20 percent of the citations were in sources outside psychology. Of these, all used the citation to support the unqualified claim that people are poor decision-makers.

Acceptance of this sort is not the norm for psychological research. Scholars from other fields in the social sciences such as sociology, political science, law, economics, business and anthropology look with suspicion on the tightly controlled experimental tasks that psychologists study in the laboratories, particularly when the studies are carried out using student volunteers. In the case of the biases and heuristics literature, however, the issue of generalizability is seldom raised and it is rarely so much as mentioned that the cited conclusions are based on laboratory research. Human incompetence is presented as a fact, like gravity.

If you think of it, this is a great trick, for the studies in question have managed to shed their experimental details without sacrificing scientific authority. Somehow the message of irrationality has been sprung free of its factual supports, allowing it to be seen entire, unobstructed by the hopeful assumptions and tedious methodologies that brace up all laboratory research.

One interesting thread concerns the purpose of the experiments and the contrasting conclusions drawn from them. For this discussion, Lopes looks at six of the experiments in four of Kahneman and Tversky’s papers published between 1971 and 1973, plus a summary article in Science from 1974. One example involved this question:

Consider the letter R. Is R more likely to appear in the first position of a word or the third position of a word?

This problem involves the availability heuristic, the tendency to estimate the probability of an event by the ease with which instances of the event can be remembered or constructed in the imagination. Under the availability hypothesis, people gauge how easily they can generate words with R in the first or third position. It is easier to think of words with R in the first position than the third, leading them to conclude – in error – that R is more common in the first.

Lopes writes:

[T]he question is posed so that there are only two possible results. One of these will occur if the subject reasons in accord with probability theory, and the other, if the subject reasons heuristically. …

By this logic, the implications of Figure 1 [a summary of the results] are clear: subjects reason heuristically and not according to probability theory. That is the result, signed, sealed and delivered, courtesy of strong inference. But the main contribution of the research is not this result since few would have supposed that naive people know much about combinations or variances of binomial proportions or how often R appears in the third position of words. Instead, the research commands attention and respect because the various problems function as thought experiments, strengthening our grasp of the task domain by revealing critical psychological variables that do not show up in the normative analysis. …

There is, however, another way to construe this set of studies and that is by considering the predictions of the two processing modes at a higher level of abstraction. If we think about performance in terms of correctness, we see that in every case the probability mode predicts correct answers and the heuristic mode predicts errors. … [T]he sheer weight of all the wrong answers tend to deform the basic conclusion, bending it away from an evaluatively neutral description of the process and toward something more like ‘people use heuristics to judge probabilities and they are wrong’, or even ‘people make mistakes when they judge probabilities because they use heuristics’.

Happily, conclusions like these do not hold up. This is because the tuning that is necessary for constructing problems that allow strong inference on processing questions is systematically misleading when it comes to asking evaluative questions. For example, consider the letter R problem. Why was R chosen for study and not, say, B? … Of the 20 possible consonants, 12 are more common in the first position and 8 are more common in the third position. All of the consonants that Kahneman and Tversky studied were taken from the third-position group even though there are more consonants in the first-position group.

The selection of consonants was not malicious. Their use is dictated by the strong inference logic since only they yield unambiguous answers to the processing question. In other words, when a subject says that R occurs more frequently in the first position, we know that he or she must be basing the judgment on availability, since the actual frequency information would lead to the opposite conclusion. Had we used B, instead, and had the subject also judged it to occur more often in the first position, we would not be able to tell whether the judgment reflected availability or factual knowledge since B is, in fact, more likely to occur in the first position.

We see, then, that the experimental logic constrains the interpretation of the data. We can conclude that people use heuristics instead of probability theory but we cannot conclude that their judgments are generally poor. All the same, it is the latter, unwarranted conclusion that is most often conveyed by this literature, particularly in settings outside psychology.
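Lopes’s point about stimulus selection is easy to verify on any word list: whether a letter is more common in the first or third position is an empirical fact of the corpus, and the choice of stimuli determines what the “correct” answer is. A toy sketch (the word list below is made up purely for illustration):

```python
# Count how often a letter appears in the first vs third position of a word
# list. The list here is invented for illustration; with a real corpus you
# could check which consonants fall into each of Lopes's two groups.

WORDS = ["road", "river", "care", "arrow", "born", "rust",
         "word", "fire", "park", "rain", "very", "torch"]

def position_counts(letter, words):
    """Return (first-position count, third-position count) for `letter`."""
    first = sum(1 for w in words if len(w) >= 1 and w[0] == letter)
    third = sum(1 for w in words if len(w) >= 3 and w[2] == letter)
    return first, third

first, third = position_counts("r", WORDS)
# In this toy list, R appears 4 times in the first position and 8 times in
# the third - a list where the availability heuristic gives the wrong answer.
```

On a different word list the counts could easily go the other way, which is exactly why the choice of letters (and corpora) constrains what the experiment can show.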

Lopes then turns her attention to Kahneman and Tversky’s famous Science article.

In the original experimental reports, there is plenty of language to suggest that human judgments are often wrong, but the exposition focuses mostly on the delineation of process. In the Science article, however, Tversky and Kahneman (1974) shift their attention from heuristic processing to biased processing. In the introduction they tell us: ‘This article shows that people rely on a limited number of heuristic principles which reduce the complex tasks of assessing probabilities and predicting values to simpler judgmental operations’ (p. 1124). By the time we get to the discussion, however, the emphasis has changed. Now they say: ‘This article has been concerned with cognitive biases that stem from the reliance on judgmental heuristics’ (p. 1130).

Examination of the body of the paper shows that the retrospective account is the correct one: the paper is more concerned with biases than with heuristics even though the experiments bear more on heuristics than on biases.

There is plenty more of interest in Lopes’s article. I recommend reading the full article (pdf).

Genoeconomics and designer babies: The rise of the polygenic score

When genome-wide association studies (GWAS) were first used to study complex polygenic traits, the results were underwhelming. Few genes with any predictive power were found, and those that were found typically explained only a fraction of the genetic effects that twin studies suggested were there.

This led to divergent responses, ranging from continued resistance to the idea that genes affect anything, to a quiet confidence that once sample sizes became large enough those genetic effects would be found.

Increasingly large samples are now showing that the quiet confidence was justified, with a steady flow of papers finding material genetic effects on traits including educational attainment, intelligence and height.

One source of this work is the “genoeconomists”. From Jacob Ward in the New York Times:

Once a G.W.A.S. shows genetic effects across a group, a “polygenic score” can be assigned to individuals, summarizing the genetic patterns that correlate to outcomes found in the group. Although no one genetic marker might predict anything, this combined score based on the entire genome can be a predictor of all sorts of things. And here’s why it’s so useful: People outside that sample can then have their DNA screened, and are assigned their own polygenic score, and the predictions tend to carry over. This, Benjamin realized, was the sort of statistical tool an economist could use.

As an economist, however, Benjamin wasn’t interested in medical outcomes. He wanted to see if our genes predict social outcomes.

In 2011, with a grant from the National Science Foundation, Benjamin launched the Social Science Genetic Association Consortium, an unprecedented effort to gather unconnected genetic databases into one enormous sample that could be studied by researchers from outside the world of genetic science. In July 2018, Benjamin and four senior co-authors, drawing on that database, published a landmark study in Nature Genetics. More than 80 authors from more than 50 institutions, including the private company 23andMe, gathered and studied the DNA of over 1.1 million people. It was the largest genetics study ever published, and the subject was not height or heart disease, but how far we go in school.

The researchers assigned each participant a polygenic score based on how broad genetic variations correlated with what’s called “educational attainment.” (They chose it because intake forms in medical offices tend to ask patients what education they’ve completed.) The predictive power of the polygenic score was very small — it predicts more accurately than the parents’ income level, but not as accurately as the parents’ own level of educational attainment — and it’s useless for making individual predictions.
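The mechanics Ward describes can be made concrete with a small sketch. A polygenic score is essentially a weighted sum: for each person, multiply their allele count (0, 1 or 2) at each genotyped variant by that variant’s GWAS effect estimate, and add the results. The numbers below are invented for illustration; real scores involve hundreds of thousands of variants.

```python
import numpy as np

# score_i = sum_j (beta_j * g_ij), where g_ij is person i's allele count
# (0, 1 or 2) at variant j and beta_j is the GWAS effect estimate.
# All figures below are hypothetical.

effect_sizes = np.array([0.02, -0.01, 0.005, 0.03])  # invented GWAS betas
genotypes = np.array([
    [2, 0, 1, 1],   # person A's allele counts at each variant
    [0, 2, 2, 0],   # person B's allele counts
])

polygenic_scores = genotypes @ effect_sizes
# person A: 2*0.02 + 1*0.005 + 1*0.03 = 0.075
# person B: 2*(-0.01) + 2*0.005      = -0.01
```

This is also why the scores “carry over”: anyone whose DNA is screened at the same variants can be run through the same weighted sum.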

One of the most interesting possibilities is using polygenic scores to control for heterogeneity in research subjects. Ward writes:

Several researchers involved in the project mentioned to me the possibility of using polygenic scores to sharpen the results of studies like the ongoing Perry Preschool Project, which, starting in the early 1960s, began tracking 123 preschool students and suggested that early education plays a large role in determining a child’s success in school and life. Benjamin and other co-authors say that perhaps sampling the DNA of the Perry Preschool participants could improve the accuracy of the findings, by controlling for those in the group that were genetically predisposed to go further in school.

In a world with easy access to genetic samples, it could become common to include genetic controls in analysis of interesting societal outcomes, in the same way we now control for parental traits.
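As a rough sketch of what such a genetic control might look like: a regression of an outcome on a treatment indicator, with the polygenic score included as a covariate, just as parental traits are included today. Everything below is simulated; none of it reflects the actual Perry Preschool data.

```python
import numpy as np

# Simulated illustration of including a polygenic score as a control.
rng = np.random.default_rng(0)
n = 500
pgs = rng.normal(size=n)                # standardised polygenic score
treated = rng.integers(0, 2, size=n)    # e.g. an early-education programme
outcome = 0.5 * treated + 0.8 * pgs + rng.normal(size=n)

# OLS with an intercept, the treatment dummy, and the genetic control
X = np.column_stack([np.ones(n), treated, pgs])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
treatment_effect = beta[1]  # recovers roughly the true 0.5 effect
```

Controlling for the score soaks up outcome variation that is genetic rather than caused by the programme, sharpening the estimate of the programme’s effect.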

A couple of times in the article, Ward notes that “scores aren’t individually predictive”. He writes that “The predictive power of the polygenic score was very small — it predicts more accurately than the parents’ income level, but not as accurately as the parents’ own level of educational attainment — and it’s useless for making individual predictions.”

I’m not sure what Ward’s definition of “predictive” is for an individual, but take this example from the article:

The authors calculated, for instance, that those in the top fifth of polygenic scores had a 57 percent chance of earning a four-year degree, while those in the bottom fifth had a 12 percent chance. And with that degree of correlation, the authors wrote, polygenic scores can improve the accuracy of other studies of education.

That looks like predictive power to me. Take an individual from the sample or an equivalent population, look at their polygenic score, and assign a probability that they will obtain a four-year degree.

I recommend reading the whole article.

A related story getting ample press is that Genomic Prediction has started to offer intelligence screening for embryos. Polygenic scores have been used successfully in livestock breeding for a while now – often a better place to look for evidence of future possibilities than the objections of those afraid of the human implications of genetic research. From The Guardian:

The company says it is only offering such testing to spot embryos with an IQ low enough to be classed as a disability, and won’t conduct analyses for high IQ. But the technology the company is using will permit that in principle, and co-founder Stephen Hsu, who has long advocated for the prediction of traits from genes, is quoted as saying: “If we don’t do it, some other company will.”

The development must be set, too, against what is already possible and permitted in IVF embryo screening. The procedure called pre-implantation genetic diagnosis (PGD) involves extracting cells from embryos at a very early stage and “reading” their genomes before choosing which to implant. It has been enabled by rapid advances in genome-sequencing technology, making the process fast and relatively cheap. In the UK, PGD is strictly regulated by the Human Fertilisation and Embryology Authority (HFEA), which permits its use to identify embryos with several hundred rare genetic diseases of which the parents are known to be carriers. PGD for other purposes is illegal.

In the US it’s a very different picture. Restrictive laws about what can be done in embryo and stem-cell research using federal funding sit alongside a largely unregulated, laissez-faire private sector, including IVF clinics. PGD to select an embryo’s sex for “family balancing” is permitted, for example. There is nothing in US law to prevent PGD for selecting embryos with “high IQ”.

Ball, the author of the Guardian article, also expresses scepticism about the value of the polygenic scores:

These relationships are, however, statistical. If you have a polygenic score that places you in the top 10% of academic achievers, that doesn’t mean you will ace your exams without effort. Even setting aside the substantial proportion of intelligence (typically around 50%) that seems to be due to the environment and not inherited, there are wide variations for a given polygenic score, one reason being that there’s plenty of unpredictability in brain wiring during growth and development.

So the service offered by Genomic Prediction, while it might help to spot extreme low-IQ outliers, is of very limited value for predicting which of several “normal” embryos will be smartest. Imagine, though, the misplaced burden of expectation on a child “selected” to be bright who doesn’t live up to it. If embryo selection for high IQ goes ahead, this will happen.

Despite Ball’s scepticism about comparing “normal” embryos, I expect it won’t be long before Genomic Prediction or a counterpart is doing just that.

Steve Hsu, co-founder of Genomic Prediction, comments on the press here (and provides some links to other articles). He closes by saying:

“Expert” opinion seems to have evolved as follows:

1. Of course babies can’t be “designed” because genes don’t really affect anything — we’re all products of our environment!

2. Gulp, even if genes do affect things it’s much too complicated to ever figure out!

3. Anyone who wants to use this technology (hmm… it works) needs to tread carefully, and to seriously consider the ethical issues.

Only point 3 is actually correct, although there are still plenty of people who believe 1 and 2 :-(

How happy is a paraplegic a year after losing the use of their legs?

From Dan Gilbert’s 2004 TED talk, now viewed over 16 million times:

Let’s see how your experience simulators are working. Let’s just run a quick diagnostic before I proceed with the rest of the talk. Here’s two different futures that I invite you to contemplate. You can try to simulate them and tell me which one you think you might prefer. One of them is winning the lottery. This is about 314 million dollars. And the other is becoming paraplegic.

Just give it a moment of thought. You probably don’t feel like you need a moment of thought.

Interestingly, there are data on these two groups of people, data on how happy they are. And this is exactly what you expected, isn’t it? But these aren’t the data. I made these up!

These are the data. You failed the pop quiz, and you’re hardly five minutes into the lecture. Because the fact is that a year after losing the use of their legs, and a year after winning the lotto, lottery winners and paraplegics are equally happy with their lives.

And here’s Dan Gilbert reflecting on this statement 10 years later:

The first mistake occurred when I misstated the facts about the 1978 study by Brickman, Coates and Janoff-Bulman on lottery winners and paraplegics.

At 2:54 I said, “… a year after losing the use of their legs, and a year after winning the lotto, lottery winners and paraplegics are equally happy with their lives.” In fact, the two groups were not equally happy: Although the lottery winners (M=4.00) were no happier than controls (M=3.82), both lottery winner and controls were slightly happier than paraplegics (M=2.96).

So why has this study become the poster child for the concept of hedonic adaptation? First, most of us would expect lottery winners to be much happier than controls, and they weren’t. Second, most of us would expect paraplegics to be wildly less happy than either controls or lottery winners, and in fact they were only slightly less happy (though it is admittedly difficult to interpret numerical differences on rating scales like the ones used in this study). As the authors of the paper noted, “In general, lottery winners rated winning the lottery as a highly positive event, and paraplegics rated their accident as a highly negative event, though neither outcome was rated as extremely as might have been expected.” Almost 40 years later, I suspect that most psychologists would agree that this study produced rather weak and inconclusive findings, but that the point it made about the unanticipated power of hedonic adaptation has now been confirmed by many more powerful and methodologically superior studies. You can read the original study here.

It’s great that he is able to step back and admit his mistakes. One thing that perplexes me, however, is that he purports to show the real data on a slide:


As you can see, this chart runs on a scale reaching up to 70, with both groups measured at around 50. The actual measure was on a 5-point scale. Where did these numbers come from? Did Gilbert simply make these data up?

If this were just a case of misstating the point of the study, I would feel much sympathy. As he states:

When I gave this talk in 2004, the idea that videos might someday be “posted on the internet” seemed rather remote. There was no Netflix or YouTube, and indeed, it would be two years before the first TED Talk was put online. So I thought I was speaking to a small group of people who’d come to a relatively unknown conference in Monterey, California, and had I realized that ten years later more than 8 million people would have heard what I said that day, I would have (a) rehearsed and (b) dressed better.

That’s a lie. I never dress better. But I would have rehearsed. Back then, TED talks were considerably less important events and therefore a lot more improvisational, so I just grabbed some PowerPoint slides from previous lectures, rearranged them on the airplane to California, and then took the stage and winged it. I had no idea that on that day I was delivering the most important lecture of my life.

But if that chart was made up, my sympathy somewhat fades away.

How likely is “likely”?

From Andrew Mauboussin and Michael Mauboussin:

In a famous example (at least, it’s famous if you’re into this kind of thing), in March 1951, the CIA’s Office of National Estimates published a document suggesting that a Soviet attack on Yugoslavia within the year was a “serious possibility.” Sherman Kent, a professor of history at Yale who was called to Washington, D.C. to co-run the Office of National Estimates, was puzzled about what, exactly, “serious possibility” meant. He interpreted it as meaning that the chance of attack was around 65%. But when he asked members of the Board of National Estimates what they thought, he heard figures from 20% to 80%. Such a wide range was clearly a problem, as the policy implications of those extremes were markedly different. Kent recognized that the solution was to use numbers, noting ruefully, “We did not use numbers…and it appeared that we were misusing the words.”

Not much has changed since then. Today people in the worlds of business, investing, and politics continue to use vague words to describe possible outcomes.

To examine this problem in more depth, team Mauboussin asked 1700 people to attach probabilities to a range of words or phrases. For instance, if a future event is likely to happen, what percentage of the time would you estimate it ends up happening? Or what if the future event has a real possibility of happening?

Unsurprisingly, the answers are all over the place. The HBR article has a nice chart of the distribution of responses, and you can see more detailed results here (you can also take the survey there).

What is the range of answers for an event that is “likely”? The 90% probability range for “likely” – that is, the range within which 90% of the answers fell, with 5% of the answers above it and 5% below – was 55% to 90%. “Real possibility” had a probability range of 20% to 80% – the phrase is near meaningless. Even “always” is ambiguous, with a probability range of 90% to 100%.
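A 90% range of this kind is simply the span between the 5th and 95th percentiles of the survey responses. A toy sketch, with made-up answers standing in for the Mauboussins’ data:

```python
import numpy as np

# Invented responses to "what probability does 'likely' convey?" (in %).
# The real distribution comes from the Mauboussins' survey of 1,700 people.
responses_likely = [60, 70, 75, 80, 85, 90, 55, 65, 70, 80,
                    75, 88, 62, 77, 83, 90, 58, 72, 79, 86]

# The 90% range: 5% of answers fall below `low`, 5% above `high`.
low, high = np.percentile(responses_likely, [5, 95])
```

The width of that interval is the point: the wider it is, the less a phrase like “likely” actually communicates.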

An interesting finding of the survey was that men and women differ in their interpretations. Women are more likely to take a phrase as indicating a higher probability.

So what does team Mauboussin suggest we should do? Use numbers. Pin down those subjective probabilities using objective benchmarks. Practice.

And to close with another piece of Sherman Kent wisdom:

Said R. Jack Smith:  Sherm, I don’t like what I see in our recent papers. A 2-to-1 chance of this; 50-50 odds on that. You are turning us into the biggest bookie shop in town.

Replied Kent:  R.J., I’d rather be a bookie than a [blank-blank] poet.

Avoiding trite lists of biases and pictures of human brains on PowerPoint slides

From a book chapter by Greg Davies and Peter Brooks, Practical Challenges of Implementing Behavioral Finance: Reflections from the Field (quotes taken from a pre-print):

Taken in isolation, the ideas and concepts that comprise the field of behavioral finance are of very little practical use. Indeed, many of the attempts to apply these ideas amount to little more than a trite list of biases and pictures of human brains on PowerPoint slides. Talking a good game in the arena of behavioral finance is easy, which often leads to the misperception that it is superficial. Yet, making behavioral finance work in practice is much more challenging: it requires integrating these ideas with working models, information technology (IT) systems, business processes, and organizational culture.

Substitute “behavioural economics” and its kin for “behavioural finance”, and the message reads the same.

On the “bias” bias:

Today, extremely long lists of biases are available, which do little to convey the underlying sophistication, complexity, and thoroughness of more than half a century of highly robust experimental and theoretical work. These lists provide no real framework for potential practitioners to deploy when approaching a tangible problem. And many of these biases appear to overlap or conflict with each other, which can make behavioral finance appear either very superficial or highly confused.

The easily accessible examples that academics have used to illustrate these biases to wide audiences have sometimes led to the impression that behavioral economics is an easy field to master. This misrepresentation leads to inevitable disappointment when categorizing biases proves not to be an easy panacea. A perception of the field as “just anecdotes and parlor games” reduces the willingness of the commercial world to put substantial investments of time and resource into building applications grounded on the underlying ideas. Building behavioral finance ideas into commercial applications requires both depth and breadth of understanding of the theory and, in many cases, large resource commitments.

On whether there is a grand unified theory:

A commonly expressed concern, at least in the mainstream press, is that there exists no grand unified theory of behavioral economics, and that the field is thus merely a chaotic collection of unconnected and often contradictory findings. For the purpose of practical implementation, the notion that this is, or needs to be, a clearly defined field should be eliminated, reducing the desire to erode it with arbitrary labels and definitions. Human behavior operates at multiple levels from the neurological to complex social interactions. Any quest for a grand unified theory to mirror that of physical sciences may well be entirely misguided, together with the notion that such a theory is necessary for the broad field to be useful. Much more effective is an approach of treating the full range of behavioral findings as a rich toolbox that can be applied to, and tested on, a range of practical concerns.

On the superficial application:

The first major challenge is that behavioral finance is not particularly effective if applied superficially. Yet, superficial attempts are commonplace. Some seek to do little more than offer a checklist of biases, hoping that informing people of poor decision-making can solve the problem. Instead, a central theme of decision science is the consistent finding that merely informing people of their adverse behavioral proclivities is very seldom effective in combating them.

Because behavioral finance is both topical and fascinating to many people, it attracts ‘hobbyists’ who can readily recite a number of biases, but who neither have the depth of knowledge of the field overall, nor a solid grasp of the theoretical underpinnings of the more technical aspects of the field. …

This chapter is not an attempt to erect barriers to entry amongst behavioral practitioners and claim that only those with advanced degrees in the field should be taken seriously. On the contrary, the effect of greater academic training can cause its beneficiaries to hold on too closely to narrow and technical interpretations of the field to make them effective practitioners. Indeed, some of the most effective practitioners do not have an extensive academic background in the field. However, they have invested considerable time and effort getting to know and deeply understand the breadth and depth of the field.

And on naive buyers:

Limited study of behavioral finance through reading the popular books on the topic may equip one to sound knowledgeable and appear convincing. However, as a relatively new field, the purchasers of behavioral expertise are seldom equipped to know the difference and may be unable to tell a superficially convincing approach from approaches that embody true understanding. This leaves the field open to consultants peddling ‘behavioral expertise’ but having in their toolkit little more than a list of biases that they apply sequentially and with little variation to each problem encountered. Warning flags should go up whenever the proposal rests heavily on catalogues of behavioral biases or contains a preponderance of pictures of brains.

Chris Voss’s Never Split the Difference: Negotiating as if your life depended on it

Summary: Interesting ideas on how to approach negotiation, but I don’t know how much weight to give them. How much expertise could be developed in hostage negotiations? Can that expertise be distilled into principles, or is much of it tacit knowledge?

Chris Voss’s Never Split the Difference: Negotiating as if your life depended on it (written with Tahl Raz) is a distillation of Voss’s approach to negotiation, developed through 15 years negotiating hostage situations for the FBI. Voss was the FBI’s lead international kidnapping negotiator, and for the last decade he has run a consulting firm that guides organisations through negotiations.

I am not sure how I should rate the book. There are elements I like, elements that seem logical, and yet a sense that much is just storytelling. I don’t know enough of the negotiation literature to understand what other support there might be for Voss’s approach – and Voss generally doesn’t draw on the literature – so it is not clear what weight I should give to his arguments.

Voss’s central thread is that we should not approach negotiation as though it is a purely rational exercise. No matter how you frame the negotiation in advance, there is no escaping the humans that will be engaging in that negotiation.

This argument seems obvious, as in many negotiations you will be dealing with emotional people. Yet a flip through some of the classic negotiating texts, such as Getting to Yes, shows that the consideration of emotion is often shallow. Emotion is largely discussed as something to be overcome so that a mutually beneficial deal can be reached.

A deeper level of understanding is to see how integral emotion is to the negotiating process. Emotion and decision-making cannot be disentangled.

In the opening chapter, Voss links this need to consider emotions to the work of Daniel Kahneman and Amos Tversky (unfortunately described as University of Chicago professors who discovered more than 150 cognitive biases). Voss draws on Kahneman’s distinction between the two modes of thought described in Thinking, Fast and Slow: the fast, instinctive and emotional System 1, and the slow, deliberative and logical System 2. If you go into a negotiation with all the tools to deal with System 2 but without the tools to read, understand and manipulate System 1, you would be trying to make an omelette without cracking an egg.

Despite being prominent in the opening, Kahneman and Tversky’s work is only briefly considered in other parts of the book, mainly in one chapter that includes examination of anchoring and loss aversion. By manipulating someone’s reference point and capitalising on their fear of loss, you can shift the terms of what they will agree to.

For instance, Voss suggests that you might initially anchor the other side’s expectations through an “accusation audit”, whereby you list every terrible thing the other side could say about you in advance. You then create a frame so that the agreement is to avoid loss. Putting those together, you might start out by saying that you have a horrible deal for them, but still want to bring it to them before you give it to somebody else. By taking the sting out of the low offer and framing acceptance of that offer as an opportunity to avoid loss, you might induce acceptance.

Voss also discusses the idea of setting a very high or low anchor early in negotiations, although he notes that this comes at a cost. It might be effective against the inexperienced, but you lose the opportunity of learning from the other side when they go first. If prepared, you can resist their anchor, and if you are in a low information environment, you might be pleasantly surprised.

Voss recognises the human desire for fairness as another important factor. While Voss draws on the academic literature to demonstrate that desire, his proposed approaches to fairness in negotiation are not put in the context of that literature. As a result, I don’t have much of a grip on whether his ideas – such as avoiding accusations of unfairness, and giving the other side permission to stop you at any time if they feel you are being unfair – are effective. It’s polite, and sounds reasonable, but does it work?

The concept that gets the most attention in the book is tactical empathy. This involves active listening, with tools such as mirroring (repeating the last few words someone said to induce them to keep explaining), labelling (giving a name to their feelings) and summarising their position back to them. I am partial to these ideas. By listening, you can learn a lot. I have always found that simple repetition of concepts, whether through mirroring, labelling or summarising, is a powerful tool to get people to open up and to understand their position.

Another thread to the book is the idea of saying no without saying no, generally through the use of calibrated questions. Calibrated questions are open-ended questions that can’t be answered with a yes or no. They typically start with “how” or “what”, rather than “is” or “does”. They can be used to give the other side the illusion of control while pushing them to think about solving your problem. If the price is higher than you want to pay, you might ask “How am I supposed to pay that?” Calibrated questions also have broader use throughout the negotiation to learn more from your counterpart.

Ideas such as this seem attractive, but I don’t know how much weight I should put on Voss’s arguments. This is largely because I don’t know how much expertise you could develop in hostage negotiation, and the degree to which that expertise is tacit knowledge. Voss notes that his expertise is built from experience, not from textbooks, and that his approach is designed for the real world. Can a human build skills for this real world? Is there rapid feedback on decisions, with an opportunity to learn?

In one sense there is feedback, with the hostages released or not, and the terms of that release known. But each negotiation would involve a multitude of decisions and factors. Conversations might extend for days or weeks. How effectively can you isolate the cause of the outcome? How stable is that cause-effect relationship across different negotiations?

In a podcast episode with Sam Harris, Voss mentioned that he had been involved in around 150 hostage negotiations around the world. That would seem a fair number from which to start identifying patterns, particularly if you consider that each negotiation might contain many smaller opportunities for feedback, such as extracting information. But as Voss’s stories through the book show, these negotiations spanned many different countries and contexts. How many of those elements are common and stable enough for true expertise to develop? Most of his experience involved international kidnapping – a commodity business involving financial transactions. Can the lessons from these be applied elsewhere?

Voss (and the FBI more generally) would have had a broader range of examples to draw on, and Voss’s more recent experience in consulting on negotiation could provide further opportunities to develop expertise. But it’s not obvious how that experience is incorporated into expertise that in turn can be effectively distilled into a book.