# What can we infer about someone who rejects a 50:50 bet to win $110 or lose $100? The Rabin paradox explored

Consider the following claim:

We don’t need loss aversion to explain a person’s decision to reject a 50:50 bet to win $110 or lose $100. That is just simple risk aversion, as in expected utility theory.

Risk aversion is the concept that we prefer certainty to a gamble with the same expected value. For example, a risk averse person would prefer $100 for certain over a 50-50 gamble between $0 and $200, which has an expected value of $100. The higher their risk aversion, the less they would value the 50:50 bet. They would also be willing to reject some positive expected value bets.
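As a minimal sketch in R (assuming, purely for illustration, a square-root utility function, which is concave and therefore risk averse), we can check both behaviours directly:

```r
# Square-root utility: concave, so the agent is risk averse
u <- function(w) sqrt(w)

# Certain $100 versus a 50:50 gamble between $0 and $200
u(100)                     # 10
0.5 * u(0) + 0.5 * u(200)  # ~7.07, so the certain $100 is preferred

# The same agent rejects the positive expected value 50:50 bet to
# win $110 / lose $100 at a low enough wealth (here $400)
w <- 400
eu_bet <- 0.5 * u(w + 110) + 0.5 * u(w - 100)
eu_bet < u(w)  # TRUE: reject
```

At higher wealth the same agent accepts the bet, a point that becomes important later.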

Loss aversion is the concept that losses loom larger than gains. If the loss is weighted more heavily than the gain – it is often said that losses hurt twice as much as gains bring us joy – then this could also explain the decision to reject a 50:50 bet of the type above. Loss aversion is distinct from risk aversion, as its full force applies to the first dollar either side of the reference point from which the person is assessing the change (at which point risk aversion should be negligible).

So, do we need loss aversion to explain the rejection of this bet, or does risk aversion suffice?

One typical response to the above claim is loosely based on the Rabin Paradox, which comes from a paper published in 2000 by Matthew Rabin:

An expected utility maximiser who rejects this bet is exhibiting a level of risk aversion that would lead them to reject bets that no one in their right mind would reject. It can’t be the case that this is simply risk aversion.

For the remainder of this post I am going to pull apart Rabin’s argument from his justifiably famous paper Risk Aversion and Expected-Utility Theory: A Calibration Theorem (pdf). A more readable version of this argument was also published in 2001 in an article by Rabin and Richard Thaler.

To understand Rabin’s point, I have worked through the math in his paper. You can see my mathematical workings in an Appendix at the bottom of this post. There were quite a few minor errors in the paper – and some major errors in the formulas – but I believe I’ve captured the crux of the argument. (I’d be grateful for some second opinions on this).

I started working through these two articles with an impression that Rabin’s argument was a fatal blow to the idea that expected utility theory accurately describes the rejection of bets such as that above. I would have been comfortable making the above response. However, after playing with the numbers and developing a better understanding of the paper, I would say that the above response is not strictly true. Rabin’s paper makes an important point, but it is far from a fatal blow by itself. (That fatal blow does come, just not solely from here.)

Describing Rabin’s argument

Rabin’s argument starts with a simple bet: suppose you are offered a 50:50 bet to win $110 or lose $100, and you turn it down. Suppose further that you would reject this bet no matter what your wealth (this is an assumption we will turn to in more detail later). What can you infer about your response to other bets?

This depends on what decision making model you are using.

For an expected utility maximiser – someone who maximises the probability weighted subjective value of these bets – we can infer that they will turn down any 50:50 bet of losing $1,000 and gaining any amount of money. For example, they would reject a 50:50 bet to lose $1,000, win one billion dollars.

On its face, that is ridiculous, and that is the crux of Rabin’s argument. Rejection of the low value bet to win $110 and lose $100 would lead to absurd responses to higher value bets. This leads Rabin to argue that risk aversion or the diminishing value of money has nothing to do with rejection of the low value bets.

The intuition behind Rabin’s argument is relatively simple. Suppose we have someone who rejects a 50:50 bet to gain $11, lose $10. They are an expected utility maximiser with a weakly concave utility curve: that is, they are risk neutral or risk averse at all levels of wealth.

From this, we can infer that they weight the average dollar between their current wealth (W) and their wealth if they win the bet (W+11) at most 10/11 as much as they weight the average dollar of the last $10 of their current wealth (between W−10 and W). We can also say that they therefore weight their W+11th dollar at most 10/11 as much as their W−10th dollar (relying on the weak concavity here).

Suppose their wealth is now W+21. We have assumed that they will reject the bet at all levels of wealth, so they will also reject it at this wealth. Iterating the previous calculations, we can say that they weight their W+32nd dollar only 10/11 as much as their W+11th dollar. This means they value their W+32nd dollar only $(10/11)^2$ as much as their W−10th dollar. Keep iterating in this way and you end up with some ridiculous results. You value the 210th dollar above your current wealth only 40% as much as the last dollar of your current wealth [reducing by a constant factor of 10/11 every $21 gives $(10/11)^{10}$]. Or you value the 900th dollar above your current wealth at only 2% of your last current dollar [$(10/11)^{40}$]. This is an absurd rate of discounting.
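These discount factors are quick to verify; the bracketed factors in the paragraph above are just powers of 10/11:

```r
# Marginal utility relative to the last dollar of current wealth,
# falling by a factor of 10/11 with each iteration
(10/11)^10  # ~0.39: the 210th dollar is worth roughly 40% of the last
(10/11)^40  # ~0.02: the 900th dollar is worth roughly 2%
```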

Those numbers are from the 2001 Rabin and Thaler paper. In his 2000 paper, Rabin gives figures of 3/20 for the 220th dollar and 1/2000 for the 880th dollar, effectively calculating $(10/11)^{20}$ and $(10/11)^{80}$, which is a reduction by a factor of 10/11 every 11 dollars. This degree of discounting can be justified and reflects the equations provided in the Appendix to his paper, but it requires a slightly different intuition than the comparison between every 21st dollar. If instead you note that the $11 above a reference point is valued less than the $10 below, you only need to iterate up $11 to get another discount of 10/11, as the next $11 is valued at most as much as the previous $10.

Regardless of whether you use the numbers from the 2000 or 2001 paper, taking this iteration to the extreme, it doesn’t take long for additional money to have effectively zero value. Hence the result: reject the 50:50 win $110, lose $100 bet and you’ll reject the win-any-amount, lose-$1,000 bet.

What is the utility curve of this person?

This argument sounds compelling, but we need to examine the assumption that you will reject the bet at all levels of wealth.

If someone rejects the bet at all levels of wealth, what is the least risk averse they could be? They would be close to indifferent to the bet at all levels of wealth. If that were the case across the whole utility curve, their absolute level of risk aversion would be constant.

The equation used to represent utility with constant absolute risk aversion is exponential utility, $U(w)=1-e^{-aw}$ (with $a>0$). A feature of the exponential utility function is that, for a risk averse person, utility caps out at a maximum. Beyond a certain level of wealth, they gain effectively no additional utility – hence Rabin’s ability to define bets where they reject infinite gains.
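To make this concrete, here is a sketch assuming the common CARA form $U(w)=1-e^{-aw}$ (the specific functional form and numbers are my illustration, not Rabin's). We solve for the $a$ at which the agent is indifferent to the win $110 / lose $100 bet at every wealth level, then check the implication for large gains:

```r
# Indifference at every wealth level under U(w) = 1 - exp(-a*w) requires
#   0.5 * exp(-110 * a) + 0.5 * exp(100 * a) = 1
f <- function(a) 0.5 * exp(-110 * a) + 0.5 * exp(100 * a) - 1
a <- uniroot(f, c(1e-6, 0.01), tol = 1e-12)$root
a  # ~0.0009

# Utility is bounded above: even infinite wealth adds at most exp(-a*w)
# utils, while losing $1,000 costs exp(-a*w) * (exp(1000 * a) - 1) utils
exp(1000 * a) - 1  # > 1, so a 50:50 win-anything / lose-$1,000 bet is rejected
```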

The need for utility to cap out is also apparent from the fact that someone might reject a bet that involves the potential for infinite gain. The utility of infinite wealth cannot be infinite, as any bet involving the potential for infinite utility would be accepted.

In the 2000 paper, Rabin brings the constant absolute risk aversion function into his argument more explicitly when he examines what proportion of their portfolio a person with an exponential utility function would invest in stocks (under some particular return assumptions). There he shows a ridiculous level of risk aversion and states that “While it is widely believed that investors are too cautious in their investment behavior, no one believes they are this risk averse.”

However, this effective (or explicit) assumption of constant absolute risk aversion is not particularly well grounded. Most empirical evidence is that people exhibit decreasing absolute risk aversion, not constant. Exponential utility functions are used more for mathematical tractability than for realistically reflecting the decision making processes that people use.

Yet, under Rabin’s assumption of rejecting the bet at all levels of wealth, constant absolute risk aversion and a utility function such as the exponential is the most accommodating assumption we can make. While Rabin states that “no one believes they are this risk averse”, it’s not clear that anyone believes Rabin’s underlying assumption either.

This ultimately means that the ridiculous implications of rejecting low-value bets are the result of Rabin’s unrealistic assumption that the bet is rejected no matter what the person’s wealth.

Relaxing the “all levels of wealth” assumption

Rabin is, of course, aware that the assumption of rejecting the bet at all levels of wealth is a weakness, so he provides a further example that applies to someone who only rejects this bet for all levels of wealth below $300,000. This generates less extreme, but still clearly problematic, bets that the bettor can be inferred to also reject.

For example, consider someone who rejects the 50:50 bet to win $110, lose $100 when they have $290,000 of wealth, and who would also reject that bet up to a wealth of $300,000. As in the previous example, each time you iterate up $110, each dollar in that $110 is valued at most 10/11 of the previous $110. It takes 90 iterations of $110 to cover that $10,000, meaning that a dollar around wealth $300,000 will be valued at only $(10/11)^{90}$ (0.02%) of a dollar at wealth $290,000. Each dollar above $300,000 is not discounted any further, but by then the damage has already been done, with that money of almost no utility. For instance, this person will reject a bet to gain $718,190, lose $1,000. Again, this person would be out of their mind.

You might now ask whether a person with a wealth of $290,000 to $300,000 actually rejects bets of this nature. If not, isn’t this just another unjustifiable assumption designed to generate a ridiculous result?

It is possible to make this scenario more realistic. Rabin doesn’t mention this in his paper (nor do Rabin and Thaler), but we can generate the same result at much lower levels of wealth. All we need to find is someone who will reject that bet over a range of $10,000, and still have enough wealth to bear the loss – say someone who will reject that bet up to a wealth of $11,000. That person will also reject a win $718,190, lose $1,000 bet. Rejection of the win $110, lose $100 bet over that range does not seem as unrealistic, and I could imagine a person with that preference existing.
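The 90-step discount in the $290,000 example can be checked directly:

```r
# 90 iterations of $110 cover the $10,000 between $290,000 and $300,000,
# each discounting the marginal dollar by a further factor of 10/11
(10/11)^90  # ~0.0002, i.e. ~0.02% of a dollar at wealth $290,000
```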
If we empirically tested this, we would also need to examine liquid wealth and cash flow, but the example does provide a sense that we could find some people whose rejection of low value bets would generate absurd results under expected utility maximisation.

The log utility function

Let’s compare Rabin’s example utility function with a more commonly assumed utility function, log utility. Log utility has decreasing absolute risk aversion (and constant relative risk aversion), so it is more empirically defensible and does not generate utility that asymptotes to a maximum like the exponential utility function.

A person with log utility would reject the 50:50 bet to win $110, lose $100 up to a wealth of $1,100. Beyond that, they would accept the bet. So, for log utility we should see most people accept this bet.
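The $1,100 threshold can be derived in closed form. For log utility, indifference to a 50:50 bet with gain $g$ and loss $l$ at wealth $w$ requires:

```latex
\tfrac{1}{2}\ln(w+g) + \tfrac{1}{2}\ln(w-l) = \ln(w)
\;\Longleftrightarrow\; (w+g)(w-l) = w^2
\;\Longleftrightarrow\; w = \frac{gl}{g-l}
```

With $g=110$ and $l=100$, this gives $w=\frac{110\times 100}{10}=1{,}100$: the bet is rejected below that wealth and accepted above it.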

A person with log utility will reject some quite unbalanced bets, such as a 50:50 bet to win $1 million, lose $90,900, but only up to a wealth of $100,000, beyond which they would accept. Rejection only occurs when a loss is near ruinous. The result is that log utility does not generate the types of rejected bets that Rabin labels as ridiculous, but it would also fail to provide much of an explanation for the rejection of low value bets with positive expected value.

The empirical evidence

Do people actually turn down 50:50 bets of win $110, lose $100? Surprisingly, I couldn’t find an example of this bet (if someone knows a paper that directly tests this, let me know). Most examinations of loss aversion examine symmetric 50:50 bets where the potential gain and loss are the same. They compare a bet centred around 0 (e.g. gain $100 or lose $100) and a similar bet in a gain frame (e.g. gain $100 or gain $300, or take $200 for certain). If more people reject the first bet than the latter, this is evidence of loss aversion.

It makes sense that this is the experimental approach. If the bet is not symmetric, it becomes hard to tease out loss aversion from risk aversion.

However, there is a pattern in the literature that people often reject risky bets with a positive expected value in the ranges explored by Rabin. We don’t know a lot about their wealth (or liquidity), but Rabin’s illustrative numbers for rejected bets don’t seem completely unrealistic. It’s the range of wealth over which the rejection occurs that is questionable.

Rather than me floundering around on this point, there are papers that explicitly ask whether we can observe a set of bets for a group of experimental subjects and map a curve to those choices that resembles expected utility.

For instance, Holt and Laury’s 2002 AER paper (pdf) examined a set of hypothetical and incentivised bets over a range of stakes (finding among other things that hypothetical predictions of their response to incentivised high-stakes bets were not very accurate). They found that if you are flexible about the form of the expected utility function that is used, rejection of small gambles does not result in absurd conclusions on large gambles. The pattern of bets could be made consistent with expected utility, assuming you correctly parameterise the equation. Over subsequent years there was some back and forth on whether this finding was robust [see here (pdf) and here (pdf)], but the basic result seemed to hold.

The utility curve that best matched Holt and Laury’s experimental findings had increasing relative risk aversion, and decreasing absolute risk aversion. By having decreasing absolute risk aversion, the absurd implications of Rabin’s paper are avoided.

Papers such as this suggest that while Rabin’s paper makes an important point, its underlying assumptions are not consistent with empirical evidence. It is possible to have an expected utility maximiser reject low value bets without generating ridiculous outcomes.

So what can you infer about our bettor who has rejected the win $110, lose $100 bet?

From the argument above, I would say not much. We could craft a utility function to accommodate this bet without leading to ridiculous consequences. I personally feel this defence is laboured (that’s a subject for another day), but the bet is not in itself fatal to the argument that they are an expected utility maximiser.

Appendix

The utility of a gain

Let’s suppose someone will reject a 50:50 bet with gain $g$ and loss $l$ for any level of wealth. What utility will they get from a gain of $x$? Rabin defines an upper bound of the utility of gaining $x$ to be:

$U(w+x)-U(w)\leq\sum_{i=0}^{k^{**}(x)}\left(\frac{l}{g}\right)^ir(w)$

$k^{**}(x)=int\left(\frac{x}{g}\right)$

$r(w)=U(w)-U(w-l)$

This formula effectively breaks down $x$ into $g$ size components, successively discounting each additional $g$ at $\frac{l}{g}$ of the previous $g$.

You need $k^{**}(x)+1$ lots of $g$ to cover $x$. For instance, if $x$ were 32 and we had a 50:50 bet to win $11, lose $10, then $int\left(\frac{32}{11}\right)=2$. You need 2+1 lots of 11 to fully cover 32. This actually covers a touch more than 32, hence the calculation being for an upper bound.

In the paper, Rabin defines $k^{**}(x)=int\left(\left(\frac{x}{g}\right)+1\right)$. This seems to better capture the required number of lots of $g$ to fully cover $x$, but the iterations in the above formula start at $i=0$. The calculations I run with my version of the formula replicate Rabin’s, supporting the suggestion that the addition of 1 in the paper is an error.

$r(w)$ is shorthand for the amount of utility sacrificed from losing the gamble (i.e. losing $l$). We know that the utility of the gain $g$ is less than this, as the bet is rejected. If we let $r(w)=1$, the equation can be thought of as giving you the maximum utility you could get from the gain of $x$ relative to the utility of the loss of $l$.

Putting this together, the upper bound of the utility of the possible gain $x$ is the sum of, first, the upper bound of the relative utility from the first $11, $\left(\frac{10}{11}\right)^0r(w)=r(w)$; then the upper bound of the utility from the next $11, $\left(\frac{10}{11}\right)^1r(w)$; and finally the upper bound of the utility from the remaining $10 – taking a conservative approach, this is calculated as though it were a full $11: $\left(\frac{10}{11}\right)^2r(w)$.

The utility of a loss

Rabin also gives us a lower bound of the utility of a loss of $x$ for this person who will reject a 50:50 bet with gain $g$ and loss $l$ for any level of wealth:

$U(w)-U(w-x)\geq{2}\sum_{i=1}^{k^{*}(x)}\left(\frac{g}{l}\right)^{i-1}{r(w)}$

$k^{*}(x)=int\left(\frac{x}{2l}\right)$

The intuition behind $k^{*}(x)$ comes from Rabin’s desire to provide a relatively uncomplicated proof for the proposition. Effectively, the utility scales down with each step of $g$ by at least $\frac{g}{l}$. Since Rabin wants to express this in terms of losses, he defines $2l\geq{g}\geq{l}$. He can thereby say that utility scales down by at least $\frac{g}{l}$ every 2 lots of $l$.

Otherwise, the intuition for this loss formula is the same as that for the gain. The summation starts at $i=1$ as this formula provides a lower bound, so it does not require a final iteration to fully cover $x$. The formula is also multiplied by 2 as each iteration covers two lots of $l$, whereas $r(w)$ is defined over a single span of $l$.

Running some numbers

The below R code implements the above two formulas as a function, calculating the potential utility change for a win of $G$ or a loss of $L$ for a person who rejects a 50:50 bet to win $g$, lose $l$ at all levels of wealth. It then states whether we know the person will reject a win $G$, lose $L$ bet – we can’t state that they will accept, as we only have an upper bound on the utility of the gain and a lower bound on the utility of the loss.

Rabin_bet <- function(g, l, G, L){

  # Number of g-sized steps needed to cover the gain G, and the number of
  # 2l-sized steps covered by the loss L
  k_2star <- as.integer(G / g)
  k_star <- as.integer(L / (2 * l))

  # Upper bound on the utility of gaining G (in units of r(w))
  U_gain <- 0
  for (i in 0:k_2star) {
    U_step <- (l / g)^i
    U_gain <- U_gain + U_step
  }

  # Lower bound on the utility lost from losing L (in units of r(w))
  U_loss <- 0
  for (i in 1:k_star) {
    U_step <- 2 * (g / l)^(i - 1)
    U_loss <- U_loss + U_step
  }

  if (U_gain < U_loss) {
    print("REJECT")
  }

  print(paste0("Max U from gain =", U_gain))
  print(paste0("Min U from loss =", U_loss))
}


Take a person who will reject a 50:50 bet to win $110, lose $100. Taking the numbers from the table in the paper, they would reject a win $1,000,000,000, lose $1,000 bet.

Rabin_bet(110, 100, 1000000000, 1000)

[1] "REJECT"
[1] "Max U from gain =11"
[1] "Min U from loss =12.2102"


Relaxing the wealth assumption

In the Appendix of his paper, Rabin defines his proof where the bet is rejected over a range of wealth $w\in(\underline{w}, \bar w)$. In that case, the relative utility of each additional gain of size $g$ is $\frac{l}{g}$ of the previous $g$ until $\bar w$. Beyond that point, each additional gain of $g$ gives constant utility until $x$ is reached. The formula for the upper bound on the utility gain is:

$U(w+x)-U(w)\leq \begin{cases} \sum_{i=0}^{k^{**}(x)}\left(\frac{l}{g}\right)^ir(w) & if\quad x\leq{\bar w}-w\\ \\ \sum_{i=0}^{k^{**}(\bar w)}\left(\frac{l}{g}\right)^{i}r(w)+\left[\frac{x-(\bar w-w)}{g}\right]\left(\frac{l}{g}\right)^{k^{**}(\bar w)}r(w) & if\quad x\geq{\bar w}-w \end{cases}$

The first term of the case where $x\geq{\bar w}-w$ involves iterated discounting, as in the situation where the bet is rejected at all levels of wealth, but here the iteration only runs up to wealth $\bar w$. The second term of that case captures the gain beyond $\bar w$, discounted at a constant rate.

There is an error in Rabin’s formula in the paper. Rather than the term $\left[\frac{x-(\bar w-w)}{g}\right]$ in the second equation, Rabin has it as $[x-\bar w]$. As for the previous equations, we need to know the number of iterations of the gain, not total dollars, and we need this between $\bar w$ and $w+x$.

When Rabin provides the examples in Table II of the paper, from the numbers he provides I believe he actually uses a formula of the type $int\left[\frac{x-(w-\underline w)}{g}+1\right]$, which reflects a desire to calculate the upper-bound utility across the stretch above $\bar w$ in a similar manner to below, although this is not strictly necessary given the discount is constant across this range. I have implemented it as per my formula, which means that a bet for gain $G$ is rejected $g$ higher than for Rabin (which, given the scale of the bets involved, is not material).

Similarly, for the loss:

$U(w)-U(w-x)\geq \begin{cases} {2}\sum_{i=1}^{k^{*}(x)}\left(\frac{g}{l}\right)^{i-1}{r(w)} & if\quad {w-\underline w+2l}\geq{x}\geq{2l}\\ \\ {2}\sum_{i=1}^{k^{*}(w-\underline w+2l)}\left(\frac{g}{l}\right)^{i-1}{r(w)}+\left[\frac{x-(w-\underline w+l)}{2l}\right]\left(\frac{g}{l}\right)^{k^{*}(w-\underline w+2l)}{r(w)} & if\quad x\geq{w-\underline w+2l} \end{cases}$

There is a similar error here, with Rabin using the term $\left[x-(w-\underline w+l)\right]$ rather than $\left[\frac{x-(w-\underline w+l)}{2l}\right]$. We can’t determine how this was implemented by Rabin as his examples do not examine behaviour below a lower bound $\underline w$.

Running some more numbers

The below code implements the above two formulas as a function, calculating the potential utility change for a win of $G$ or a loss of $L$ for a person who rejects a 50:50 bet to win $g$, lose $l$ at wealth $w\in(\underline{w}, \bar w)$. It then states whether we know the person will reject a win $G$, lose $L$ bet – as before, we can’t state that they will accept, as we only have an upper bound on the utility of the gain and a lower bound on the utility of the loss.

Rabin_bet_general <- function(g, l, G, L, w, w_max, w_min){

  # Number of g-sized steps: iterated discounting applies only up to w_max
  if (G <= (w_max - w)) {
    k_2star <- as.integer(G / g)
  } else {
    k_2star <- as.integer((w_max - w) / g)
  }

  # Number of 2l-sized steps: iterated scaling applies only down to w_min
  if (w - w_min + 2 * l >= L) {
    k_star <- as.integer(L / (2 * l))
  } else {
    k_star <- as.integer((w - w_min + 2 * l) / (2 * l))
  }

  # Upper bound on the utility of the gain up to w_max
  U_gain <- 0
  for (i in 0:k_2star) {
    U_step <- (l / g)^i
    U_gain <- U_gain + U_step
  }

  # Beyond w_max, each additional g adds constant utility
  if (G > (w_max - w)) {
    U_gain <- U_gain + ((G - (w_max - w)) / g) * (l / g)^k_2star
  }

  # Lower bound on the utility lost down to w_min
  U_loss <- 0
  for (i in 1:k_star) {
    U_step <- 2 * (g / l)^(i - 1)
    U_loss <- U_loss + U_step
  }

  # Below w_min, each additional 2l costs constant utility
  if (L > w - w_min + 2 * l) {
    U_loss <- U_loss + ((L - (w - w_min + l)) / (2 * l)) * (g / l)^k_star
  }

  if (U_gain < U_loss) {
    print("REJECT")
  } else {
    print("CANNOT CONFIRM REJECT")
  }

  print(paste0("Max U from gain =", U_gain))
  print(paste0("Min U from loss =", U_loss))
}


Imagine someone who turns down the win $110, lose $100 bet with a wealth of $290,000, but who would only reject this bet up to a wealth of $300,000. They will reject a win $718,190, lose $1,000 bet.

Rabin_bet_general(110, 100, 718190, 1000, 290000, 300000, 0)

[1] "REJECT"
[1] "Max U from gain =12.2098745626936"
[1] "Min U from loss =12.2102"

The nature of Rabin’s calculation means that we can scale this calculation to anywhere on the wealth curve. We need only say that someone who rejects this bet over (roughly) a range of $10,000 plus the size of the potential loss will exhibit the same decisions. For example, a person with $10,000 of wealth who would reject the bet up to a wealth of $20,000 would also reject the win $718,190, lose $1,000 bet.

Rabin_bet_general(110, 100, 718190, 1000, 10000, 20000, 0)

[1] "REJECT"
[1] "Max U from gain =12.2098745626936"
[1] "Min U from loss =12.2102"

Comparison with log utility

The below is an example with log utility, which is $U(w)=\ln(w)$. This function determines whether someone of wealth $w$ will reject or accept a 50:50 bet for gain $g$ and loss $l$.

log_utility <- function(g, l, w){

  log_gain <- log(w + g)
  log_loss <- log(w - l)

  EU_bet <- 0.5 * log_gain + 0.5 * log_loss
  EU_certain <- log(w)

  if (EU_certain == EU_bet) {
    print("INDIFFERENT")
  } else if (EU_certain > EU_bet) {
    print("REJECT")
  } else {
    print("ACCEPT")
  }

  print(paste0("Expected utility of bet = ", EU_bet))
  print(paste0("Utility of current wealth = ", EU_certain))
}

Testing a few numbers, someone with log utility is indifferent about a 50:50 win $110, lose $100 bet at a wealth of $1,100. They would accept the bet at any level of wealth above that.

log_utility(110, 100, 1100)

[1] "INDIFFERENT"
[1] "Expected utility of bet = 7.00306545878646"
[1] "Utility of current wealth = 7.00306545878646"

That same person will always accept a 50:50 win $1,100, lose $1,000 bet above $11,000 in wealth.

log_utility(1100, 1000, 11000)

[1] "ACCEPT"
[1] "Expected utility of bet = 9.30565055178051"
[1] "Utility of current wealth = 9.30565055178051"

Can we generate any bets that don’t seem quite right? It’s quite hard unless you have a bet that will bring the person to ruin or near ruin. For instance, for a 50:50 bet with a chance to win $1 million, a person with log utility and $100,000 of wealth would still accept the bet with a potential loss of $90,900, which brings them to less than 10% of their wealth.

log_utility(1000000, 90900, 100000)

[1] "ACCEPT"
[1] "Expected utility of bet = 11.5134252151368"
[1] "Utility of current wealth = 11.5129254649702"

The problem with log utility is not the ability to generate ridiculous bets that would be rejected. Rather, it’s that someone with log utility would tend to accept most positive expected value bets (in fact, they would always take a non-zero share of such a bet if they could). Only if the bet brings them to near ruin (either through its size or their lack of wealth) would they turn it down.
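That parenthetical claim – that a log utility agent always takes some non-zero share of a positive expected value bet – is the classic expected-log-growth (Kelly-style) result. A sketch for a 50:50 bet paying $1.10 per $1 staked (the payoff numbers are my illustration):

```r
# Expected log growth of wealth when staking a fraction f on a 50:50 bet
# that pays 1.1 times the stake on a win and loses the stake otherwise
growth <- function(f) 0.5 * log(1 + 1.1 * f) + 0.5 * log(1 - f)

# The optimal stake is interior and strictly positive
opt <- optimize(growth, c(0, 0.99), maximum = TRUE)
opt$maximum  # ~0.045, i.e. stake about 1/22 of wealth
```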

The isoelastic utility function – of which log utility is a special case – is a broader class of function that exhibits constant relative risk aversion:

$U(w)=\frac{w^{1-\rho}-1}{1-\rho}$

If $\rho=1$, this simplifies to log utility (you need to use L’Hopital’s rule to get this, as the fraction is undefined when $\rho=1$). The higher $\rho$, the higher the level of risk aversion. We implement this function as follows:

CRRA_utility <- function(g, l, w, rho=2){

  # The isoelastic form is undefined at rho = 1 (use log utility instead)
  if (rho == 1) stop("Function undefined for rho = 1: use log utility")

  U_gain <- ((w + g)^(1 - rho) - 1) / (1 - rho)
  U_loss <- ((w - l)^(1 - rho) - 1) / (1 - rho)

  EU_bet <- 0.5 * U_gain + 0.5 * U_loss
  EU_certain <- (w^(1 - rho) - 1) / (1 - rho)

  if (EU_certain == EU_bet) {
    print("INDIFFERENT")
  } else if (EU_certain > EU_bet) {
    print("REJECT")
  } else {
    print("ACCEPT")
  }

  print(paste0("Expected utility of bet = ", EU_bet))
  print(paste0("Utility of current wealth = ", EU_certain))
}


If we increase $\rho$, we can increase the proportion of low value bets that are rejected.

For example, a person with $\rho=2$ will reject the 50:50 win $110, lose $100 bet up to a wealth of $2,200. The rejection point scales with $\rho$.

CRRA_utility(110, 100, 2200, 2)

[1] "INDIFFERENT"
[1] "Expected utility of bet = 0.999545454545455"
[1] "Utility of current wealth = 0.999545454545455"

For a 50:50 chance to win $1 million at a wealth of $100,000, the person with $\rho=2$ is willing to risk a far smaller loss, rejecting the bet even when the loss is only $48,000, or less than half their wealth (which admittedly is still a fair chunk).

CRRA_utility(1000000, 48000, 100000, 2)

[1] "REJECT"
[1] "Expected utility of bet = 0.99998993006993"
[1] "Utility of current wealth = 0.99999"

Higher values of $\rho$ start to become completely unrealistic as utility is almost flat beyond an initial level of wealth.

It is also possible to have values of $\rho$ between 0 (risk neutrality) and 1. These would result in even fewer rejected low value bets than log utility, and fewer rejected bets with highly unbalanced potential gains and losses.
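As a numerical sketch of how the rejection threshold moves with $\rho$ (using uniroot to find the indifference wealth for the win $110, lose $100 bet; the helper function is mine, not from Rabin):

```r
# CRRA utility (rho != 1) and the wealth at which a CRRA agent is
# indifferent to the 50:50 win $110 / lose $100 bet
crra <- function(w, rho) (w^(1 - rho) - 1) / (1 - rho)

indiff_wealth <- function(rho) {
  f <- function(w) 0.5 * crra(w + 110, rho) + 0.5 * crra(w - 100, rho) - crra(w, rho)
  uniroot(f, c(101, 1e6))$root
}

indiff_wealth(2)  # ~2200: reject below, accept above
indiff_wealth(3)  # ~3300: the threshold grows roughly in proportion to rho
```

The threshold sits at roughly $\rho$ times the log utility threshold of $1,100, rather than exploding in the way Rabin's calibration suggests.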

# My latest article at Behavioral Scientist: Principles for the Application of Human Intelligence

I am somewhat slow in posting this – the article has been up more than a week – but my latest article is up at Behavioral Scientist.

The article is basically an argument that the scrutiny we are applying to algorithmic decision making should also be applied to human decision making systems. Our objective should be good decisions, whatever the source of the decision.

The introduction to the article is below.

Principles for the Application of Human Intelligence

Recognition of the powerful pattern matching ability of humans is growing. As a result, humans are increasingly being deployed to make decisions that affect the well-being of other humans. We are starting to see the use of human decision makers in courts, in university admissions offices, in loan application departments, and in recruitment. Soon humans will be the primary gateway to many core services.

The use of humans undoubtedly comes with benefits relative to the data-derived algorithms that we have used in the past. The human ability to spot anomalies that are missed by our rigid algorithms is unparalleled. A human decision maker also allows us to hold someone directly accountable for the decisions.

However, the replacement of algorithms with a powerful technology in the form of the human brain is not without risks. Before humans become the standard way in which we make decisions, we need to consider the risks and ensure implementation of human decision-making systems does not cause widespread harm. To this end, we need to develop principles for the application of human intelligence to decision making.

Read the rest of the article here.

# Kahneman and Tversky’s “debatable” loss aversion assumption

Loss aversion is the idea that losses loom larger than gains. It is one of the foundational concepts in the judgment and decision making literature. In Thinking, Fast and Slow, Daniel Kahneman wrote “The concept of loss aversion is certainly the most significant contribution of psychology to behavioral economics.”

Yet, over the last couple of years several critiques have emerged that question the foundations of loss aversion and whether loss aversion is a phenomenon at all.

One is an article by Eldad Yechiam, titled Acceptable losses: the debatable origins of loss aversion (pdf). Framed in one case as a spread of the replication crisis to loss aversion, the abstract reads as follows:

It is often claimed that negative events carry a larger weight than positive events. Loss aversion is the manifestation of this argument in monetary outcomes. In this review, we examine early studies of the utility function of gains and losses, and in particular the original evidence for loss aversion reported by Kahneman and Tversky (Econometrica  47:263–291, 1979). We suggest that loss aversion proponents have over-interpreted these findings. Specifically, the early studies of utility functions have shown that while very large losses are overweighted, smaller losses are often not. In addition, the findings of some of these studies have been systematically misrepresented to reflect loss aversion, though they did not find it. These findings shed light both on the inability of modern studies to reproduce loss aversion as well as a second literature arguing strongly for it.

A second, The Loss of Loss Aversion: Will It Loom Larger Than Its Gain (pdf), by David Gal and Derek Rucker, attacks the concept of loss aversion more generally (supposedly the “death knell”):

Loss aversion, the principle that losses loom larger than gains, is among the most widely accepted ideas in the social sciences. The first part of this article introduces and discusses the construct of loss aversion. The second part of this article reviews evidence in support of loss aversion. The upshot of this review is that current evidence does not support that losses, on balance, tend to be any more impactful than gains. The third part of this article aims to address the question of why acceptance of loss aversion as a general principle remains pervasive and persistent among social scientists, including consumer psychologists, despite evidence to the contrary. This analysis aims to connect the persistence of a belief in loss aversion to more general ideas about belief acceptance and persistence in science. The final part of the article discusses how a more contextualized perspective of the relative impact of losses versus gains can open new areas of inquiry that are squarely in the domain of consumer psychology.

A third strain of criticism relates to the concept of ergodicity. Put forward by Ole Peters, the basic claim is that people are not maximising the expected value of a series of gambles, but rather the time average. If people maximise the latter, not the former as many approaches assume, you don’t need risk or loss aversion to explain the decisions. (I’ll leave explaining what exactly this means to a later post.)
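A minimal numerical preview of the distinction Peters draws (the 1.5 and 0.6 multipliers below are standard illustrative numbers of my own choosing, not taken from his papers): a multiplicative gamble can have a positive expected value each round while the time-average growth rate for a single player repeating it is negative.

```python
import math

# Classic multiplicative coin toss (illustrative parameters): each
# round, wealth is multiplied by 1.5 on heads or 0.6 on tails.
up, down, p = 1.5, 0.6, 0.5

# Ensemble (expected-value) perspective: the average wealth multiple
# per round, taken across many parallel players.
expected_factor = p * up + (1 - p) * down  # 1.05, so the bet looks favourable

# Time-average perspective: the growth a single player locks in over
# many repeated rounds is governed by the expected *log* multiple.
time_average_factor = math.exp(p * math.log(up) + (1 - p) * math.log(down))

print(expected_factor)      # 1.05
print(time_average_factor)  # ~0.95: repeated play shrinks wealth ~5% a round
```

A player maximising the time average turns this gamble down without any appeal to risk or loss aversion, which is the crux of the ergodicity argument.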

I’m as sceptical and cynical about some of the findings in the behavioural sciences as most (here’s my critical behavioural economics and behavioural science reading list), but I’m not sure I am fully on board with these arguments, particularly the stronger statements of Gal and Rucker. This post is the first of a few rummaging through these critiques to make sense of the debate, starting with Yechiam’s paper on the foundations of loss aversion in prospect theory.

Acceptable losses: the debatable origins of loss aversion

One of the most cited papers in the social sciences is Daniel Kahneman and Amos Tversky’s 1979 paper Prospect Theory: An Analysis of Decision under Risk (pdf). Prospect theory is intended to be a descriptive model of how people make decisions under risk, and an alternative to expected utility theory.

Under expected utility theory, people assign a utility value to each possible outcome of a lottery or gamble, with that outcome typically relating to a final level of wealth. The expected utility of a decision under risk is simply the probability-weighted sum of these utilities. The expected utility of a 50% chance of $0 and a 50% chance of $200 is simply 50% of the utility of $0 plus 50% of the utility of $200.
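As a minimal sketch of that calculation (the log utility function and the wealth levels are my own illustrative assumptions, not from the post), here is how an expected utility maximiser would evaluate the 50:50 win-$110/lose-$100 bet from the start of this post:

```python
import math

def expected_utility_of_bet(wealth, gain=110.0, loss=100.0, p=0.5,
                            u=math.log):
    """Probability-weighted sum of utilities over final wealth levels."""
    return p * u(wealth + gain) + (1 - p) * u(wealth - loss)

def accepts_bet(wealth):
    """An expected utility maximiser takes the bet only if it beats
    the utility of keeping current wealth."""
    return expected_utility_of_bet(wealth) > math.log(wealth)

# With log utility, the same agent rejects the bet at low wealth but
# accepts it once the stakes are small relative to wealth.
print(accepts_bet(1_000))   # False: rejects at $1,000 of wealth
print(accepts_bet(10_000))  # True: accepts at $10,000 of wealth
```

With log utility the switch happens at exactly $1,100 of wealth, since (W + 110)(W − 100) = W² there. The wealth-dependence of the decision is what sets up Rabin’s calibration argument: a smooth concave utility function struggles to generate rejection of this bet across a wide range of wealth levels without absurd implications.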

When utility is assumed to increase at a decreasing rate with each additional dollar of wealth – as is typically the case – it leads to risk averse behaviour, with a certain sum preferred to a gamble with an equivalent expected value. For example, a risk averse person would prefer $100 for certain to the 50-50 gamble for $0 or $200.

In their 1979 paper, Kahneman and Tversky described a number of departures from expected utility theory. These included:

• The certainty effect: People overweight outcomes that are considered certain, relative to outcomes which are merely probable.
• The reflection effect: Relative to a reference point, people are risk averse when considering gains, but risk seeking when facing losses.
• The isolation effect: People focus on the elements that differ between options rather than those components that are shared.
• Loss aversion: Losses loom larger than gains – relative to a reference point, a loss is more painful than a gain of the same magnitude.

Loss aversion and the reflection effect result in the following famous diagram of how people weight losses and gains under prospect theory. Loss aversion leads to a kink in the utility curve at the reference point: the curve is steeper below the reference point than above. The reflection effect results in the curve being concave above the reference point, and convex below.

Through the paper, Kahneman and Tversky describe experiments on each of the certainty effect, reflection effect, and isolation effect. However, as pointed out by Eldad Yechiam in his paper Acceptable losses: the debatable origins of loss aversion, loss aversion is taken as a stylised fact. Yechiam writes:

[I]n their 1979 paper, Kahneman and Tversky (1979) strongly argued for loss aversion, even though, at the time, they had not reported any experiments to support it. By indicating that this was a robust finding in earlier research, Kahneman and Tversky (1979) were able to rely upon it as a stylized fact.
They begin their discussion on losses by stating that “a salient characteristic of attitudes to changes in welfare is that losses loom larger than gains” (p. 279), which suggests that this stylized fact is based on earlier findings. They then follow with the (much cited) sentence that “the aggravation that one experiences in losing a sum of money appears to be greater than the pleasure associated with gaining the same amount [17]” (p. 279). Most people who cite this sentence do so without the end quote of Galenter and Pliner (1974). Galenter and Pliner (1974) are, therefore, the first empirical study used to support the notion of loss aversion.

So what did Galenter and Pliner find? Yechiam writes:

Summing up their findings, Galenter and Pliner (1974) reported as follows: “We now turn to the question of the possible asymmetry of the positive and negative limbs of the utility function. On the basis of intuition and anecdote, one would expect the negative limb of the utility function to decrease more sharply than the positive limb increases… what we have observed if anything is an asymmetry of much less magnitude than would have been expected … the curvature of the function does not change in going from positive to negative” (p. 75). Thus, our search for the historical foundations of loss aversion turns into a dead end on this particular branch: Galenter and Pliner (1974) did not observe such an asymmetry; and their study was quoted erroneously.

Effectively, the primary reference for the claim that we are loss averse does not support it. So what other sources did Kahneman and Tversky rely on? Yechiam continues:

They argue that “the main properties ascribed to the value function have been observed in a detailed analysis of von Neumann–Morgenstern utility functions for changes of wealth [14].” (p. 281). The citation refers to Fishburn and Kochenberger’s forthcoming paper (at the time; published 1979).
Fishburn and Kochenberger’s (1979) study reviews data of five other papers (Grayson, 1960; Green, 1963; Swalm, 1966; Halter & Dean, 1971; Barnes & Reinmuth, 1976) also cited by Kahneman and Tversky (1979). Summing up all of these findings, Kahneman and Tversky (1979) argue that “with a single exception, utility functions were considerably steeper for losses than for gains.” (p. 281). The “single exception” refers to a single participant who was reported not to show loss aversion, while the remainder apparently did. These five studies all involved very small samples, involving a total of 30 subjects.

Yechiam walks through three of the studies. On Swalm (1966):

The results of the 13 individuals examined by Swalm … appear at the first glance to be consistent with an asymmetric utility function implying overweighting of losses compared to gains (i.e., loss aversion). Notice, however, that amounts are in the thousands, such that the smallest amount used was set above $1000 and typically above $5000, because it was derived from the participant’s “planning horizon”. Moreover, for more than half of the participants, the utility curve near the origin …, which spans the two smallest gains and two smallest losses for each person, was linear. This deviates from the notion of loss aversion which implies that asymmetries should also be observed for small amounts as well.

This point reflects an argument that Yechiam and others have made in several papers (including here and here) that loss aversion is only apparent in high-stakes gambles. When the stakes are low, loss aversion does not appear.

On Grayson (1960):

A similar pattern is observed in Grayson’s utility functions … The amounts used were also extremely high, with only one or two points below the $50,000 range. For the points above $100,000, the pattern seems to show a clear asymmetry between gains and losses consistent with loss aversion.
However, for 2/9 participants …, the utility curve for the points below $100,000 does not indicate loss aversion, and for 2/9 additional participants no loss aversion is observed for the few points below $50,000.

Thus, it appears that in Grayson (1960) and Swalm (1966), almost all participants behaved as if they gave extreme losses more weight than corresponding gains, yet about half of them did not exhibit a similar asymmetry for the lower losses (e.g., below $50,000 in Grayson, 1960).

Again, loss aversion is stronger for extreme losses. On Green (1963):

… Green (1963) did not examine any losses, making any interpretation concerning loss aversion in this study speculative as it rests on the authors’ subjective impression.

The results from Swalm (1966), Grayson (1960) and Green (1963) cover 26 of the 30 participants aggregated by Fishburn and Kochenberger. Halter and Dean (1971) and Barnes and Reinmuth (1976) only involved two participants each.

So what of other studies that were available to Kahneman and Tversky at the time?

In 1955, Davidson, Siegel, and Suppes conducted an experiment in which participants were presented with heads or tails bets which they could accept or refuse. …

… Outcomes were in cents and ran up to a gain or loss of 50 cents. The results of 15 participants showed that utility curves for gains and losses were symmetric …, with a loss/gain utility ratio of 1.1 (far below the 2.25 estimated by Tversky and Kahneman, 1992). The authors also re-analyzed an earlier data set by Mosteller and Nogee (1951) involving bets for amounts ranging from −30 to 30 cents, and it too showed utility curves that were symmetric for gains and losses.

Lichtenstein (1965) similarly used incentivized bets and small amounts. … Lichtenstein (1965) argued that “The preference for low V [variance] bets indicates that the utility curve for money is not symmetric in its extreme ranges; that is, that large losses appear larger than large wins.” (p. 168).
Thus, Lichtenstein (1965) interpreted her findings not as a general aversion to losses (which would include small losses and gains), but only as a tendency to overweight large losses relative to large gains. …

Slovic and Lichtenstein (1968) developed a regression-based approach to examine whether the participants’ willingness to pay (WTP) for a certain lottery is predicted more strongly by the size of its gains or the size of its losses. Their results showed that size of losses predicted WTP more than sizes of gains. … Moreover, in a follow-up study, Slovic (1969) found a reverse effect in hypothetical lotteries: Choices were better predicted by the gain amount than the loss amount. In the same study, he found no difference for incentivized lotteries in this respect. Similar findings of no apparent loss aversion were observed in studies that used probabilities that are learned from experience (Katz, 1963; Katz, 1964; Myers & Suydam, 1964).

In sum, the evidence for loss aversion at the time of the publication of prospect theory was relatively weak and limited to high-stakes gambles. As Yechiam notes, Kahneman and Tversky only turned their attention to specifically investigating loss aversion in 1992 – and even there it tended to involve large amounts.

Only in 1992 did Tversky and Kahneman (1992) and Redelmeier and Tversky (1992) start to empirically investigate loss aversion, and when they did, they used either very large amounts (Redelmeier & Tversky, 1992) or the so-called “list method” in which one chooses between lotteries with changing amounts up until choices switch from one alternative to the other (Tversky & Kahneman, 1992). This usage of high amounts would come to characterize most of the literature later arguing for loss aversion (e.g., Redelmeier & Tversky, 1992; Abdellaoui et al., 2007; Rabin & Weizsäcker, 2009) as would be the usage of decisions that are not incentivized (i.e., hypothetical; as discussed below).
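For concreteness, the 2.25 loss/gain ratio from Tversky and Kahneman (1992) is the loss aversion coefficient λ in their parametric value function, v(x) = x^α for gains and v(x) = −λ(−x)^α for losses, with median estimates α = 0.88 and λ = 2.25. A small sketch of what that kinked function implies for the 50:50 win-$110/lose-$100 bet from the start of this post (using linear probability weights for simplicity; their full model also applies probability weighting):

```python
def value(x, alpha=0.88, lam=2.25):
    """Tversky & Kahneman (1992) value function: concave for gains,
    convex for losses, and steeper for losses by the factor lam."""
    return x ** alpha if x >= 0 else -lam * (-x) ** alpha

# The kink at the reference point makes a $100 loss weigh far more
# than a $100 gain.
print(value(100))   # ≈ 57.5
print(value(-100))  # ≈ -129.5

# Prospect value of the 50:50 win-$110/lose-$100 bet: negative, so
# the bet is rejected even though its expected value is positive.
print(0.5 * value(110) + 0.5 * value(-100))  # ≈ -33.4
```

This is the sense in which loss aversion can explain rejection of the bet from the first dollar either side of the reference point, where curvature alone is negligible.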
I’ll examine the post-1979 evidence in more detail in a future post, but in the interim will note this observation from Yechiam on the more recent experiments.

In a review of the literature, Yechiam and Hochman (2013a) have shown that modern studies of loss aversion seem to be binomially distributed into those who used small or moderate amounts (up to $100) and large amounts (above $500). The former typically find no loss aversion, while the latter do. For example, Yechiam and Hochman (2013a) reviewed 11 studies using decisions from description (i.e., where participants are given exact information regarding the probability of gaining and losing money). From these studies, seven did not find loss aversion and all of them used loss/gain amounts of up to $100. Four did find loss aversion, and three of them used very high amounts (above $500 and typically higher). Thus, the usage of high amounts to produce loss aversion is maintained in modern studies.

The presence of loss aversion for only large-stakes gambles raises some interesting questions. In particular, are we actually observing the effect of “minimal requirements”, whereby a loss would push the bettor below some minimum threshold for, say, survival or other basic necessities? (Or at least a heuristic that operates with that intent?) This is a distinct concept from loss aversion as presented in prospect theory.

Finally – and a minor point on the claim that Yechiam’s paper was the beginning of the spread of the replication crisis to loss aversion – there is of course no direct experiment on loss aversion in the initial prospect theory paper to be replicated. A recent replication of the experiments in the 1979 paper had positive results (excepting some mixed results concerning the reflection effect). Replication of the 1979 paper doesn’t, however, provide any evidence on the replicability of loss aversion itself, nor the appropriate interpretation of the experiments.
On that point, in my next post on the topic I’ll turn to some of the alternative explanations for what appears to be loss aversion, particularly the claims of Gal and Rucker that losses do not loom larger than gains.

# David Leiser and Yhonatan Shemesh’s How We Misunderstand Economics and Why it Matters: The Psychology of Bias, Distortion and Conspiracy

From a new(ish) book by David Leiser and Yhonatan Shemesh, How We Misunderstand Economics and Why it Matters: The Psychology of Bias, Distortion and Conspiracy:

Working memory is a cognitive buffer, responsible for the transient holding, processing, and manipulation of information. This buffer is a mental store distinct from that required to merely hold in mind a number of items and its capacity is severely limited. The complexity of reasoning that can be handled mentally by a person is bounded by the number of items that can be kept active in working memory and the number of interrelationships between elements that can be kept active in reasoning. Quantifying these matters is complicated, but the values involved are minuscule, and do not exceed four distinct elements …

LTM [long-term memory] suffers from a different failing. … It seems there is ample room for our knowledge in the LTM. The real challenge relates to retrieval: people routinely fail to use knowledge that they possess – especially when there is no clear specification of what might be relevant, no helpful retrieval cue. …

The two flaws … interact with one another. Ideas and pieces of knowledge accumulate in LTM, but those bits often remain unrelated. Leiser (2001) argues that, since there is no process active in LTM to harmonize inconsistent parts, coordination between elements can only take place in working memory. And in view of its smallness, the scope of explanations is small too.
… Limited knowledge, unavailability of many of the relevant economic concepts and variables, and restricted mental processing power mean that incoherencies are to be expected, and they are indeed found. One of the most egregious is the tendency, noted by Furnham and Lewis (1986) who examined findings from the US, the UK, France, Germany, and Denmark, to demand both reductions in taxation and increased public expenditure (especially on schools, the sick, and the old). You can of course see why people would rather pay less in taxes, and also that they prefer to benefit from more services, but it is still surprising how often the link between the two is ignored. This is only possible because, to most people, taxes and services are two unrelated mental concepts, sitting as it were in different parts of LTM, a case of narrow scoping, called by McCaffery and Baron (2006) in this context an “isolation effect.”

Bastounis, Leiser, and Roland-Levy (2004) ran an extensive survey on economic beliefs in several countries (Austria, France, Greece, Israel, New Zealand, Slovenia, Singapore, and Turkey) among nearly 2000 respondents, and studied the correlations between answers to the different questions. No such broad clustering of opinions as that predicted by Salter was in evidence. Instead, the data indicate that lay economic thinking is organized around circumscribed economic phenomena, such as inflation and unemployment, rather than by integrative theories. Simply put, knowing their answers about one question about inflation was a fair predictor of their answer to another, but was not predictive of their views regarding unemployment.

A refreshing element of the book is that it draws on a much broader swathe of psychology than just the heuristics and biases literature, which often becomes the focus of stories on why people err. However, I was surprised by the lack of mention of intelligence.
A couple of other interesting snippets, the first on the ‘halo effect’:

The tendency to oversimplify complex judgments also manifests in the “halo” effect. … [K]nowing a few positive traits of a person leads us to attribute additional positive traits to them. … The halo effect comes from the tendency to rely on global affect, instead of discriminating among conceptually distinct and potentially independent attributes. This bias is unfortunate enough by itself, as it leads to the unwarranted attribution of traits to individuals. But it becomes even more pernicious when it blinds people to the possibility of tradeoffs, where two of the features are inversely correlated. To handle a tradeoff situation rationally, it is essential to disentangle the attributes, and to realize that if one increases the other decreases.

When contemplating an investment, for instance, a person must decide whether to invest in stocks (riskier, but with a greater potential return) or in bonds (safer, but offering lower potential returns). Why not go for the best of both worlds – and buy a safe investment that also yields high returns? Because no such gems are on offer. A basic rule in investment pricing is that risk and return are directly related, and for a good reason. … Strikingly, this relation is systematically violated when people are asked for an independent evaluation of their risk perception and return expectations. Shefrin (2002) asked portfolio managers, analysts, and MBA students for such assessments, and found, to his surprise, that expected return correlates inversely with perceived risk. Respondents appear to expect that riskier stocks will also produce lower returns than safer stocks.

This was confirmed experimentally by Ganzach (2000). In the simplest of his several experiments, participants received a list of (unfamiliar) international stock markets.
One group of participants was asked to judge the expected return of the market portfolio of these stock markets, and the other was asked to judge the level of risk associated with investing in these portfolios. … The relationship between judgments of risk and judgments of expected return, across the financial assets evaluated, was large and negative (Pearson r = −0.55).

Ganzach interprets this finding as showing that both perceived risk and expected return are derived from a global preference. If an asset is perceived as good, it will be judged to have both high return and low risk, whereas if it is perceived as bad, it will be judged to have both low return and high risk.

And on whether some examinations of economic comprehension are actually personality tests:

Leiser and Benita (in preparation) asked 300 people in the US for their view concerning economic fragility or stability, by checking the extent to which they agreed with the following sentences:

1. The economy is fundamentally sound, and will restore itself after occasional crises.
2. The economy is capable of absorbing limited shocks, but if the shocks are excessive, a major crisis and even collapse will ensue.
3. Deterioration in the economy, when it occurs, is a very gradual process.
4. The economy’s functioning is delicate, and always at a risk of collapse.
5. The economy is an intricate system, and it is all but impossible to predict how it will evolve.
6. Economic experts can ensure that the economy will regain stability even after major crises.

These questions relate to the economy, and respondents answered them first. But we then asked corresponding questions, with minimal variations of wording, about three other widely disparate domains: personal relationships, climate change, and health. Participants rated to what extent they agree with each of the statements about each additional domain.
The findings were clear: beliefs regarding economic stability are highly correlated with parallel beliefs in unrelated social and natural domains. People who believe that “The economy’s functioning is delicate, and always at a risk of collapse” tend to agree that “Close interpersonal relationships are delicate, and always at a risk of collapse” … And people who hold that “The economy is capable of absorbing limited shocks, but if the shocks are excessive, a major crisis will occur” also tend to judge that “The human body is capable of absorbing limited shocks, but beyond a certain intensity of illness, body collapse will follow.”

What we see in such cases is that people don’t assess the economy as an intelligible system. Instead, they express their general feelings towards dangers. … [T]hose who believe that the world is dangerous and who see an external locus of control see all four domains (economics, personal relations, health, and the environment) as unstable and unpredictable. Such judgments have little to do with an evaluation of the domain assessed, be it economic or something else. They attest personal traits, not comprehension.

# Nick Chater’s The Mind is Flat: The Illusion of Mental Depth and the Improvised Mind

Nick Chater’s The Mind is Flat: The Illusion of Mental Depth and the Improvised Mind is a great book. Chater’s basic argument is that there are no ‘hidden depths’ to our minds. The idea that we have an inner mental world with beliefs, motives and fears is just a work of imagination. As Chater puts it:

no one, at any point in human history, has ever been guided by inner beliefs or desires, any more than any human being has been possessed by evil spirits or watched over by a guardian angel.

The book represents Chater’s reluctant acceptance that much experimental psychological data can no longer be accommodated by simply extending and modifying existing theories of reasoning and decision making.
These theories are built on an intuitive conception of the mind, in which our thoughts and behaviour are rooted in reasoning and built on our deeply held beliefs and desires. As Chater argues, this intuitive conception is simply an illusion, which leads him to take his somewhat radical departure from many theories of perception, reasoning and decision making.

I have one major disagreement with the book, which turns out to be a fundamental disagreement with Chater’s central claim, but I’ll come to that later.

The visual illusion

Chater starts by examining visual perception. This is in part because visual perception is a (relatively) well understood area of psychology and neuroscience, and in part because Chater sees the whole of thought as being an extension of perception.

Consider our sense of colour vision. The sensitivity of colour vision falls rapidly outside of the fovea, the area of the retina responsible for our sharp central vision. The rod cells that capture most of our visual field are only able to distinguish light and dark. This means that outside of a few degrees of where you are looking, you are effectively colour blind. Despite this, we feel that our entire visual world is coloured. That is an illusion.

Similarly, our visual periphery is fuzzy. Our visual acuity plunges in line with decreasing cone density as the angle from the fovea increases. Yet, again, we have a sense that we can capture the entire scene before us.

That limited vision is highlighted in experiments using gaze-contingent eye-tracking. In one experiment, participants are asked to read lines of text. Rather than showing the full text, the computer only displayed a window of text where the experimental participants were looking, with all letters outside of that window replaced by blocks of ‘x’s. When someone is reading this text, they feel they are looking at a page or screen full of text. How small can the window of text be before this illusion is shattered?
It turns out, the window can be shrunk to around 10 to 15 characters (centred slightly right of the fixation point) without the reader sensing anything is amiss. This is despite the page being almost completely covered in ‘x’s. The sense that they are looking at a full page of text is an illusion, as most of the text isn’t there.

Chater walks through a range of other interesting experiments showing similar points. For instance, we can only encode one colour or shape or object at a time. The idea that we are looking at a rich coloured world, taking in all of the colours and shapes at once, is also an illusion. Our brain is not simultaneously grasping a whole, but is rather piecing together a stream of information. Yet we are fooled into believing we are having a rich sensory experience. We don’t actually see a broad, rich multi-coloured world. The sense that we do is a hoax.

So how can the mind execute this hoax? Chater suggests the answer is simply that as soon as we wonder about any aspect of the world, we can flick our eyes over and instantly provide an answer. The fluency of this process suggests to us that we already had the answers stored, but the experimental and physiological evidence suggests this cannot be the case. Put another way, the sense of a rich sensory world is actually just the potential to explore a rich sensory world. This potential is misinterpreted as actually experiencing that world.

An interesting question posed by Chater later in the book is why we don’t have any awareness of the brain’s mode of thought. Why don’t we sense the continually flickering snapshots generated by our visual system? His answer is that the brain’s goal is to inform us of the world around us. It is not to inform us about the workings of our own mechanisms for understanding it.

The inner world

So does the story change when we move from visual perception to our inner thoughts? Chater asks us to think of a tiger as clearly and distinctly as we can.
Consider the pattern of stripes on the tiger. Count them. Which way do they flow over the body? Along the length or vertically? What about on the legs?

Visually, we can only grasp fragments at a time, but each visual feature is available on demand, giving the impression that our vision encompasses the whole scene. A similar dynamic is at work for the imaginary tiger. Here the mind improvises the answer as soon as you ask for it. Until you ask the question, those details are entirely absent.

What happens when you compare your answer about the tiger’s stripes with a real tiger? For the real tiger, the front legs don’t have stripes. At the back legs the stripes rotate from horizontal around the leg to vertical around the body. The belly and inner legs are white. Were they part of the image in your mind? As we considered the tiger, we invented the answers to the questions we asked. What appeared to be a coherent image was constructed on the fly, in the same way our system of visual perception gives us answers as we need them.

In one chapter, Chater also argues that we invent our feelings. He describes experimental participants dosed with either adrenaline or a placebo and then placed in a waiting room with a stooge. The stooge was either manic (flying paper aeroplanes) or angry (reacting to a questionnaire they had to fill in while waiting). Those who had been adrenalised had stronger reactions to both stooges, but in opposite directions: euphoric with the manic stooge and irritated in the presence of the angry stooge. Chater argues that we interpret our emotions in the moment based on both the situation we are in and our own physiological state. By being an act of interpretation, having an emotion is an act of reasoning.

Improvising our preferences and beliefs

The core of Chater’s argument comes when he turns to our preferences and beliefs. And here he argues that we are still relentless improvisers.
The famous split brain research of Michael Gazzaniga provides evidence for the improvisation. A treatment for severe epilepsy is surgical severance of the corpus callosum that links the two hemispheres of the brain. This procedure prevents seizures from spreading from one hemisphere to the other, but also results in the two halves of the cortex functioning independently.

What if you show different images to the right and left halves of the visual field, which are processed in the opposite hemispheres of the brain (the crossover wiring to the brain means that the right hemisphere processes information in the left visual field, and vice versa)?

In one experiment Gazzaniga showed two images to a split brain patient, P.S. On the left hand side was a picture of a snowy scene. On the right was a picture of a chicken’s foot. P.S., like most of us, had his language abilities focused in the left hemisphere of the brain, so he could report seeing the chicken foot but was unable to say anything about the snowy scene.

P.S. was asked to pick one of four pictures associated with each of the images. The right hand, controlled by the left hemisphere, picked a chicken head to match the claw. The left hand picked out a shovel for the snow. And how did P.S. explain the choice of the shovel? ‘Oh that’s simple. The chicken claw goes with the chicken. And you need a shovel to clean out the chicken shed.’

An invented explanation. With no insight into the reason, the left hemisphere invents the explanation. This fluent explanation by split brain patients presents the possibility that after-the-fact explanation might also be the case for people with normal brains. Rather than explanations expressing inner preferences and beliefs, we make up reasons in retrospect to interpret our actions.
Chater proceeds to build his case that we don’t have such inner beliefs and preferences with some of the less convincing research in the book, much of which looks and feels like a lot of what has been questioned during the replication crisis. It is interesting all the same.

In one experiment, voters in Sweden were asked whether they intended to vote for the left or right-leaning coalition. They were then given a questionnaire on various campaign topics. When the responses were handed to the experimenter, the experimenter changed some of the responses by sleight of hand. When they were handed back for checking, just under a quarter of voters spotted and corrected the error. But the majority were happy to explain political opinions that moments ago they did not hold.

Chater also reports an experiment where the experimenters got a similar effect when asking people which of two faces they preferred. When the face was switched before asking for the explanation, the fluent explanation still emerged. An interesting twist to this experiment is when people who have justified a choice of face they didn’t make are asked to choose again. These people tend to choose the face that they didn’t choose previously but were asked to justify. The explanation helped shape future decisions.

A similar effect occurred in another experiment in which participants took a web-based survey on political attitudes, with half the participants presented with an American flag in the corner of the screen. The flag caused a shift in political attitudes. But more interestingly, this effect persisted eight months later. Chater’s interpretation of this experiment is not that Republicans should cover everything with flags. Rather, if people are exposed to a flag at a moment when they are contemplating their political views, this will have a long-lasting effect through the ‘memory traces’ that are laid down at the time.
When I read Chater’s summary of the experiment, my immediate reaction was that this was unlikely to replicate – and my reading of the original paper (PDF) firmed up my view. And it turns out there was a replication of the first flag priming experiment in the Many Labs project – no effect. (My reaction to the paper might have been shaped by previously reading the Many Labs paper but not immediately recalling that this particular experiment was included.) So let’s scrub this experiment from the list of evidence in support. If there’s no immediate effect, it’s hard to make a case for an effect eight months later. (Chater should have noted this given the replication was published in 2014.) This isn’t the only experiment reported by Chater with a failed replication in this section, although the other dates from after publication of the book. An experiment by Eldar Shafir that makes an appearance failed to replicate in Many Labs 2. One other piece of evidence called on by Chater is the broad (and strong) evidence of the inconsistency of our risk preferences and how susceptible they are to the framing of the risk and the domain in which they are realised. Present the same gamble in a loss rather than a gain frame, and risk-seeking choices spike. But putting these pieces together, I am not convinced Chater has made his case. The split brain experiments demonstrate our willingness to improvise explanations in the absence of any evidence. But this does not extend to an unequivocal case that we don’t call on any “hidden depths” that are there. Our preferences are variable, but are they so variable that they have no deeper basis at all? Chater thinks so:

[N]o amount of measuring and re-measuring is going to help. The problem with measuring risk preferences is not that measurement is difficult and inaccurate; it is that there are no risk preferences to measure – there is simply no answer to how, ‘deep down’, we wish to balance risk and reward.
And, while we’re at it, the same goes for the way people trade off the present against the future; how altruistic we are and to whom; how far we display prejudice on gender or race, and so on.

But this brings me to my major disagreement with Chater. For all Chater’s sweeping statements about our lack of hidden depths, he didn’t spend much effort trying to find them. Rather, he took a lot of evidence on how manipulable we can be (which we certainly are to a degree) and our willingness to improvise explanations when we have no idea (more robust), and then turned this into a finding that there is no hidden depth. One place Chater could have looked is behavioural genetics. The first law of behavioural genetics is that all behavioural traits are heritable. That is, a proportion of the variation in these characteristics between people is due to genetic variation. These traits include risk preferences, the way we trade off the present and the future, and political preferences. These are among the characteristics that Chater suggests have no hidden depth. If there is no hidden depth, why are identical twins (even raised apart) so similar on these traits? Chater is likely right that when asked to explain why we made a certain risky choice we are likely to improvise an explanation with little connection to reality. We rarely point to our genes. But that does not mean the hidden depth is not there.

We can only have one thought at a time

Once Chater has completed his argument about our lack of hidden depths, he turns to describing his version of how the mind actually works. And part of that answer is that the brain can only tackle one problem at a time. This inability to take on multiple tasks comes from the way that our brain computes when facing a difficult problem. Computation in the brain occurs through cooperation across the brain, with coordinated neural activity occurring across whole networks or entire regions of the brain.
This large cooperative activity between slow neurons means that a network can only work on one problem at a time. And the brain is close to one large network. Chater turns this idea into an attack on the “myth of the unconscious”. This myth is the idea that our brain is working away in the background. If we step away from a problem, we might suddenly have the answer pop into our head as our unconscious has kept working at the problem while we tend to other things. Chater argues that for all the stories about scientists suddenly having major breakthroughs in the shower, neuroscience has found no evidence of these hidden processes. Chater summarises the studies in this area as concluding that, first, the effects of breaks are either negligible or non-existent, and second, that the explanations for the minor effects of a break involve no unconscious thought at all. As one example of the lack of effect, Chater describes an experiment in which subjects are asked to name both as many food items and as many countries as possible. Someone doing this task might switch back and forth between the two topics, changing to foods when they run out of countries and vice versa. How would the performance of a person able to switch back and forth compare to someone who has to first deal with one category, and only when finished move to the other? Would the former outperform as they could think about the second category in the background before coming back to it? The results suggest that when thinking about countries, there is no evidence that we are also thinking about food. When we switch from one category to the other, the search ceases abruptly. So how did this myth of unconscious thought arise? Chater’s argument is that when we set a problem aside and return to it later, we are unencumbered by the past failures and patterns of thought in which we were trapped before.
The new perspective may not be better than the old, but occasionally it will hit upon the angle that we need to solve the problem. So yes, the insight may emerge in a flash, but not because the unconscious had been grinding away at the problem. This lack of unconscious thought is also demonstrated in the literature concerning inattentional blindness. If people are busy attending to a task, they can miss information that they are not attending to. The classic example of this (at least, before the gorilla experiment) is an experiment by Ulric Neisser, in which participants are asked to watch three people throwing a ball to each other and press a button each time there was a throw. When an unexpected event occurs – in this case a woman with an umbrella walking through the players – less than one quarter of the participants noticed. Chater takes the inattentional blindness studies as again showing that we can only lock onto and impose meaning on one fragment of sensory information at a time. If our brains are busy on one task, they can be utterly oblivious to other events. One distinction Chater makes that I found useful is how to think about our unconscious thought processes. Chater’s argument is not that there is no processing in the brain outside our conscious knowledge. Rather, we have one type of thought, with unconscious processing resulting in a conscious result. Chater writes:

The division between the conscious and the unconscious does not distinguish between different types of thought. Instead, it is a division within individual thoughts themselves: between the conscious result of our thinking and the unconscious processes that create it. There are no conscious thoughts and unconscious thoughts; and there are certainly no thoughts slipping in and out of consciousness. There is just one type of thought, and each such thought has two aspects: a conscious read-out, and unconscious processes generating the read-out.

So where do our actions come from?
So if there are no hidden depths, what drives us? Chater’s argument is that our thoughts come from memory traces created by previous thoughts and experiences. Each person is shaped by, and in effect unique due to, the uniqueness of their past thoughts and experiences. Thought follows channels carved by previous thoughts. This argument does in some ways suggest that we have an inner world. But that inner world is a record of the effect of past cycles of thought. It is not an inner world of beliefs, hopes and fears. As Chater states, the brain operates based on precedents, not principles. Chater’s first piece of evidence in support of this point comes from chess. What makes grandmasters special? It is not that humans are lightning calculating machines. Rather it is their long experience and their ability to find meaning in chess positions with great fluency. They can link the current position with memory traces of past board positions. They do not succeed by looking further ahead, but rather by drawing on a deeper memory bank and then focusing on only the best moves. Chater argues that this is how perception works more generally. We do not interpret sensory information afresh, but interpret based on memory traces from past experience. He gives the example of “found faces”, where people see faces in inanimate objects. Our interpretation of the inputs finds resonance with memory traces of past inputs. Similarly, recognising a friend, word or tune depends on a link with your memories. Successful perception requires us to deploy the right memory traces when we need them. Chater’s argument about the role of memory in perception seems sound. But absent a clear case that there are no other sources of beliefs or motivations, I am not convinced these memory traces are all that there is.

What this means for intelligence and AI

The final chapter of the book is Chater’s attempt to put a positive gloss on his argument.
It feels like the sort of chapter that the publisher might ask for to help with the promotion of the book. That positive gloss is human creativity. Chater writes:

But the secret of human intelligence is the ability to find patterns in the least structured, most unexpected, hugely variable of streams of information – to lock onto a handbag and see a snarling face; to lock onto a set of black-and-white patches and discern a distinctive, emotion-laden, human being; to find mappings and metaphors through the complexity and chaos of the physical and psychological worlds. All this is far beyond the reach of modern artificial intelligence.

I am not sure I agree. Vision recognition systems regularly make errors through seeing patterns that aren’t there. Are these just the machine version of seeing a face in a handbag? Both are mismatches, but one is labelled an imaginative leap, the other an error. Should we endow this overactive human pattern matching with the title of intelligence while calling similar matching errors by a computer a mistake? Chess is also instructive here, with great creativity now often being the sign of a machine move. This final chapter is somewhat shallow relative to the rest of the book. Chater provides little in the way of evidence to support his case, although you can piece together some threads supporting Chater yourself from the examples discussed earlier in the book. It ends the book with a nice hook, but for me was a flat ending for an otherwise great book.

# Debating the conjunction fallacy

From Eliezer Yudkowsky on Less Wrong (a few years old, but worth revisiting in the light of my recent Gigerenzer v Kahneman and Tversky post):

When a single experiment seems to show that subjects are guilty of some horrifying sinful bias – such as thinking that the proposition “Bill is an accountant who plays jazz” has a higher probability than “Bill is an accountant” – people may try to dismiss (not defy) the experimental data.
Most commonly, by questioning whether the subjects interpreted the experimental instructions in some unexpected fashion – perhaps they misunderstood what you meant by “more probable”. Experiments are not beyond questioning; on the other hand, there should always exist some mountain of evidence which suffices to convince you. Here is (probably) the single most questioned experiment in the literature of heuristics and biases, which I reproduce here exactly as it appears in Tversky and Kahneman (1982):

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

Please rank the following statements by their probability, using 1 for the most probable and 8 for the least probable:

(5.2) Linda is a teacher in elementary school.
(3.3) Linda works in a bookstore and takes Yoga classes.
(2.1) Linda is active in the feminist movement. (F)
(3.1) Linda is a psychiatric social worker.
(5.4) Linda is a member of the League of Women Voters.
(6.2) Linda is a bank teller. (T)
(6.4) Linda is an insurance salesperson.
(4.1) Linda is a bank teller and is active in the feminist movement. (T & F)

(The numbers at the start of each line are the mean ranks of each proposition, lower being more probable.)

How do you know that subjects did not interpret “Linda is a bank teller” to mean “Linda is a bank teller and is not active in the feminist movement”? For one thing, dear readers, I offer the observation that most bank tellers, even the ones who participated in anti-nuclear demonstrations in college, are probably not active in the feminist movement. So, even so, Teller should rank above Teller & Feminist. …

But the researchers did not stop with this observation; instead, in Tversky and Kahneman (1983), they created a between-subjects experiment in which either the conjunction or the two conjuncts were deleted.
Thus, in the between-subjects version of the experiment, each subject saw either (T&F), or (T), but not both. With a total of five propositions ranked, the mean rank of (T&F) was 3.3 and the mean rank of (T) was 4.4, N=86. Thus, the fallacy is not due solely to interpreting “Linda is a bank teller” to mean “Linda is a bank teller and not active in the feminist movement.”

Another way of knowing whether subjects have misinterpreted an experiment is to ask the subjects directly. Also in Tversky and Kahneman (1983), a total of 103 medical internists … were given problems like the following:

A 55-year-old woman had pulmonary embolism documented angiographically 10 days after a cholecystectomy. Please rank order the following in terms of the probability that they will be among the conditions experienced by the patient (use 1 for the most likely and 6 for the least likely). Naturally, the patient could experience more than one of these conditions.

• Dyspnea and hemiparesis
• Calf pain
• Pleuritic chest pain
• Syncope and tachycardia
• Hemiparesis
• Hemoptysis

As Tversky and Kahneman note, “The symptoms listed for each problem included one, denoted B, that was judged by our consulting physicians to be nonrepresentative of the patient’s condition, and the conjunction of B with another highly representative symptom denoted A. In the above example of pulmonary embolism (blood clots in the lung), dyspnea (shortness of breath) is a typical symptom, whereas hemiparesis (partial paralysis) is very atypical.” In indirect tests, the mean ranks of A&B and B respectively were 2.8 and 4.3; in direct tests, they were 2.7 and 4.6. In direct tests, subjects ranked A&B above B between 73% and 100% of the time, with an average of 91%. The experiment was designed to eliminate, in four ways, the possibility that subjects were interpreting B to mean “only B (and not A)”.
First, by carefully wording the instructions: “…the probability that they will be among the conditions experienced by the patient”, plus an explicit reminder, “the patient could experience more than one of these conditions”. Second, by including indirect tests as a comparison. Third, the researchers afterward administered a questionnaire:

In assessing the probability that the patient described has a particular symptom X, did you assume that (check one):

X is the only symptom experienced by the patient?
X is among the symptoms experienced by the patient?

60 of 62 physicians, asked this question, checked the second answer. Fourth and finally, as Tversky and Kahneman write, “An additional group of 24 physicians, mostly residents at Stanford Hospital, participated in a group discussion in which they were confronted with their conjunction fallacies in the same questionnaire. The respondents did not defend their answers, although some references were made to ‘the nature of clinical experience.’ Most participants appeared surprised and dismayed to have made an elementary error of reasoning.”

Does the conjunction fallacy arise because subjects misinterpret what is meant by “probability”? This can be excluded by offering students bets with payoffs. In addition to the colored dice discussed yesterday, subjects have been asked which possibility they would prefer to bet $10 on in the classic Linda experiment. This did reduce the incidence of the conjunction fallacy, but only to 56% (N=60), which is still more than half the students.

But the ultimate proof of the conjunction fallacy is also the most elegant. In the conventional interpretation of the Linda experiment, subjects substitute judgment of representativeness for judgment of probability: their feeling of similarity between each of the propositions and Linda’s description determines how plausible it feels that each of the propositions is true of Linda. …

You just take another group of experimental subjects, and ask them how much each of the propositions “resembles” Linda. This was done – see Kahneman and Frederick (2002) – and the correlation between representativeness and probability was nearly perfect: 0.99, in fact.

The conjunction fallacy is probably the single most questioned bias ever introduced, which means that it now ranks among the best replicated. The conventional interpretation has been nearly absolutely nailed down.

There are a few additional experiments in Yudkowsky’s post that I have not replicated here.

# Three algorithmic views of human judgment, and the need to consider more than algorithms

From Gerd Gigerenzer’s The bounded rationality of probabilistic mental models (PDF) (one of the papers mentioned in my recent post on the Kahneman and Tversky and Gigerenzer debate):

Defenders and detractors of human rationality alike have tended to focus on the issue of algorithms. Only their answers differ. Here are some prototypical arguments in the current debate.

Statistical algorithms

Cohen assumes that statistical algorithms … are in the mind, but distinguishes between not having a statistical rule and not applying such a rule, that is, between competence and performance. Cohen’s interpretation of cognitive illusions parallels J.J. Gibson’s interpretation of visual illusions: illusions are attributed to non-realistic experimenters acting as conjurors, and to other factors that mask the subjects’ competence: ‘unless their judgment is clouded at the time by wishful thinking, forgetfulness, inattentiveness, low intelligence, immaturity, senility, or some other competence-inhibiting factor, all subjects reason correctly about probability: none are programmed to commit fallacies or indulge in illusions’ … Cohen does not claim, I think, that people carry around the collected works of Kolmogoroff, Fisher, and Neyman in their heads, and merely need to have their memories jogged, like the slave in Plato’s Meno. But his claim implies that people do have at least those statistical algorithms in their competence that are sufficient to solve all reasoning problems studied in the heuristics and biases literature, including the Linda problem.

Non-statistical algorithms: heuristics

Proponents of the heuristics-and-biases programme seem to assume that the mind is not built to work by the rules of probability:

In making predictions and judgments under uncertainty, people do not appear to follow the calculus of chance or the statistical theory of prediction. Instead they rely on a limited number of heuristics which sometimes yield reasonable judgments and sometimes lead to severe and systematic errors.

(Kahneman and Tversky, 1973:237)

Cognitive illusions are explained by non-statistical algorithms, known as cognitive heuristics.

Statistical and non-statistical heuristics

Proponents of a third position do not want to be forced to choose between statistical and non-statistical algorithms, but want to have them both. Fong and Nisbett … argue that people possess both rudimentary but abstract intuitive versions of statistical principles such as the law of large numbers, and non-statistical heuristics such as representativeness. The basis for these conclusions is the results of training studies. For instance, the experimenters first teach the subject the law of large numbers or some other statistical principle, and subsequently also explain how to apply this principle to a real-world domain such as sports problems. Subjects are then tested on similar problems from the same or other domains. The typical result is that more subjects reason statistically, but transfer to domains not trained in is often low.

However, Gigerenzer argues that we need to consider more than just the mental algorithms.

Information needs representation. In order to communicate information, it has to be represented in some symbol system. Take numerical information. This information can be represented by the Arabic numeral system, by the binary system, by Roman numerals, or other systems. These different representations can be mapped in a one-to-one way, and are in this sense equivalent representations. But they are not necessarily equivalent for an algorithm. Pocket calculators, for instance, generally work on the Arabic base-10 system, whereas general purpose computers work on the base-2 system. The numerals 10000 and 32 are representations of the number thirty-two in the binary and Arabic systems, respectively. The algorithms of my pocket calculator will perform badly with the first kind of representation but work well on the latter.

The human mind finds itself in an analogous situation. The algorithms most Western people have stored in their minds – such as how to add, subtract and multiply – work well on Arabic numerals. But contemplate for a moment division in Roman numerals, without transforming them first into Arabic numerals.

There is more to the distinction between an algorithm and a representation of information. Not only are algorithms tuned to particular representations, but different representations make explicit different features of the same information. For instance, one can quickly see whether a number is a power of 10 in an Arabic numeral representation, whereas to see whether that number is a power of 2 is more difficult. The converse holds with binary numbers. Finally, algorithms are tailored to given representations. Some representations allow for simpler and faster algorithms than others. Binary representation, for instance, is better suited to electronic techniques than Arabic representation. Arabic numerals, on the other hand, are better suited to multiplication and elaborate mathematical algorithms than Roman numerals …
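Gigerenzer’s point about algorithms being tuned to representations can be made concrete with a short sketch (my illustration, not Gigerenzer’s): the same simple algorithm – “a leading 1 followed only by zeros” – detects powers of ten on the decimal representation and powers of two on the binary representation. Run on the wrong representation, it gives the wrong answer.

```python
def is_power_of_10(n):
    # On the decimal string, a power of ten is a '1' followed only by zeros.
    s = str(n)
    return n > 0 and s[0] == "1" and set(s[1:]) <= {"0"}

def is_power_of_2(n):
    # The same algorithm applied to the binary string detects powers of two.
    b = bin(n)[2:]
    return n > 0 and b[0] == "1" and set(b[1:]) <= {"0"}

print(is_power_of_10(1000))  # True
print(is_power_of_2(1000))   # False
print(is_power_of_2(1024))   # True
```

The algorithm is identical in both functions; only the representation handed to it differs, which is exactly the sense in which representations that are informationally equivalent are not equivalent for an algorithm.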

# Gigerenzer versus Kahneman and Tversky: The 1996 face-off

Through the late 1980s and early 1990s, Gerd Gigerenzer and friends wrote a series of articles critiquing Daniel Kahneman and Amos Tversky’s work on heuristics and biases. They hit hard. As Michael Lewis wrote in The Undoing Project:

Gigerenzer had taken the same angle of attack as most of their other critics. But in Danny and Amos’s view he’d ignored the usual rules of intellectual warfare, distorting their work to make them sound even more fatalistic about their fellow man than they were. He also downplayed or ignored most of their evidence, and all of their strongest evidence. He did what critics sometimes do: He described the object of his scorn as he wished it to be rather than as it was. Then he debunked his description. … “Amos says we absolutely must do something about Gigerenzer,” recalled Danny. … Amos didn’t merely want to counter Gigerenzer; he wanted to destroy him. (“Amos couldn’t mention Gigerenzer’s name without using the word ‘sleazeball,’ ” said UCLA professor Craig Fox, Amos’s former student.) Danny, being Danny, looked for the good in Gigerenzer’s writings. He found this harder than usual to do.

Kahneman and Tversky’s response to Gigerenzer’s work was published in 1996 in Psychological Review. It was one of the blunter responses you will read in academic debates, as the following passages indicate. From the first substantive section of the article:

It is not uncommon in academic debates that a critic’s description of the opponent’s ideas and findings involves some loss of fidelity. This is a fact of life that targets of criticism should learn to expect, even if they do not enjoy it. In some exceptional cases, however, the fidelity of the presentation is so low that readers may be misled about the real issues under discussion. In our view, Gigerenzer’s critique of the heuristics and biases program is one of these cases.

And the close:

As this review has shown, Gigerenzer’s critique employs a highly unusual strategy. First, it attributes to us assumptions that we never made … Then it attempts to refute our alleged position by data that either replicate our prior work … or confirm our theoretical expectations … These findings are presented as devastating arguments against a position that, of course, we did not hold. Evidence that contradicts Gigerenzer’s conclusion … is not acknowledged and discussed, as is customary; it is simply ignored. Although some polemic license is expected, there is a striking mismatch between the rhetoric and the record in this case.

Below are my notes put together on a 16-hour flight on the claims and counterclaims across Gigerenzer’s articles, the Kahneman and Tversky response in Psychological Review, and Gigerenzer’s rejoinder in the same issue. This represents my attempt to get my head around this debate and to understand the degree to which the heat is justified, not to give final judgment (although I do show my leanings). I don’t go to work published after the 1996 articles, although that might be for another day.

I will use Gigerenzer or Kahneman and Tversky’s words to make their arguments when I can. The core articles I refer to are:

Gigerenzer (1991) How to Make Cognitive Illusions Disappear: Beyond “Heuristics and Biases” (pdf)

Gigerenzer (1993) The bounded rationality of probabilistic mental models (pdf)

Kahneman and Tversky (1996) On the Reality of Cognitive Illusions (pdf)

Kahneman and Tversky (1996) Postscript (at the end of their 1996 paper)

Gigerenzer (1996) Postscript (at the end of his 1996 paper)

I recommend reading those articles, along with Kahneman and Tversky’s classic Science article (pdf) as background. (And note that the below debate and Gigerenzer’s critique only relates to two of the 12 “biases” covered in that paper.)

I touch on four of Gigerenzer’s arguments (using most of my word count on the first), although there are numerous other fronts:

• Argument 1: Does the use of frequentist rather than probabilistic representations make many of the so-called biases disappear? Despite appearances, Kahneman, Tversky and Gigerenzer largely agree on the answer to this question. However, it was largely Gigerenzer’s work that brought this to my attention, so there was clearly some value (for me) to Gigerenzer’s focus.
• Argument 2: Can you attribute probabilities to single events? Gigerenzer says no. Here there is a fundamental disagreement. I largely agree with Kahneman and Tversky as to whether this point is fatal to their work.
• Argument 3: Are Kahneman and Tversky’s norms content blind? For particular examples, yes. Generally? No.
• Argument 4: Should more effort be expended in understanding the underlying cognitive processes or mental models behind these various findings? This is where Gigerenzer’s argument is strongest, and I agree that many of Kahneman and Tversky’s proposed heuristics have weaknesses that need examination.

Putting these four together, I have sympathy for Gigerenzer’s way of thinking and ultimate program of work, but I am much less sympathetic to his desire to pull down Kahneman and Tversky’s findings on the way.

Now into the details.

Argument 1: Does the use of frequentist rather than probabilistic representations make many of the so-called biases disappear?

Gigerenzer argues that many biases involving probabilistic decision-making can be “made to disappear” by framing the problems in terms of frequencies rather than probabilities. The back-and-forth on this point centres on three major biases: overconfidence, the conjunction fallacy and base-rate neglect. I’ll take each in turn.

Overconfidence

A typical question from the overconfidence literature reads as follows:

Which city has more inhabitants?

How confident are you that your answer is correct?

50% 60% 70% 80% 90% 100%

After answering many questions of this form, the usual finding is that where people are 100% confident that they have the correct answer, they might be correct only 80% of the time. When 80% confident, they might get only 65% correct. This discrepancy is often called “overconfidence”. [I’ve written elsewhere about the need to disambiguate different forms of overconfidence.]
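The comparison being described is a calibration table: bucket answers by stated confidence, then compare each stated level with the realised proportion correct. A minimal sketch, with made-up data chosen to reproduce the pattern above (the numbers are illustrative, not from any study):

```python
from collections import defaultdict

# (stated confidence, answered correctly?) pairs -- illustrative data only
answers = [(1.0, True), (1.0, True), (1.0, True), (1.0, True), (1.0, False),
           (0.8, True), (0.8, True), (0.8, False), (0.8, False), (0.8, True)]

buckets = defaultdict(list)
for confidence, correct in answers:
    buckets[confidence].append(correct)

for confidence in sorted(buckets, reverse=True):
    hits = buckets[confidence]
    actual = sum(hits) / len(hits)
    # "Overconfidence" is the gap where actual < stated
    print(f"stated {confidence:.0%}: correct {actual:.0%}")
# stated 100%: correct 80%
# stated 80%: correct 60%
```

Gigerenzer’s frequency question (“How many of these 50 do you think you got right?”) sidesteps this table entirely: it elicits one aggregate estimate and compares it with the aggregate hit rate, rather than comparing per-item confidences with bucketed accuracy.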

There are numerous explanations for this overconfidence, such as confirmation bias, although in Gigerenzer’s view this is “a robust fact waiting for a theory”.

But what if we take a different approach to this problem? Gigerenzer (1991) writes:

Assume that the mind is a frequentist. Like a frequentist, the mind should be able to distinguish between single-event confidences and relative frequencies in the long run.

This view has testable consequences. Ask people for their estimated relative frequencies of correct answers and compare them with true relative frequencies of correct answers, instead of comparing the latter frequencies with confidences.

He tested this idea as follows:

After a set of 50 general knowledge questions, we asked the same subjects, “How many of these 50 questions do you think you got right?”. Comparing their estimated frequencies with actual frequencies of correct answers made “overconfidence” disappear. …

The general point is (i) a discrepancy between probabilities of single events (confidences) and long-run frequencies need not be framed as an “error” and called “overconfidence bias”, and (ii) judgments need not be “explained” by a flawed mental program at a deeper level, such as “confirmation bias”.

Kahneman and Tversky agree:

May (1987, 1988) was the first to report that whereas average confidence for single items generally exceeds the percentage of correct responses, people’s estimates of the percentage (or frequency) of items that they have answered correctly is generally lower than the actual number. … Subsequent studies … have reported a similar pattern although the degree of underconfidence varied substantially across domains.

Gigerenzer portrays the discrepancy between individual and aggregate assessments as incompatible with our theoretical position, but he is wrong. On the contrary, we drew a distinction between two modes of judgment under uncertainty, which we labeled the inside and the outside views … In the outside view (or frequentistic approach) the case at hand is treated as an instance of a broader class of similar cases, for which the frequencies of outcomes are known or can be estimated. In the inside view (or single-case approach) predictions are based on specific scenarios and impressions of the particular case. We proposed that people tend to favor the inside view and as a result underweight relevant statistical data. …

The preceding discussion should make it clear that, contrary to Gigerenzer’s repeated claims, we have neither ignored nor blurred the distinction between judgments of single and of repeated events. We proposed long ago that the two tasks induce different perspectives, which are likely to yield different estimates, and different levels of accuracy (Kahneman and Tversky, 1979). As far as we can see, Gigerenzer’s position on this issue is not different from ours, although his writings create the opposite impression.

So we leave this point with a degree of agreement.

Conjunction fallacy

The most famous illustration of the conjunction fallacy is the “Linda problem”. Subjects are shown the following vignette:

Linda is 31 years old, single, outspoken and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in antinuclear demonstrations.

They are then asked which of the following two alternatives was more probable (either as just those two options, as part of a longer list of options, or across different experimental subjects):

Linda is a bank teller
Linda is a bank teller and is active in the feminist movement

In the original Tversky and Kahneman experiment, when shown only those two options, 85% of subjects chose the second. Tversky and Kahneman argued this was an error, as the probability of a conjunction of two events can never be greater than the probability of either of its constituents.
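The conjunction rule itself is easy to verify by simulation. A minimal sketch, where the probabilities are purely illustrative (they are not estimates about Linda):

```python
import random

random.seed(0)

# Illustrative, made-up probabilities for the two attributes.
P_TELLER, P_FEMINIST = 0.05, 0.30

n = 100_000
teller_count = 0
both_count = 0
for _ in range(n):
    is_teller = random.random() < P_TELLER
    is_feminist = random.random() < P_FEMINIST
    teller_count += is_teller
    both_count += is_teller and is_feminist

# The conjunction "teller AND feminist" can never occur more often
# than "teller" alone, whatever the underlying probabilities.
assert both_count <= teller_count
print(both_count / n, "<=", teller_count / n)
```

Whatever probabilities are plugged in, the count for the conjunction can never exceed the count for its constituent, which is exactly the inclusion relation Tversky and Kahneman's norm rests on.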

Once again Gigerenzer reframed for the frequentist mind (quoting from the 1996 article):

There are 100 persons who fit the description above (i.e. Linda’s). How many of them are:

(a) bank tellers
(b) bank tellers and active in the feminist movement.

As Gigerenzer states:

If the problem is phrased in this (or a similar) frequentist way, then the “conjunction fallacy” largely disappears.

The postulated representativeness heuristic cannot account for this dramatic effect.

Gigerenzer’s 1993 article expands on this latter point:

If the mind solves the problem using a representative heuristic, changes in representation should not matter, because they do not change the degree of similarity. … Subjects therefore should still exhibit the conjunction fallacy.

Kahneman and Tversky’s response starts with the note that their first demonstration of the conjunction fallacy involved judgments of frequency. They asked subjects:

to estimate the number of “seven-letter words of the form ‘—–n-‘ in 4 pages of text.” Later in the same questionnaire, those subjects estimated the number of “seven-letter words of the form ‘—-ing’ in 4 pages of text.” Because it is easier to think of words ending with “ing” than to think of words with “n” in the next-to-last position, availability suggests that the former will be judged more numerous than the latter, in violation of the conjunction rule. Indeed, the median estimate for words ending with “ing” was nearly three times higher than for words with “n” in the next-to-the-last position. This finding is a counter-example to Gigerenzer’s often repeated claim that conjunction errors disappear in judgments of frequency, but we have found no mention of it in his writings.

Here Gigerenzer stretches his defence of human consistency a step too far:

[T]he effect depends crucially on presenting the two alternatives to a participant at different times, that is, with a number (unspecified in their reports) of other tasks between the alternatives. This does not seem to be a violation of internal consistency, which I take to be the point of the conjunction fallacy.

Kahneman and Tversky also point out that they had studied the effect of frequencies in other contexts:

We therefore turned to the study of cues that may encourage extensional reasoning and developed the hypothesis that the detection of inclusion could be facilitated by asking subjects to estimate frequencies. To test this hypothesis, we described a health survey of 100 adult men and asked subjects, “How many of the 100 participants have had one or more heart attacks?” and “How many of the 100 participants both are over 55 years old and have had one or more heart attacks?” The incidence of conjunction errors in this problem was only 25%, compared to 65% when the subjects were asked to estimate percentages rather than frequencies. Reversing the order of the questions further reduced the incidence to 11%.

Kahneman and Tversky go on to state:

Gigerenzer has essentially ignored our discovery of the effect of frequency and our analysis of extensional cues. As primary evidence for the “disappearance” of the conjunction fallacy in judgments of frequency, he prefers to cite a subsequent study by Fiedler (1988), who replicated both our procedure and our findings, using the bank-teller problem. … In view of our prior experimental results and theoretical discussion, we wonder who alleged that the conjunction fallacy is stable under this particular manipulation.

Gigerenzer concedes, but then turns to Kahneman and Tversky’s lack of focus on this result:

It is correct that they demonstrated the effect on conjunction violations first (but not for overconfidence bias and the base-rate fallacy). Their accusation, however, is out of place, as are most others in their reply. I referenced their demonstration in every one of the articles they cited … It might be added that Tversky and Kahneman (1983) themselves paid little attention to this result, which was not mentioned once in some four pages of discussion.

A debate about who was first and how much focus each gave to the findings is not substantive, but Kahneman and Tversky (1996) do not leave this problem here. While frequency representations can reduce error when there is the possibility of direct comparison (the same subject sees and provides frequencies for both alternatives), they have less effect in between-subjects experimental designs; that is, where one set of subjects sees one of the options and another set sees the other:

Linda is in her early thirties. She is single, outspoken, and very bright. As a student she majored in philosophy and was deeply concerned with issues of discrimination and social justice.

Suppose there are 1,000 women who fit this description. How many of them are

(a) high school teachers?

(b) bank tellers? or

(c) bank tellers and active feminists?”

One group of Stanford students (N = 36) answered the above three questions. A second group (N = 33) answered only questions (a) and (b), and a third group (N = 31) answered only questions (a) and (c). Subjects were provided with a response scale consisting of 11 categories in approximately logarithmic spacing. As expected, a majority (64%) of the subjects who had the opportunity to compare (b) and (c) satisfied the conjunction rule. In the between-subjects comparison, however, the estimates for feminist bank tellers (median category: “more than 50”) were significantly higher than the estimates for bank tellers … Contrary to Gigerenzer’s position, the results demonstrate a violation of the conjunction rule in a frequency formulation. These findings are consistent with the hypothesis that subjects use representativeness to estimate outcome frequencies and edit their responses to obey class inclusion in the presence of strong extensional cues.

Gigerenzer in part concedes, and in part battles on:

Hence, Kahneman and Tversky (1996) believe that the appropriate reply is to show that frequency judgments can also fail. There is no doubt about the latter …

[T]he between subjects version of the Linda problem is not a violation of internal consistency, because the effect depends on not presenting the two alternatives to the same subject.

It’s right not to describe this as a violation of internal consistency, but as evidence that representativeness affects judgement, and does so even with frequentist representations, it makes a good case. It is also difficult to argue that the subjects are making a good judgment. Kahneman and Tversky write:

Gigerenzer appears to deny the relevance of the between-subjects design on the ground that no individual subject can be said to have committed an error. In our view, this is hardly more reasonable than the claim that a randomized between-subject design cannot demonstrate that one drug is more effective than another because no individual subject has experienced the effects of both drugs.

Kahneman and Tversky write further in the postscript, possibly conceding on language but not on their substantive point:

This formula will not do. Whether or not violations of the conjunction rule in the between-subjects versions of the Linda and “ing” problems are considered errors, they require explanation. These violations were predicted from representativeness and availability, respectively, and were observed in both frequency and probability judgments. Gigerenzer ignores this evidence for our account and offers no alternative.

I’m with Kahneman and Tversky here.

Base-rate neglect

Base-rate neglect (or the base-rate fallacy) describes situations where a known base rate of an event or characteristic in a reference population is under-weighted, with undue focus given to specific information on the case at hand. An example is as follows:

If a test to detect a disease whose prevalence is 1/1000 has a false positive rate of 5%, what is the chance that a person found to have a positive result actually has the disease, assuming you know nothing about the person’s symptoms or signs?

The typical result is that around half of those asked, even among medical professionals, will guess a probability of 95%, with fewer than a quarter giving the correct answer of around 2%. The positive result, which has associated errors, is weighted too heavily relative to the base rate of one in a thousand.
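The 2% answer falls straight out of Bayes’ rule. A minimal sketch, assuming (as the problem’s standard reading does) that the test detects the disease every time it is present:

```python
# Bayes' rule for the disease-test problem:
# P(disease | positive) = P(pos | disease) * P(disease) / P(positive)
prevalence = 1 / 1000        # base rate of the disease
sensitivity = 1.0            # assume the test never misses the disease
false_positive_rate = 0.05   # 5% of healthy people test positive

p_positive = (sensitivity * prevalence
              + false_positive_rate * (1 - prevalence))
posterior = sensitivity * prevalence / p_positive

print(f"P(disease | positive) = {posterior:.3f}")  # ≈ 0.020, i.e. about 2%
```

The intuition for the result: almost everyone who tests positive is a healthy person caught by the 5% false positive rate, because healthy people outnumber the sick a thousand to one.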

Gigerenzer (1991) once again responds with the potential of a frequentist representation to eliminate the bias, drawing on work by Cosmides and Tooby (1990) [The 1990 paper was an unpublished conference paper, but this work was later published here (pdf)]:

One out of 1000 Americans has disease X. A test has been developed to detect when a person has disease X. Every time the test is given to a person who has the disease, the test comes out positive. But sometimes the test also comes out positive when it is given to a person who is completely healthy. Specifically, out of every 1000 people who are perfectly healthy, 50 of them test positive for the disease.

Imagine that we have assembled a random sample of 1000 Americans. They were selected by a lottery. Those who conducted the lottery had no information about the health status of any of these people. How many people who test positive for the disease will actually have the disease? — out of —.

The result:

If the question was rephrased in a frequentist way, as shown above, then the Bayesian answer of 0.02 – that is, the answer “one out of 50 (or 51)” – was given by 76% of the subjects. The “base-rate fallacy” disappeared.
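Part of why the frequency format helps is that it reduces the same Bayesian calculation to simple counting:

```python
# Out of 1,000 people: 1 has the disease and tests positive (the test
# never misses); of the 999 healthy people, about 50 test positive
# (Gigerenzer's wording rounds this to 50 per 1,000 healthy people).
true_positives = 1
false_positives = 50

share = true_positives / (true_positives + false_positives)
print(f"{true_positives} out of {true_positives + false_positives} "
      f"positives actually have the disease ({share:.3f})")  # 1 out of 51 ≈ 0.020
```

No division by a normalising probability is needed; the subject just compares the one genuinely sick positive against the roughly 50 healthy positives.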

Kahneman and Tversky (1996) do not respond to this particular example, beyond a footnote:

Cosmides and Tooby (1996) have shown that a frequentistic formulation also helps subjects solve a base-rate problem that is quite difficult when framed in terms of percentages or probabilities. Their result is readily explained in terms of extensional cues to set inclusion. These authors, however, prefer the speculative interpretation that evolution has favored reasoning with frequencies but not with percentages.

It seems we have agreement on the effect, although a differing interpretation.

Kahneman and Tversky, however, more directly attack the idea that people are natural frequentists.

He [Gigerenzer] offers a hypothetical example in which a physician in a nonliterate society learns quickly and accurately the posterior probability of a disease given the presence or absence of a symptom. … However, Gigerenzer’s speculation about what a nonliterate physician might learn from experience is not supported by existing evidence. Subjects in an experiment reported by Gluck and Bower (1988) learned to diagnose whether a patient has a rare (25%) or a common (75%) disease. For 250 trials the subjects guessed the patient’s disease on the basis of a pattern of four binary symptoms, with immediate feedback. Following this learning phase, the subjects estimated the relative frequency of the rare disease, given each of the four symptoms separately.

If the mind is “a frequency monitoring device,” as argued by Gigerenzer …, we should expect subjects to be reasonably accurate in their assessments of the relative frequencies of the diseases, given each symptom. Contrary to this naive frequentist prediction, subjects’ judgments of the relative frequency of the two diseases were determined entirely by the diagnosticity of the symptom, with no regard for the base-rate frequencies of the diseases. … Contrary to Gigerenzer’s unqualified claim, the replacement of subjective probability judgments by estimates of relative frequency and the introduction of sequential random sampling do not provide a panacea against base-rate neglect.

Gigerenzer (1996) responds:

Concerning base-rate neglect, Kahneman and Tversky … created the impression that there is little evidence that certain types of frequency formats improve Bayesian reasoning. They do not mention that there is considerable evidence (e.g., Gigerenzer & Hoffrage, 1995) and back their disclaimer principally with a disease-classification study by Gluck and Bower (1988), which they summarized thus: “subjects’ judgments of the relative frequency . . . were determined entirely by the diagnosticity of the symptom, with no regard for the base-rate frequencies of the diseases” … To set the record straight, Gluck and Bower said their results were consistent with the idea that “base-rate information is not ignored, only underused” (p. 235). Furthermore, their study was replicated and elaborated on by Shanks (1991), who concluded that “we have no conclusive evidence for the claim . . . that systematic base-rate neglect occurs in this type of situation” (p. 153). Adding up studies in which base-rate neglect appears or disappears will lead us nowhere.

Gigerenzer is right that Kahneman and Tversky were overly strong in their description of the findings of the Gluck and Bower study, but Gigerenzer’s conclusion seems close to that of Kahneman and Tversky. As Kahneman and Tversky wrote:

[I]t is evident that subjects sometimes use explicitly mentioned base-rate information to a much greater extent than they did in our original engineer-lawyer study [another demonstration of base-rate neglect], though generally less than required by Bayes’ rule.

Argument 2: Can you attribute probabilities to single events?

While I leave the question of frequency representations with a degree of agreement, Gigerenzer has a deeper critique of Kahneman and Tversky’s findings. From his 1993 article:

Is the conjunction fallacy a violation of probability theory? Has a person who chooses T&F violated probability theory? The answer is no, if the person is a frequentist such as Richard von Mises or Jerzy Neyman; yes, if he or she is a subjectivist such as Bruno de Finetti; and open otherwise.

The mathematician Richard von Mises, one of the founders of the frequency interpretation, used the following example to make his point:

We can say nothing about the probability of death of an individual even if we know his condition of life and health in detail. The phrase ‘probability of death’, when it refers to a single person, has no meaning at all for us. This is one of the most important consequences of our definition of probability.

(von Mises, 1957/1928: 11)

In this frequentist view, one cannot speak of a probability unless a reference class has been defined. … Since a person is always a member of many reference classes, no unique relative frequency can be assigned to a single person. … Thus, for a strict frequentist, the laws of probability are about frequencies and not about single events such as whether Linda is a bank teller. Therefore, in this view, no judgement about single events can violate probability theory.

… Seen from the Bayesian point of view, the conjunction fallacy is an error.

Thus, choosing T&F in the Linda problem is not a reasoning error. What has been labelled the ‘conjunction fallacy’ here does not violate the laws of probability. It only looks so from one interpretation of probability.

He writes in his 1991 article somewhat more strongly (here talking in the context of overconfidence):

For a frequentist like the mathematician Richard von Mises, the term “probability”, when it refers to a single event, “has no meaning at all for us” … Probability is about frequencies, not single events. To compare the two means comparing apples with oranges.

Even the major opponents of the frequentists – subjectivists such as Bruno de Finetti – would not generally think of a discrepancy between confidence and relative frequency as a “bias”, albeit for different reasons. For a subjectivist, probability is about single events, but rationality is identified with the internal consistency of subjective probabilities. As de Finetti emphasized, “however an individual evaluates the probability of a particular event, no experience can prove him right, or wrong; nor, in general, could any conceivable criterion give any objective sense to the distinction one would like to draw, here, between right and wrong” …

Kahneman and Tversky address this argument across a few of the biases under debate. First, on conjunction errors:

Whether or not it is meaningful to assign a definite numerical value to the probability of survival of a specific individual, we submit (a) that this individual is less likely to die within a week than to die within a year and (b) that most people regard the preceding statement as true—not as meaningless—and treat its negation as an error or a fallacy.

In response, Gigerenzer makes an interesting point that someone asked that question might make a different inference:

One can easily create a context, such as a patient already on the verge of dying, that would cause a sensible person to answer that this patient is more likely to die within a week (inferring that the question is next week versus the rest of the year, because the question makes little sense otherwise). In the same fashion, the Linda problem creates a context (the description of Linda) that makes it perfectly valid not to conform to the conjunction rule.

I think Gigerenzer is right that if you treat the problem as content-blind you might miss the inference the subjects are drawing from the question (more on content-blind norms below). But conversely, Kahneman and Tversky’s general point appears sound.

Kahneman and Tversky also address this frequentist argument in relation to over-confidence:

Proper use of the probability scale is important because this scale is commonly used for communication. A patient who is informed by his surgeon that she is 99% confident in his complete recovery may be justifiably upset to learn that when the surgeon expresses that level of confidence, she is actually correct only 75% of the time. Furthermore, we suggest that both surgeon and patient are likely to agree that such a calibration failure is undesirable, rather than dismiss the discrepancy between confidence and accuracy on the ground that “to compare the two means comparing apples and oranges”

Gigerenzer’s response here is amusing:

Kahneman and Tversky argued that the reluctance of statisticians to make probability theory the norm of all single events “is not generally shared by the public” (p. 585). If this was meant to shift the burden of justification for their norms from the normative theory of probability to the intuitions of ordinary people, it is exceedingly puzzling. How can people’s intuitions be called upon to substitute for the standards of statisticians, in order to prove that people’s intuitions systematically violate the normative theory of probability?

Kahneman and Tversky did not respond to this particular argument, but several points could be made in their favour. First, and as noted above, there can still be errors under frequentist representations. Even if we discard the results with judgments of probability for single events, there is still a strong case for the use of heuristics leading to the various biases.

Second, if a surgeon states they are confident that someone has a 99% probability of complete recovery when they are right only 75% of the time, they are making one of two errors. Either they are making a probability estimate of a single event, which has no meaning at all according to Gigerenzer and von Mises, or they are poorly calibrated according to Kahneman and Tversky.

Third, whatever the philosophically or statistically correct position, we have a practical problem. We have judgements being made and communicated, with subsequent decisions based on those communications. To the extent there are weaknesses in that chain, we will have sub-optimal outcomes.

Putting this together, I feel this argument leaves us at a philosophical impasse, but Kahneman and Tversky’s angle provides scope for practical application and better outcomes. (Look at the training for the Good Judgment Project and associated improvements in forecasting that resulted).

Argument 3: Are Kahneman and Tversky’s norms content blind?

An interesting question about the norms against which Kahneman and Tversky assess the experimental subjects’ heuristics and biases is whether the norms are blind to the content of the problem. Gigerenzer (1996) writes:

[O]n Kahneman and Tversky’s (1996) view of sound reasoning, the content of the Linda problem is irrelevant; one does not even need to read the description of Linda. All that counts are the terms probable and and, which the conjunction rule interprets in terms of mathematical probability and logical AND, respectively. In contrast, I believe that sound reasoning begins by investigating the content of a problem to infer what terms such as probable mean. The meaning of probable is not reducible to the conjunction rule … For instance, the Oxford English Dictionary … lists “plausible,” “having an appearance of truth,” and “that may in view of present evidence be reasonably expected to happen,” among others. … Similarly, the meaning of and in natural language rarely matches that of logical AND. The phrase T&F can be understood as the conditional “If Linda is a bank teller, then she is active in the feminist movement.” Note that this interpretation would not concern, and therefore could not violate, the conjunction rule.

This is a case where I believe Gigerenzer makes an interesting point on the specific case but is wrong on the broader point. As a start, in discussing their initial results for their 1983 paper, Kahneman and Tversky asked whether people were interpreting the language in different ways, such as taking “Linda is a bank teller” to mean “Linda is a bank teller and not active in the feminist movement.” They considered the content of their problem and ran different experimental specifications to attempt to understand what was occurring.

But as Kahneman and Tversky state in their postscript, critiquing the Linda problem on this point – and only the within subjects experimental design at that – is a narrow view of their work. The point of the Linda problem is to test whether the representativeness of the description changes the assessment. As they write in their 1996 paper:

Perhaps the most serious misrepresentation of our position concerns the characterization of judgmental heuristics as “independent of context and content” … and insensitive to problem representation … Gigerenzer also charges that our research “has consistently neglected Feynman’s (1967) insight that mathematically equivalent information formats need not be psychologically equivalent” … Nothing could be further from the truth: The recognition that different framings of the same problem of decision or judgment can give rise to different mental processes has been a hallmark of our approach in both domains.

The peculiar notion of heuristics as insensitive to problem representation was presumably introduced by Gigerenzer because it could be discredited, for example, by demonstrations that some problems are difficult in one representation (probability), but easier in another (frequency). However, the assumption that heuristics are independent of content, task, and representation is alien to our position, as is the idea that different representations of a problem will be approached in the same way.

This is a point where you need to look across the full set of experimental findings, rather than critiquing them one-by-one. Other experiments have people violating the conjunction rule while betting on sequences generated by a die, where there were no such confusions to be had about the content.

Much of the issue is also one of focus. Kahneman and Tversky have certainly investigated the question of how representation changes the approach to a problem. However, it is set out in a different way than Gigerenzer might have liked.

Argument 4: Should more effort be expended in understanding the underlying cognitive processes or mental models behind these various findings?

We now come to an important point: what is the cognitive process behind all of these results? Gigerenzer (1996) writes:

Kahneman and Tversky (1996) reported various results to play down what they believe is at stake, the effect of frequency. In no case was there an attempt to figure out the cognitive processes involved. …

Progress can be made only when we can design precise models that predict when base rates are used, when not, and why

I can see why Kahneman and Tversky focus on the claims regarding frequency representations when Gigerenzer makes such strong statements about making biases “disappear”. The statement that in no case have they attempted to figure out the cognitive processes involved is also overly strong, as a case could be made that the heuristics are those processes.

However, Gigerenzer believes Kahneman and Tversky’s heuristics are too vague for this purpose. Gigerenzer (1996) wrote:

The heuristics in the heuristics-and-biases program are too vague to count as explanations. … The reluctance to specify precise and falsifiable process models, to clarify the antecedent conditions that elicit various heuristics, and to work out the relationship between heuristics have been repeatedly pointed out … The two major surrogates for modeling cognitive processes have been (a) one-word-labels such as representativeness that seem to be traded as explanations and (b) explanation by redescription. Redescription, for instance, is extensively used in Kahneman and Tversky’s (1996) reply. … Why does a frequency representation cause more correct answers? Because “the correct answer is made transparent” (p. 586). Why is that? Because of “a salient cue that makes the correct answer obvious” (p. 586), or because it “sometimes makes available strong extensional cues” (p. 589). Researchers are no closer to understanding which cues are more “salient” than others, much less the underlying process that makes them so.

The reader may now understand why Kahneman and Tversky (1996) and I construe this debate at different levels. Kahneman and Tversky centered on norms and were anxious to prove that judgment often deviates from those norms. I am concerned with understanding the processes and do not believe that counting studies in which people do or do not conform to norms leads to much. If one knows the process, one can design any number of studies wherein people will or will not do well.

This passage by Gigerenzer captures the state of the debate well. However, Kahneman and Tversky are relaxed about the lack of full specification, and sceptical that process models are the approach to provide that detail. They write in the 1996 postscript:

Gigerenzer rejects our approach for not fully specifying the conditions under which different heuristics control judgment. Much good psychology would fail this criterion. The Gestalt rules of similarity and good continuation, for example, are valuable although they do not specify grouping for every display. We make a similar claim for judgmental heuristics.

Gigerenzer legislates process models as the primary way to advance psychology. Such legislation is unwise. It is useful to remember that the qualitative principles of Gestalt psychology long outlived premature attempts at modeling. It is also unwise to dismiss 25 years of empirical research, as Gigerenzer does in his conclusion. We believe that progress is more likely to come by building on the notions of representativeness, availability, and anchoring than by denying their reality.

To me, this is the most interesting point of the debate. I have personally struggled to grasp the precise operation of many of Kahneman and Tversky’s heuristics and how their operation would change across various domains. But are more precisely specified models the way forward? Which are best at explaining the available data? We have now had over 20 years of work since this debate to see if this is an unwise or fruitful pursuit. But that’s a question for another day.

# Barry Schwartz’s The Paradox of Choice: Why More Is Less

I typically find the argument that increased choice in the modern world is “tyrannising” us to be less than compelling. On this blog, I have approvingly quoted Jim Manzi’s warning against extrapolating the results of an experiment on two Saturdays in a particular store – the famous jam experiment – into “grandiose claims about the benefits of choice to society.” I recently excerpted a section from Bob Sugden’s excellent The Community of Advantage: A Behavioural Economist’s Defence of the Market on the idea that choice restriction “appeals to culturally conservative or snobbish attitudes of condescension towards some of the preferences to which markets cater.”

Despite this, I liked a lot of Barry Schwartz’s The Paradox of Choice: Why More Is Less. I still disagree with some of Schwartz’s recommendations, with his view that the “free market” undermines our well-being, and with his argument that areas such as “education, meaningful work, social relations, medical care” should not be addressed through markets. I believe he shows a degree of condescension toward other people’s preferences. However, I found that for much of the diagnosis of the problem I agreed with Schwartz, even if that doesn’t always extend to recommending the same treatment.

Schwartz’s basic argument is that increased choice can negatively affect our wellbeing. It can damage the quality of our decisions. We often regret our decisions when we see the trade-offs involved in our choice, with those trade-offs often multiplying with increased choice. We adapt to the consequences of our choices, meaning that the high costs of search may not be recovered.

The result is that we are not satisfied with our choices. Schwartz argues that once our basic needs are met, much of what we are trying to achieve is satisfaction. So if the new car, phone or brand of salad dressing don’t deliver satisfaction, are we worse off?

The power of Schwartz’s argument varies with the domain. When he deals with shopping, it is easy to see that the choices would be overwhelming to someone who wanted to examine all of the options (do we need all 175 salad dressings that are on display?). People report that they are enjoying shopping less, despite shopping more. But it is hard to feel that a decline in our enjoyment of shopping or the confusion we face looking at a sea of salad dressings is a serious problem.

Schwartz spends little time examining the benefits of increased consumer choice for individuals whose preferences are met, or the effect of the accompanying competition on price and quality. Schwartz has another book in which he tackles the problems with markets, so having not read it I can’t say he doesn’t have a case. But that case is absent from The Paradox of Choice.

In fairness to Schwartz, he does state that it is a big jump to extrapolate from the increased complexity of shopping to claims that too much choice can “tyrannise”. Schwartz even notes that we do OK with many consumer choices. We implicitly execute strategies such as picking the same product each time.

Schwartz’s argument is more compelling when we move beyond consumer goods into important high-stakes decisions such as those about our health, retirement or work. A poor choice there can have large effects on both outcomes and satisfaction. These choices are of a scale that genuinely challenges our wellbeing.

The experimental evidence that we struggle with high-stakes choices is more persuasive evidence of a problem than experiments involving people having difficulty choosing jam. For instance, when confronted with a multitude of retirement plans, people tend to simply split their money evenly between them rather than consider the merits of an appropriate allocation. Tweak the options presented to them and you can markedly change the allocations. When faced with too many choices, they may simply not choose.

Schwartz’s argument about our failures when choosing draws heavily from the heuristics and biases literature, and a relatively solid part of the literature at that: impatience and inter-temporal inconsistency, anchoring and adjustment, availability, framing and so on. But in some ways, this isn’t the heart of Schwartz’s case. People are susceptible to error even when there are few choices, which is the typical scenario in the experiments in which these findings are made. And much of Schwartz’s case would hold even if we were not susceptible to these biases.

Rather, much of the problem that Schwartz identifies comes when we approach choices as maximisers instead of satisficers. Maximisation is the path to disappointment in a world of massive choice, as you will almost certainly not select the best option. Maximisers may not even make a choice as they are not comfortable with compromises and will tend to want to keep looking.

Schwartz and some colleagues created a maximisation scale, where survey respondents rate themselves against statements such as “I never settle for second best.” Those who rated high on the maximisation scale were less happy with life, less optimistic, more depressed and scored higher on regret. Why this correlation? Schwartz believes there is a causal role and that learning how to satisfice could increase happiness.

What makes this result interesting is that maximisers make better decisions when assessed objectively. Is objective or subjective success more important? Schwartz considers that once we have met our basic needs, what matters most is how we feel. Subjective satisfaction is the most important criterion.

I am not convinced that the story of satisfaction from particular choices flows into overall satisfaction. Take a particular decision and satisfice, and perhaps satisfaction for that particular decision is higher. Satisfice for every life decision, and what does your record of accomplishment look like? What is your assessment of satisfaction then? At the end of the book, Schwartz does suggest that we need to “choose when to choose”, and leave maximisation for the important decisions, so it seems he feels maximisation is important on some questions.

I also wonder about the second order effects. If everyone satisficed to achieve higher personal satisfaction, what would we lose? How much do we benefit from the refusal of maximisers such as Steve Jobs or Elon Musk to settle for second best? Would a more satisfied world have less of the amazing accomplishments that give us so much value? Even if there were a personal gain to taking the foot off the pedal, would this involve broader cost?

An interesting thread relating to maximisation concerns opportunity costs. Any economist will tell you that opportunity cost – the value of the best alternative you forgo by choosing an option – is the benchmark against which options should be assessed. But Schwartz argues that assessing opportunity costs has costs in itself. Being forced to make decisions with trade-offs makes people unhappy, and considering the opportunity costs makes those trade-offs salient.

The experimental evidence on considering trade-offs is interesting. For instance, in one experiment a group of doctors was given a case history and a choice between sending the patient to a specialist or trying one other medication first. 75% chose the medication. Give the same choice to another group of doctors, but with the addition of a second medication option, and this time only 50% chose medication. Choosing the specialist is a way of avoiding a decision between the two medications. When there are trade-offs, all options can begin to look unappealing.

Another problem lurking for maximisers is regret, as the only way to avoid regret is to make the best possible decision. People will often avoid decisions that could cause regret, or they aim for the regret-minimising decision (which might be considered a form of satisficing).
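The “regret-minimising decision” has a concrete counterpart in decision theory: the minimax-regret rule. For each option, compute its worst-case shortfall against the best available option in each possible state of the world, then pick the option whose worst-case shortfall is smallest. A minimal sketch, with an entirely hypothetical payoff table (the options and numbers are mine, not Schwartz’s):

```python
# Minimax-regret choice over a hypothetical payoff table.
# Rows are options, columns are possible states of the world.
payoffs = {
    "option A": [10, 2],
    "option B": [6, 6],
    "option C": [1, 9],
}

n_states = 2

# Best achievable payoff in each state.
best_in_state = [max(p[s] for p in payoffs.values()) for s in range(n_states)]

def max_regret(option):
    # Regret in a state = best payoff in that state minus what the
    # option delivers; an option's regret is its worst case over states.
    return max(best_in_state[s] - payoffs[option][s] for s in range(n_states))

# Minimax regret picks the option whose worst-case regret is smallest.
choice = min(payoffs, key=max_regret)
print(choice)  # option B
```

Note that the rule selects the compromise option B, which is never the best in any state but never far from it, which is why minimax regret can look like a form of satisficing.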

There are some problems that arise even without the maximisation mindset.  One is that expectations may increase with choice. Higher expectations create a higher benchmark to achieve satisfaction, and Schwartz argues that these expectations may lead to an inability to cope rather than more control. High expectations create the burden of meeting them. For example, job options are endless. You can live anywhere in the world. The nature of your relationships – such as decisions about marriage – have a flexibility far above that of our recent past. For many, this creates expectations that are unlikely to be met. Schwartz does note the benefits of these options, but the presence of a psychological cost means the consequences are not purely positive.

Then there is unanticipated adaptation. People tend to predict bigger hypothetical changes in their satisfaction than those reported by people who experienced the events. Schwartz draws on the often misinterpreted paper that compares the happiness of lottery winners with para- and quadriplegics. He notes that the long-term difference in happiness between the two groups is smaller than you would expect (although I am not sure what you would expect on a 5-point scale). The problem with unanticipated adaptation is that the cost of search does not get balanced by the subjective benefit that the chooser was anticipating.

So what should we do? Schwartz offers eleven steps to reduce the burden of choosing. Possibly the most important is the need to choose when to choose. Often it is not that any particular choice is problematic (although some experiments suggest they are). Rather, it is the cumulative effect that is most damaging. Schwartz suggests picking those decisions that you want to invest effort in. Choosing when to choose allows adequate time and attention when we really want to choose. I personally do this: a wardrobe full of identical work shirts (although this involved a material initial search cost), a regular lunch spot, and many other routines.

Schwartz also argues that we should consider the opportunity costs of considering opportunity costs. Being aware of all the possible trade-offs, particularly when no option can dominate on all dimensions, is a recipe for disappointment. Schwartz suggests being a satisficer and only considering other options when you need to.

The final recommendation I will note is the need to anticipate adaptation. I personally find this a useful tool. Whenever I am making a new purchase I tend to recall a paragraph in Geoffrey Miller’s Spent, which often changes my view on a purchase:

You anticipate the minor mall adventure: the hunt for the right retail environment playing cohort-appropriate nostalgic pop, the perky submissiveness of sales staff, the quest for the virgin product, the self-restraint you show in resisting frivolous upgrades and accessories, the universe’s warm hug of validation when the debit card machine says “Approved,” and the masterly fulfillment of getting it home, turned on, and doing one’s bidding. The problem is, you’ve experienced all this hundreds of times before with other products, and millions of other people will experience it with the same product. The retail adventure seems unique in prospect but generic in retrospect. In a week, it won’t be worth talking about.

Miller’s point in that paragraph was about the signalling benefits of consumerism, but I find a similar mindset useful when thinking about the adaptation that will occur.

# Gary Klein’s Sources of Power: How People Make Decisions

Summary: An important book describing how many experts make decisions, but with a lingering question mark about how good these decisions actually are.

----

Gary Klein’s Sources of Power: How People Make Decisions is somewhat of a classic, with the version I read being a 20th anniversary edition issued by MIT Press. Klein’s work on expert decision making has reached a broad audience through Malcolm Gladwell’s Blink, and Klein’s adversarial collaboration with Daniel Kahneman (pdf) has given his work additional academic credibility.

However, despite the growing application of behavioural science in public policy and the private sphere, I have rarely seen Klein’s work practically applied to improve decision making. The rare occasions where I see it referenced typically involve an argument that the conditions for the development of expertise do not exist in a particular domain.

This lack of application partly reflects the target of Klein’s research. Sources of Power is an exploration of what Klein calls naturalistic decision making. Rather than studying novices performing artificial tasks in the laboratory, naturalistic decision making involves the study of experienced decision makers performing realistic tasks. Klein’s aim is to document the strengths and capabilities of decision makers in natural environments with high stakes, such as lost lives or millions of dollars down the drain. These environments often involve uncertainty or missing information, and the goals may be unclear. Klein’s focus is therefore on the field and on the decisions of people such as firefighters, nurses, pilots and military personnel. They are typically people who have had many years of experience. They are “experts”.

The exploration of naturalistic decision making contrasts with the heuristics and biases program, which typically focuses on the limitations of decision makers and is the staple fodder of applied behavioural scientists. Using the experimental findings of the heuristics and biases program to tweak decision environments and measure the response across many decision makers (typically through a randomised controlled trial) is tractable. Exploring, modifying and measuring the effect of interventions on the rare, high-stakes decisions of experts, in environments where the goal itself might not even be clear, is far harder.

Is Klein’s work “science”?

The evidence that shapes Sources of Power was typically obtained through case interviews with decision makers and by observing those decision makers in action. There are no experiments. The interviews are coded for analysis to attempt to find patterns in the approaches of the decision makers.

Klein is cognisant of the limitations of this approach. He notes that he gives detailed descriptions of each study so that we can judge the weaknesses of his approach ourselves. This brings his approach closer to what he considers to be a scientific piece of research. Klein writes:

What are the criteria for doing a scientific piece of research? Simply, that the data are collected so that others can repeat the study and that the inquiry depends on evidence and data rather than argument. For work such as ours, replication means that others could collect data the way we have and could also analyze and code the results as we have done.

The primary “weakness” of his approach is the reliance on observational data, not experiments. As Klein suggests, there are plenty of other sciences that have this feature. His approach is closer to anthropology than psychology. But obviously, an approach constrained to the laboratory has its own limitations:

Both the laboratory methods and the field studies have to contend with shortcomings in their research programs. People who study naturalistic decision making must worry about their inability to control many of the conditions in their research. People who use well-controlled laboratory paradigms must worry about whether their findings generalize outside the laboratory.

Klein has faith in stories (the subject of one of the chapters) serving as natural experiments linking a network of causes to their effects. It is a fair point that stories can be used to communicate subtle points of expertise, but using them to reliably identify cause-effect relationships seems a step too far.

Recognition-primed decision making

Klein’s “sources of power” for decision making by experts are intuition, mental simulation, metaphor and storytelling. This is in contrast to what might be considered a more typical decision-making toolkit (the one you are more likely to be taught) of logical thinking, probabilistic analysis and statistics.

Klein’s workhorse model integrating these sources of power is recognition-primed decision making. This is a two-stage process, involving an intuitive recognition of what response is required, followed by mental simulation of the response to see if it will work. Metaphors and storytelling are mental simulation tools. The recognition-primed model involves a blend of intuition and analysis, so is not just sourced from gut feelings.

From the perspective of the decision maker, someone using this model might not consider that they are making a decision. They are not generating options and then evaluating them to determine the best choice.

Instead, they would see their situation as a prototype for which they know the typical course of action right away. As their experience allows them to generate a reasonable response in the first instance, they do not need to think of others. They simply evaluate the first option and, if it is suitable, execute. A decision was made, in that alternative courses of action were available and could have been chosen, but there was no explicit examination across options.

Klein calls this process singular evaluation, as opposed to comparative evaluation. Singular evaluation may involve moving through multiple options, but each is considered on its own merits sequentially until a suitable option is found, with the search stopping at that point.

The result of this process is “satisficing”, a term coming from Herbert Simon. These experts do not optimise. They pick the first option that works.
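Singular evaluation can be sketched as a stopping rule: evaluate options one at a time and act on the first that passes a suitability test, rather than scoring every option and taking the maximum. A minimal sketch of the two strategies, with hypothetical fireground responses and effectiveness scores (none of these come from Klein):

```python
# Sketch of satisficing (singular evaluation) vs maximising
# (comparative evaluation). Options and scores are hypothetical.

def satisfice(options, is_suitable):
    """Return the first option that passes the suitability test."""
    for option in options:
        if is_suitable(option):
            return option  # stop searching as soon as one works
    return None  # no workable option found

def maximise(options, score):
    """Score every option and return the best one."""
    return max(options, key=score)

# Hypothetical responses a fireground commander might consider,
# each with a rough effectiveness score.
responses = [("ventilate roof", 0.7),
             ("interior attack", 0.9),
             ("defensive line", 0.6)]

# The satisficer takes the first response clearing a threshold;
# the maximiser evaluates all three before choosing.
first_workable = satisfice(responses, lambda r: r[1] >= 0.65)
best = maximise(responses, lambda r: r[1])

print(first_workable)  # ('ventilate roof', 0.7)
print(best)            # ('interior attack', 0.9)
```

The satisficer here stops after one evaluation and never discovers the higher-scoring second option, which is exactly the trade Klein describes: search effort is bounded, at the cost of optimality.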

Klein’s examination of various experts found that the recognition-primed decision model was the dominant mode of decision making, despite his initial expectation of comparative evaluation. For instance, fireground commanders used recognition-primed decision making for around 80% of the decisions that Klein’s team examined. Klein also points to similar evidence of decision making by chess grandmasters, who spend little time comparing the strengths and weaknesses of one move to another. Most of their time involves simulating the consequences and rejecting moves.

Mental simulation

Mental simulation involves the expert imagining the situation and transforming it until they can picture it in a different way from the start. Mental simulations are typically not overly elaborate, and generally rely on just a few factors (rarely more than three). The expert runs the simulation and assesses: can it pass an internal evaluation? Mental simulations can sometimes be wrong, but Klein considers them to be fairly accurate.

Klein’s examples of mental simulation were not always convincing. For example, he describes an economist who mentally simulated what the Polish economy would do following interventions to reduce inflation. It is hard to take seriously single examples of such mental simulation hitting the mark when I am aware of so many backfires in this space. And how would expertise in such economic simulations develop? (More on developing expertise below.)

One strength of simulations is that they can be used where traditional decision analytic strategies do not apply. You can use simulations (or stories) if you cannot otherwise remember every piece of information. Klein points to evidence that this is how juries absorb evidence.

One direct use of simulation is the premortem strategy. Imagine that, at some point in the future, your plan has failed and you have to understand why. You can also run simulations through decision scenarios.

Novices versus experts

Expertise has many advantages. Klein notes experts can see the world differently, have more procedures to apply, notice problems more quickly, generate richer mental simulations and have more analogies to draw on. Experts can see things that novices can’t. They can see anomalies, violations of expectancies, the big picture, how things work, additional opportunities and improvisations, future events, small differences, and their own limitations.

Interestingly, while experts tend not to carefully deliberate about the merits of different courses of action, novices need to compare different approaches. Novices are effectively thinking through the problem from scratch. The rational choice method helps us when we lack the expertise to assess a situation.

Another contrast is where effort is expended. Experts spend most of their effort on situation assessment, which largely yields the answer. Novices spend more time on determining the course of action.

One interesting thread concerned what happened when time pressure was put on chess players. Time constraints barely degraded the performance of masters, while they destroyed that of novices. The masters often came up with their best move first, so there was little need for time to test a lot of options.

Developing good decision making

Given the differences between novices and experts, how should novices develop good decision making? Klein suggests this should not be done through training in formal methods of analysis. In fact, this could get in the way of developing expertise. There is also no need to teach the recognition-primed model as it is descriptive: it shows what good decision makers already do. We shouldn’t teach people to think like experts.

Rather, we should teach people to learn like experts. They should engage in deliberate practice, obtain feedback that is accurate and timely, and enrich learning by reviewing prior experience and examining mistakes. The intuition that drives recognition grows out of experience.

Recognition versus analytical methods

Klein argues that recognition strategies are not a substitute for analytical methods, but an improvement. Analytical methods are the fallback for those without experience.

Klein sees a range of environments where recognition strategies will be the superior option. These include the presence of time pressure, when the decision maker is experienced in the domain, when conditions are dynamic (meaning effort can be rendered useless if conditions shift), and when goals are ill-defined (making it hard to develop evaluation criteria). Comparative evaluation is more useful where people have to justify their choice, where it is required for conflict resolution, where you are trying to optimise (as opposed to finding a merely workable option), and where the decision is computationally complex (e.g. constructing an investment portfolio).

This makes it hard to use a rigorous analytical approach in many natural settings. Rational, linear approaches run into problems when the goal is shifting or ill-defined.

Diagnosing poor decisions

I previously posted some of Klein’s views on the heuristics and biases approach to assessing decision quality. Needless to say, Klein is sceptical that poor decisions are largely due to faulty reasoning. More effort should be expended in finding the sources of poor decisions, rather than blaming the operator.

Klein describes a review of a sample of 25 decisions with poor outcomes (from the 600 he had available) to assess what went wrong. Sixteen of the poor outcomes were due to lack of experience, such as someone not realising that the construction of the building on fire was problematic. The second most common issue was lack of information. The third most common involved noticing but explaining away problems during mental simulation, possibly involving bias.

Conditions for expertise

The conditions for developing the expertise needed for effective recognition-primed decision making are explored in depth in Klein’s article with Daniel Kahneman, Conditions for Intuitive Expertise: A Failure to Disagree (pdf). However, Klein does examine this area to some degree in Sources of Power.

Klein notes that it is one thing to gain experience, and another to turn that experience into expertise. It is often difficult to see cause and effect relationships. There is typically a delay between the two. It is difficult to disentangle luck and skill. Drawing on work by Jim Shanteau, Klein also notes that expertise is hard to develop when the domain is dynamic, when we need to predict human behaviour, when there is less chance for feedback, when there is not enough repetition to get a sense of typicality, or when there are fewer trials. Funnily enough, this description seems to align somewhat with many naturalistic decision making environments.

Despite these barriers, Klein believes that it is possible to develop expertise in some areas, such as fighting fires, caring for hospitalised infants or flying planes. Less convincingly (given some research in the area), he also references the fine discrimination of wine tasters.

Possibly my biggest criticism of Klein’s book relates to this final point, as he provides little evidence for the conversion of experience into expertise beyond the observation that in many of these domains novices are completely lost. Is the best benchmark a comparison with a novice who has no idea, or is it better to look at, say, a simple algorithm, statistical rule, or someone with basic training?