Opposing biases

From the preface of one printing of Philip Tetlock’s Expert Political Judgment (hat tip to Robert Wiblin, who quoted this passage in the introduction to an 80,000 Hours podcast episode):

The experts surest of their big-picture grasp of the deep drivers of history, the Isaiah Berlin–style “hedgehogs,” performed worse than their more diffident colleagues, or “foxes,” who stuck closer to the data at hand and saw merit in clashing schools of thought. That differential was particularly pronounced for long-range forecasts inside experts’ domains of expertise.

Hedgehogs were not always the worst forecasters. Tempting though it is to mock their belief-system defenses for their often too-bold forecasts—like “off-on-timing” (the outcome I predicted hasn’t happened yet, but it will) or the close-call counterfactual (the outcome I predicted would have happened but for a fluky exogenous shock)—some of these defenses proved quite defensible. And, though less opinionated, foxes were not always the best forecasters. Some were so open to alternative scenarios (in chapter 7) that their probability estimates of exclusive and exhaustive sets of possible futures summed to well over 1.0. Good judgment requires balancing opposing biases. Over-confidence and belief perseverance may be the more common errors in human judgment but we set the stage for over-correction if we focus solely on these errors and ignore the mirror image mistakes, of under-confidence and excessive volatility.

I can see why this idea of opposing biases makes correction of “biases” difficult.

But before we get to the correction of biases, this concept of opposing biases points at a major difficulty with behavioural analyses of decision making. When you have, say, both loss aversion and overconfidence in your bag of explanations for poor decision making, you can explain almost anything after the fact. The gamble turned out poorly? Overconfidence. Didn’t take the gamble? Loss aversion.

Recently I’ve heard a lot of people talking of action bias. There is also a status quo bias. Again, a pair of biases with which we can explain anything.

Hypotheticals versus the real world: The trolley problem

[Image: the trolley problem, by McGeddon]

Daniel Engber writes:

Picture the following situation: You are taking a freshman-level philosophy class in college, and your professor has just asked you to imagine a runaway trolley barreling down a track toward a group of five people. The only way to save them from being killed, the professor says, is to hit a switch that will turn the trolley onto an alternate set of tracks where it will kill one person instead of five. Now you must decide: Would the mulling over of this dilemma enlighten you in any way?

I ask because the trolley-problem thought experiment described above—and its standard culminating question, Would it be morally permissible for you to hit the switch?—has in recent years become a mainstay of research in a subfield of psychology. …

For all this method’s enduring popularity, few have bothered to examine how it might relate to real-life moral judgments. Would your answers to a set of trolley hypotheticals correspond with what you’d do if, say, a deadly train were really coming down the tracks, and you really did have the means to change its course? In November 2016, though, Dries Bostyn, a graduate student in social psychology at the University of Ghent, ran what may have been the first-ever real-life version of a trolley-problem study in the lab. In place of railroad tracks and human victims, he used an electroshock machine and a colony of mice—and the question was no longer hypothetical: Would students press a button to zap a living, breathing mouse, so as to spare five other living, breathing mice from feeling pain?

“I think almost everyone within this field has considered running this experiment in real life, but for some reason no one ever got around to it,” Bostyn says. He published his own results last month: People’s thoughts about imaginary trolleys and other sacrificial hypotheticals did not predict their actions with the mice, he found.

On what this finding means for the trolley problem:

If people’s answers to a trolley-type dilemma don’t match up exactly with their behaviors in a real-life (or realistic) version of the same, does that mean trolleyology itself has been derailed? The answer to that question depends on how you understood the purpose of those hypotheticals to begin with. Sure, they might not predict real-world actions. But perhaps they’re still useful for understanding real-world reactions. After all, the laboratory game mirrors a common experience: one in which we hear or read about a thing that someone did—a policy that she enacted, perhaps, or a crime that she committed—and then decide whether her behavior was ethical. If trolley problems can illuminate the mental process behind reading a narrative and then making a moral judgment then perhaps we shouldn’t care so much about what happened when this guy in Belgium pretended to be electrocuting mice.

[Joshua Greene] says, Bostyn’s data aren’t grounds for saying that responses to trolley hypotheticals are useless or inane. After all, the mouse study did find that people’s answers to the hypotheticals predicted their actual levels of discomfort. Even if someone’s feeling of discomfort may not always translate to real-world behavior, that doesn’t mean that it’s irrelevant to moral judgment. “The more sensible conclusion,” Greene added over email, “is that we are looking at several weakly connected dots in a complex chain with multiple factors at work.”

Bostyn’s mice aside, there are other reasons to be wary of the trolley hypotheticals. For one thing, a recent international project to reproduce 40 major studies in the field of experimental philosophy included stabs at two of Greene’s highly cited trolley-problem studies. Both failed to replicate.

I recommend reading the whole article.

Explaining the hot-hand fallacy fallacy

Since first coming across Joshua Miller and Adam Sanjurjo’s great work demonstrating that the hot-hand fallacy was itself a fallacy, I’ve been looking for a good way to explain the logic behind their argument simply. I haven’t found something that completely hits the mark yet, but the following explanation from Miller and Sanjurjo in The Conversation might be useful to some:

In the landmark 1985 paper “The hot hand in basketball: On the misperception of random sequences,” psychologists Thomas Gilovich, Robert Vallone and Amos Tversky (GVT, for short) found that when studying basketball shooting data, the sequences of makes and misses are indistinguishable from the sequences of heads and tails one would expect to see from flipping a coin repeatedly.

Just as a gambler will get an occasional streak when flipping a coin, a basketball player will produce an occasional streak when shooting the ball. GVT concluded that the hot hand is a “cognitive illusion”; people’s tendency to detect patterns in randomness, to see perfectly typical streaks as atypical, led them to believe in an illusory hot hand.

In what turns out to be an ironic twist, we’ve recently found this consensus view rests on a subtle – but crucial – misconception regarding the behavior of random sequences. In GVT’s critical test of hot hand shooting conducted on the Cornell University basketball team, they examined whether players shot better when on a streak of hits than when on a streak of misses. In this intuitive test, players’ field goal percentages were not markedly greater after streaks of makes than after streaks of misses.

GVT made the implicit assumption that the pattern they observed from the Cornell shooters is what you would expect to see if each player’s sequence of 100 shot outcomes were determined by coin flips. That is, the percentage of heads should be similar for the flips that follow streaks of heads, and the flips that follow streaks of tails.

Our surprising finding is that this appealing intuition is incorrect. For example, imagine flipping a coin 100 times and then collecting all the flips in which the preceding three flips are heads. While one would intuitively expect that the percentage of heads on these flips would be 50 percent, instead, it’s less.

Here’s why.

Suppose a researcher looks at the data from a sequence of 100 coin flips, collects all the flips for which the previous three flips are heads and inspects one of these flips. To visualize this, imagine the researcher taking these collected flips, putting them in a bucket and choosing one at random. The chance the chosen flip is a heads – equal to the percentage of heads in the bucket – we claim is less than 50 percent.

Caption: The percentage of heads on the flips that follow a streak of three heads can be viewed as the chance of choosing heads from a bucket consisting of all the flips that follow a streak of three heads. Miller and Sanjurjo, CC BY-ND

To see this, let’s say the researcher happens to choose flip 42 from the bucket. Now it’s true that if the researcher were to inspect flip 42 before examining the sequence, then the chance of it being heads would be exactly 50/50, as we intuitively expect. But the researcher looked at the sequence first, and collected flip 42 because it was one of the flips for which the previous three flips were heads. Why does this make it more likely that flip 42 would be tails rather than a heads?

Caption: Why tails is more likely when choosing a flip from the bucket. Miller and Sanjurjo, CC BY-ND

If flip 42 were heads, then flips 39, 40, 41 and 42 would be HHHH. This would mean that flip 43 would also follow three heads, and the researcher could have chosen flip 43 rather than flip 42 (but didn’t). If flip 42 were tails, then flips 39 through 42 would be HHHT, and the researcher would be restricted from choosing flip 43 (or 44, or 45). This implies that in the world in which flip 42 is tails (HHHT) flip 42 is more likely to be chosen as there are (on average) fewer eligible flips in the sequence from which to choose than in the world in which flip 42 is heads (HHHH).

This reasoning holds for any flip the researcher might choose from the bucket (unless it happens to be the final flip of the sequence). The world HHHT, in which the researcher has fewer eligible flips besides the chosen flip, restricts his choice more than world HHHH, and makes him more likely to choose the flip that he chose. This makes world HHHT more likely, and consequentially makes tails more likely than heads on the chosen flip.

In other words, selecting which part of the data to analyze based on information regarding where streaks are located within the data, restricts your choice, and changes the odds.
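To convince yourself of this selection effect, a quick simulation helps. This is just my sketch, not Miller and Sanjurjo’s code: generate many 100-flip sequences, and for each sequence take the share of heads among the flips that immediately follow three heads, then average those shares across sequences.

```python
import random

def average_share_after_streak(n_flips=100, streak=3, n_sequences=100_000):
    """Average, across sequences, of the share of heads on flips that
    immediately follow a streak of `streak` heads."""
    shares = []
    for _ in range(n_sequences):
        flips = [random.random() < 0.5 for _ in range(n_flips)]  # True = heads
        followers = [flips[i] for i in range(streak, n_flips)
                     if all(flips[i - streak:i])]
        if followers:  # sequences with no flip following three heads are dropped
            shares.append(sum(followers) / len(followers))
    return sum(shares) / len(shares)

print(average_share_after_streak())  # noticeably below 0.5 (roughly 0.46 for these settings)
```

Setting streak=1 and n_flips=4 reproduces their simplest example, where the expected share of heads following a heads is only about 0.4.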

There are a few other pieces in the article that make it worth reading, but here is an important punchline to the research:

Because of the surprising bias we discovered, their finding of only a negligibly higher field goal percentage for shots following a streak of makes (three percentage points), was, if you do the calculation, actually 11 percentage points higher than one would expect from a coin flip!

An 11 percentage point relative boost in shooting when on a hit-streak is not negligible. In fact, it is roughly equal to the difference in field goal percentage between the average and the very best 3-point shooter in the NBA. Thus, in contrast with what was originally found, GVT’s data reveal a substantial, and statistically significant, hot hand effect.

Wealth and genes

Go back ten years, and most published attempts to link specific genetic variants to a trait were false. These candidate-gene studies were your classic, yet typically rubbish, “gene for X” paper.

The proliferation of poor papers was in part because the studies were too small to discover the effects they were looking for (see here for some good videos describing the problems). As has become increasingly evident, most human traits are affected by thousands of genes, each with tiny effects. With a small sample – many of the early candidate-gene studies involved hundreds of people – all you can discover is noise.

But there was some optimism that robust links would eventually be drawn. Get genetic samples from a large enough population (say, hundreds of thousands), and you can detect these weak genetic effects. You can also replicate the findings across multiple samples to ensure the results are robust.

In recent years that promise has started to be realised through genome-wide association studies (GWAS). Although more than 99% of the human genome is common across people, there are certain locations at which the DNA base pair can differ. These locations are known as single-nucleotide polymorphisms (SNPs). A GWAS involves looking across all of the sampled SNPs (typically one million or so SNPs for each person) and estimating the effect of each SNP against an outcome of interest. Those SNPs that meet certain statistical thresholds are treated as positive findings.

A steady flow of GWAS papers is now being published, linking SNPs with traits such as cognitive function and outcomes such as educational attainment. A typical study title is “Study of 300,486 individuals identifies 148 independent genetic loci influencing general cognitive function”.

One innovation from this work is the use of “polygenic scores”. The effect of all measured SNPs from a GWAS is used to produce a single score for a person. That score is used to predict their trait or outcome. Polygenic scores are used regularly in animal breeding, and are now starting to be used to look at human outcomes, including those of interest to economists.
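As a rough sketch of the mechanics (simulated data only; the sample sizes and effect magnitudes below are illustrative, not from any real study), a GWAS estimates each SNP’s effect one at a time, and a polygenic score is then just a weighted sum of a person’s genotypes using those estimated effects:

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_snps = 5_000, 1_000

# Genotypes coded as 0, 1 or 2 copies of the effect allele at each SNP.
genotypes = rng.binomial(2, 0.3, size=(n_people, n_snps)).astype(float)
true_effects = rng.normal(0, 0.02, size=n_snps)           # many tiny effects
outcome = genotypes @ true_effects + rng.normal(size=n_people)

# GWAS step: simple regression of the outcome on each SNP separately.
g_centred = genotypes - genotypes.mean(axis=0)
y_centred = outcome - outcome.mean()
estimated_effects = g_centred.T @ y_centred / (g_centred ** 2).sum(axis=0)

# Polygenic score: weighted sum of genotypes, weights from the GWAS step.
polygenic_scores = genotypes @ estimated_effects
```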

The latest example of this is an examination of the link between wealth and a polygenic score for education. An extract from the abstract of the NBER working paper by Daniel Barth, Nicholas Papageorge and Kevin Thom states:

We show that genetic endowments linked to educational attainment strongly and robustly predict wealth at retirement. The estimated relationship is not fully explained by flexibly controlling for education and labor income. … The associations we report provide preliminary evidence that genetic endowments related to human capital accumulation are associated with wealth not only through educational attainment and labor income, but also through a facility with complex financial decision-making.

(If you can’t access the NBER paper, here is an ungated pdf of a slightly earlier working paper.)

In more detail:

We first establish a robust relationship between household wealth in retirement and the average household polygenic score for educational attainment. A one-standard-deviation increase in the score is associated with a 33.1 percent increase in household wealth (approximately $144,000 in 2010 dollars). … Measures of educational attainment, including years of education and completed degrees, explain over half of this relationship. Using detailed income data from the Social Security Administration (SSA) as well as self-reported labor earnings from the HRS, we find that labor income can explain only a small part of the gene-wealth gradient that remains after controlling for education. These results indicate that while education and labor market earnings are important sources of variation in household wealth, they explain only a portion of the relationship between genetic endowments and wealth.

The finding that the genes that affect education also affect other outcomes – in this case wealth – is no surprise. Whether these genes relate to, say, cognitive ability or conscientiousness, it is easy to imagine that they affect all of education, workplace performance, savings behaviour and a host of other factors that would in turn influence wealth.

To tease this out, I would be interested in seeing studies that examine the predictive power of polygenic scores for more fundamental characteristics, such as IQ and the big five personality traits. These would likely capture a good deal of the variation in outcomes being attributed to education. You might also look at some fundamental economic traits, such as risk or time preferences (to the extent these are not just reflections of IQ and the big five). If you know these more fundamental traits, most other behaviours are simply combinations of them.

This was a lesson learnt from research on heritability, where you could find studies calculating the heritability of everything from opinions on gun control to leisure interests. Although this had some value in that it led to the first law of behavioural genetics, namely that all human behavioural traits are heritable, a lot of these studies were simply capturing manifestations of differences in IQ and the big five. (It also benefited academics by padding their CVs.)

Moving on, what does analysis using polygenic scores add to other work?

Our work contributes to an existing literature on endowments, economic traits, and household wealth. One strand of this work examines how various measures of “ability,” such as IQ or cognitive test scores, predict household wealth and similar outcomes … However, parental investments and other environmental factors can directly affect test performance, making it difficult to separate the effects of endowed traits from endogenous human capital investments. A second strand of this literature focuses on genetic endowments, and seeks to estimate their collective importance using twin studies. Twin studies have shown that genetics play a non-trivial role in explaining financial behavior such as savings and portfolio choices … However, while twin studies can decompose the variance of an outcome into genetic and non-genetic contributions, they do not identify which particular markers influence economic outcomes. Moreover, it is typically impossible to apply twin methods to large and nationally representative longitudinal studies, such as the HRS, which offer some of the richest data on household wealth and related behavioral traits.

Twin studies are fantastic at teasing out the role of genetics, but if you want to take genetic samples from a new population and use the genetic markers as controls in your analysis or to predict outcomes, you need something of the nature of these polygenic scores.

We note two important differences between the EA score and a measure like IQ that make it valuable to study polygenic scores. First, a polygenic score like the EA score can overcome some interpretational challenges related to IQ and other cognitive test scores. Environmental factors have been found to influence intelligence test results and to moderate genetic influences on IQ (Tucker-Drob and Bates, 2015). It is true that differences in the EA score may reflect differences in environments or investments because parents with high EA scores may also be more likely to invest in their children. However, the EA score is fixed at conception, which means that post-birth investments cannot causally change the value of the score. A measure like IQ suffers from both of these interpretational challenges.

The interpretational challenge with IQ doesn’t need to be viewed in isolation. Between twin and adoption studies and these studies, you can start to tease out how much a measure like IQ is practically (as opposed to theoretically) hampered by those challenges. An even better option might be an IQ polygenic score.

The paper ends with a warning that we know should have been attached to many papers for decades now, but this time with an increasingly tangible solution.

Economic research using information on genetic endowments is useful for understanding what has heretofore been a form of unobserved heterogeneity that persists across generations, since parents provide genetic material for their children. Studies that ignore this type of heterogeneity when studying the intergenerational persistence of economic outcomes, such as income or wealth, could place too much weight on other mechanisms such as attained education or direct monetary transfers between parents and children. The use of observed genetic information helps economists to develop a more accurate and complete understanding of inequality across generations.

Examining intergenerational outcomes while ignoring genetic effects is generally a waste of time.

Is the marshmallow test just a measure of affluence?

I argued in a recent post that the conceptual replication of the marshmallow test was largely successful. A single data point – whether someone can wait for a larger reward – predicts future achievement.

That replication has generated a lot of commentary. Most concerns the extension to the original study, an examination of whether the marshmallow test retained its predictive power if they accounted for factors such as the parent and child’s background (including socioeconomic status), home environment, and measures of the child’s behavioural and cognitive development.

The result was that these “controls” eliminated the predictive power of the marshmallow test. If you know those other variables, the marshmallow test does not give you any further information.

As I said before, this is hardly surprising. They used around 30 controls – 14 for child and parent background, 9 for the quality of the home environment, 5 for childhood achievement and 2 for behavioural characteristics. It is likely that many of them capture the features that give the marshmallow test its predictive power.
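As a toy illustration of why this happens (entirely simulated data, nothing to do with the actual study): if a predictor works mainly because it tracks some underlying characteristic, adding that characteristic as a control strips the predictor of its apparent power.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 900

background = rng.normal(size=n)                      # stands in for the ~30 controls
delay_time = 0.6 * background + rng.normal(size=n)   # marshmallow wait, partly driven by background
achievement = 0.8 * background + rng.normal(size=n)  # later achievement, also driven by background

no_controls = sm.OLS(achievement, sm.add_constant(delay_time)).fit()
with_controls = sm.OLS(achievement,
                       sm.add_constant(np.column_stack([delay_time, background]))).fit()

print(no_controls.params[1])    # sizeable coefficient on delay time
print(with_controls.params[1])  # close to zero once background is controlled for
```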

So can we draw any conclusions from the inclusion of those particular controls? One of the most circulated interpretations is by Jessica Calarco in the Atlantic, titled Why Rich Kids Are So Good at the Marshmallow Test. The subtitle is “Affluence—not willpower—seems to be what’s behind some kids’ capacity to delay gratification”. Calarco writes:

Ultimately, the new study finds limited support for the idea that being able to delay gratification leads to better outcomes. Instead, it suggests that the capacity to hold out for a second marshmallow is shaped in large part by a child’s social and economic background—and, in turn, that that background, not the ability to delay gratification, is what’s behind kids’ long-term success.

This conclusion is a step too far. For a start, controlling for child background and home environment (slightly more than) halved the predictive power of the marshmallow test. It did not eliminate it. It was only on including additional behavioural and cognitive controls – characteristics of the children themselves – that the predictive power of the marshmallow test was eliminated.

But the more interesting question is one of causation. Are the social and economic characteristics themselves the cause of later achievement?

One story we could tell is that the social and economic characteristics are simply proxies for parental characteristics, which are genetically transmitted to the children. Heritability of traits such as IQ tends to increase with age, so parental characteristics would likely have predictive power in addition to that of the four-year-old’s cognitive and behavioural skills.

On the flipside, maybe the behavioural and cognitive characteristics of the child are simply reflections of the developmental environment that the child has been exposed to so far. This is effectively Calarco’s interpretation.

Which is the right interpretation? This study doesn’t help answer this question. It was never designed to. As lead study author Tyler Watts tweeted in response to the Atlantic article:

If you want to know whether social and economic background causes future success, you should look elsewhere. (I’d start with twin and adoption studies.)

That said, there were a couple of interesting elements to this new study. While the marshmallow test was predictive of future achievement at age 15, there was no association between the marshmallow test and two composite measures of behaviour at 15. The composite behaviour measures were for internalising behaviours (such as depression) and externalising behaviours (such as anti-social behaviours). This inability to predict future behavioural problems hints that the marshmallow test may obtain its predictive power through the cognitive rather than the behavioural channel.

This possibility is also suggested by the correlation between the marshmallow test and the Applied Problems test, which requires the children to count and solve simple addition problems.

[T]he marshmallow test had the strongest correlation with the Applied Problems subtest of the WJ-R, r(916) = .37, p < .001; and correlations with measures of attention, impulsivity, and self-control were lower in magnitude (rs = .22–.30, p < .001). Although these correlational results were far from conclusive, they suggest that the marshmallow test should not be thought of as a mere behavioral proxy for self-control, as the measure clearly relates strongly to basic measures of cognitive capacity.

Not conclusive, but it points to some areas worth further exploring.

PS: After writing this post (I usually post with a delay of between a week and three months), Robert VerBruggen posted a piece at the Institute for Family Studies, making many of the same points. I would have skipped writing the new content – and simply quoted VerBruggen – if I’d seen it earlier. Inside Higher Ed also has a good write-up by Greg Toppo, including this quote from Walter Mischel:

[A] child’s ability to wait in the ‘marshmallow test’ situation reflects that child’s ability to engage various cognitive and emotion-regulation strategies and skills that make the waiting situation less frustrating. Therefore, it is expected and predictable, as the Watts paper shows, that once these cognitive and emotion-regulation skills, which are the skills that are essential for waiting, are statistically ‘controlled out,’ the correlation is indeed diminished.

Also from Mischel:

Unfortunately, our 1990 paper’s own cautions to resist sweeping over-generalizations, and the volume of research exploring the conditions and skills underlying the ability to wait, have been put aside for more exciting but very misleading headline stories over many years.

PPS: In another thread to her article, Calarco draws on the concept of scarcity:

There’s plenty of other research that sheds further light on the class dimension of the marshmallow test. The Harvard economist Sendhil Mullainathan and the Princeton behavioral scientist Eldar Shafir wrote a book in 2013, Scarcity: Why Having Too Little Means So Much, that detailed how poverty can lead people to opt for short-term rather than long-term rewards; the state of not having enough can change the way people think about what’s available now. In other words, a second marshmallow seems irrelevant when a child has reason to believe that the first one might vanish.

I’ve written about scarcity previously in my review of Mullainathan and Shafir’s book. I’m not sure the work on scarcity sheds light on the marshmallow test results. The concept behind scarcity is that poverty-related concerns consume mental bandwidth that isn’t then available for other tasks. A typical experiment to demonstrate scarcity involves priming the experimental subjects with a problem before testing their IQ. When the problem has a large financial cost (e.g. expensive car repairs), the performance of low-income people plunges. Focusing their attention on their lack of resources consumes mental bandwidth. On applying this to the marshmallow test, I haven’t seen much evidence that four-year-olds are struggling with this problem.

(As an aside, scarcity seems to be the catchall response to discussions of IQ and achievement, a bit like epigenetics is the response to any discussion of genetics.)

Given Calarco’s willingness to bundle the marshmallow test replication into the replication crisis (calling it a “failed replication”), it’s worth also thinking about scarcity in that light. If I had to predict which results would not survive a pre-registered replication, the experiments in the original scarcity paper are right up there. They involve priming, the poster-child for failed replications. The size of the effect, 13 IQ points from a simple prime, fails the “effect is too large” heuristic.

Then there is a study that looked at low-income households before and after payday, which found no change in cognitive function either side of that day (you could consider this a “conceptual replication”). In addition, for a while now I have been hearing rumours of file drawers containing failed attempts to elicit the scarcity mindset. I was able to find one pre-registered direct replication, but it doesn’t seem the result has been published. (Sitting in a file drawer somewhere?)

There was even debate around whether the original scarcity paper (pdf) showed the claimed result. Reanalysis of the data without dichotomising income (the original analysis split income into two bands rather than treating it as a continuous variable) eliminated the effect. The original authors then managed to resurrect the effect (pdf) by combining the data from three experiments, but once you are at this point, you have well and truly entered the garden of forking paths.

Does a moral reminder decrease cheating?

In The (Honest) Truth About Dishonesty, Dan Ariely describes an experiment to determine how much people cheat:

[P]articipants entered a room where they sat in chairs with small desks attached (the typical exam-style setup). Next, each participant received a sheet of paper containing a series of twenty different matrices … and were told that their task was to find in each of these matrices two numbers that added up to 10 …

We also told them that they had five minutes to solve as many of the twenty matrices as possible and that they would get paid 50 cents per correct answer (an amount that varied depending on the experiment). Once the experimenter said, “Begin!” the participants turned the page over and started solving these simple math problems as quickly as they could. …

Here’s an example matrix:

[Image: example matrix from the task]

This was how the experiment started for all the participants, but what happened at the end of the five minutes was different depending on the particular condition.

Imagine that you are in the control condition… You walk up to the experimenter’s desk and hand her your solutions. After checking your answers, the experimenter smiles approvingly. “Four solved,” she says and then counts out your earnings. … (The scores in this control condition gave us the actual level of performance on this task.)

Now imagine you are in another setup, called the shredder condition, in which you have the opportunity to cheat. This condition is similar to the control condition, except that after the five minutes are up the experimenter tells you, “Now that you’ve finished, count the number of correct answers, put your worksheet through the shredder at the back of the room, and then come to the front of the room and tell me how many matrices you solved correctly.” …

If you were a participant in the shredder condition, what would you do? Would you cheat? And if so, by how much?

With the results for both of these conditions, we could compare the performance in the control condition, in which cheating was impossible, to the reported performance in the shredder condition, in which cheating was possible. If the scores were the same, we would conclude that no cheating had occurred. But if we saw that, statistically speaking, people performed “better” in the shredder condition, then we could conclude that our participants overreported their performance (cheated) when they had the opportunity to shred the evidence. …

Perhaps somewhat unsurprisingly, we found that given the opportunity, many people did fudge their score. In the control condition, participants solved on average four out of the twenty matrices. Participants in the shredder condition claimed to have solved an average of six—two more than in the control condition. And this overall increase did not result from a few individuals who claimed to solve a lot more matrices, but from lots of people who cheated by just a little bit.

The question then becomes how to reduce cheating. Ariely describes one idea:

[O]ur memory and awareness of moral codes (such as the Ten Commandments) might have an effect on how we view our own behavior.

… We took a group of 450 participants and split them into two groups. We asked half of them to try to recall the Ten Commandments and then tempted them to cheat on our matrix task. We asked the other half to try to recall ten books they had read in high school before setting them loose on the matrices and the opportunity to cheat. Among the group who recalled the ten books, we saw the typical widespread but moderate cheating. On the other hand, in the group that was asked to recall the Ten Commandments, we observed no cheating whatsoever. And that was despite the fact that no one in the group was able to recall all ten.

This result was very intriguing. It seemed that merely trying to recall moral standards was enough to improve moral behavior.

This experiment comes from a paper co-authored by Nina Mazar, On Amir and Ariely (pdf). (I’m not sure where the 450 students in the book come from – the paper reports 229 students for this experiment. A later experiment in the paper uses 450. There were also a few differences between this experiment and the general cheating story above. People took their answers home for “recycling”, rather than shredding them, and payment was $10 per correct matrix to two randomly selected students.)

This experiment has now been subject to a multi-lab replication by Verschuere and friends. The abstract of the paper:

The self-concept maintenance theory holds that many people will cheat in order to maximize self-profit, but only to the extent that they can do so while maintaining a positive self-concept. Mazar, Amir, and Ariely (2008; Experiment 1) gave participants an opportunity and incentive to cheat on a problem-solving task. Prior to that task, participants either recalled the 10 Commandments (a moral reminder) or recalled 10 books they had read in high school (a neutral task). Consistent with the self-concept maintenance theory, when given the opportunity to cheat, participants given the moral reminder priming task reported solving 1.45 fewer matrices than those given a neutral prime (Cohen’s d = 0.48); moral reminders reduced cheating. The Mazar et al. (2008) paper is among the most cited papers in deception research, but it has not been replicated directly. This Registered Replication Report describes the aggregated result of 25 direct replications (total n = 5786), all of which followed the same pre-registered protocol. In the primary meta-analysis (19 replications, total n = 4674), participants who were given an opportunity to cheat reported solving 0.11 more matrices if they were given a moral reminder than if they were given a neutral reminder (95% CI: -0.09; 0.31). This small effect was numerically in the opposite direction of the original study (Cohen’s d = -0.04).

And here’s a chart demonstrating the result (Figure 2):

[Figure 2 from the Registered Replication Report]

Multi-lab experiments like this are fantastic. There’s little ambiguity about the result.

That said, there is a response by Amir, Mazar and Ariely. Lots of fluff about context. No suggestion of “maybe there’s nothing here”.

The marshmallow test held up OK

A common theme I see on my weekly visits to Twitter is the hordes piling onto the latest psychological study or effect that hasn’t survived a replication or meta-analysis. More often than not, the study deserves the criticism. But recently, the hordes have occasionally swung into action too quickly.

One series of tweets suggested that loss aversion had entered the replication crisis. A better description of the two papers that triggered the tweets is that they were the latest salvos in a decade-old debate about the interpretation of many loss aversion experiments. They have nothing to do with replication. (If you’re interested, the papers are here (ungated) and here. I have sympathy with parts of the arguments, and some other critiques of the concept of loss aversion. I’ll discuss these papers in a later post.)

Another set of tweets concerned a conceptual replication of the marshmallow test. Many of the comments suggested that the replication was a failure, and that the original study was rubbish. My view is that the original work has actually held up OK, although the interpretation of the result and some of the story-telling that followed the study is challenged.

First, to the original paper by Shoda, Mischel, and Peake, published in 1990 (pdf). In that study, four-year-old children were placed at a table with a bell and a pair of “reward objects”. The pair of reward objects might be one marshmallow and two marshmallows, or one pretzel and two pretzels, and so on.

The children were told that the experimenter was going to leave the room, and that if they waited until the experimenter came back, they could have their preferred reward (the two marshmallows). Otherwise, they could call the experimenter back earlier by ringing the bell, but in that case they could only have their less preferred reward (one marshmallow). (Could a truly impatient child just not ring the bell and eat all three marshmallows?) The time until the children rang the bell, up to a maximum of 15 to 20 minutes, was recorded.

The headline result was that the time to ring the bell was predictive of future achievement in the SAT. Those who delayed their gratification had higher achievement. The time waited correlated 0.57 with SAT math scores and 0.42 with SAT verbal scores.

The new paper discusses a “conceptual replication”. It doesn’t copy the experimental design and replicate it precisely, but relies on a similar experimental design and a measure of academic achievement based on a composite of age-15 reading and math scores.

The main point to emerge from this replication is that there is an association between the delay in gratification and academic achievement, but the correlation (0.28) is only half to two-thirds of that found in the original study.

Anyone familiar with the replication literature will find this reduction in correlation unsurprising. One of the headline findings from the Reproducibility Project was that effect sizes in replications were around half of those in the original studies. Small sample sizes (low experimental power) also tend to result in Type M errors, whereby the effect size is exaggerated. (The original study only had 35 children in the baseline condition for which they were able to get the later academic results.)
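To see the Type M effect, here is a quick simulation (my own sketch, assuming a true correlation of 0.28 and the original sample size of 35): among small studies that happen to clear the significance bar, the estimated correlation runs well above the truth.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_r, n, n_studies = 0.28, 35, 20_000

significant_estimates = []
for _ in range(n_studies):
    x = rng.normal(size=n)
    y = true_r * x + np.sqrt(1 - true_r ** 2) * rng.normal(size=n)
    r, p = stats.pearsonr(x, y)
    if p < 0.05:
        significant_estimates.append(r)

# The significant estimates average well above the true correlation of 0.28.
print(np.mean(significant_estimates))
```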

Shoda and friends recognised this possibility (although perhaps not the reasons for it). As they wrote in the original paper:

[G]iven the smallness of the sample, the obtained coefficients could very well exaggerate the magnitude of the true association. For example, in the diagnostic condition, the 95% confidence interval for the correlation of preschool delay time with SAT verbal score ranges from .10 to .66, and with SAT quantitative score, the confidence interval ranges from .29 to .76. The value and importance given to SAT scores in our culture make caution essential before generalizing from the present study; at the very least, further replications with other populations, cohorts, and testing conditions seem necessary next steps.

The differences between the experiments could also be behind the difference in size of correlation. Each study used different measures of achievement. The marshmallow test in the replication had a maximum wait of only 7 minutes, compared to 15 to 20 minutes in the original (although most of the predictive power in the new study was found to be in the first 20 seconds). The replication created categories for time waited (e.g. 0 to 20 seconds, 20 seconds to 2 minutes, and so on), rather than using time as a continuous variable. It also focused on children with parents who did not have a college education – too many of the children with college-educated parents waited the full seven minutes. The original study drew its sample from the Stanford community.

Given the original authors’ notes about effect size, and the differences in study design, the original findings have held up rather well. For a simple diagnostic, the marshmallow test still has a surprising amount of predictive power. Delay of gratification at age 4 predicts later achievement. Some of the write-ups of this new work have stated that the marshmallow test may not be as strong a predictor of future outcomes as previously believed, but how strong did you actually believe it to be in the first place?

The other headline from the replication is that the predictive ability of the marshmallow test disappears with controls. That is, if you account for the children’s socioeconomic status, parental characteristics and a set of measures of cognitive and behavioural development, the marshmallow test does not provide any further information about that future achievement. It’s no surprise that controls of this nature do this. It simply suggests that the controls are better predictors. The original claim was not that the marshmallow test was the best or only predictor.

What is called into question are the implications that have been drawn from the marshmallow test studies. Shoda and friends suggested that the predictive power of the test might be related to the meta-cognitive strategies that the children employed. For instance, successful children might divert themselves so that they don’t just sit and stare at the marshmallows. If that is the case, we could teach children these strategies, and they might then be better able to delay gratification and have higher achievement in life. This has been a common theme of discussion of the marshmallow test for the last 30 years.

In the replication data, most of the predictive power of the marshmallow test was found to lie in the first 20 seconds. There was not a lot of difference between the kids who waited more than 20 seconds and those that waited the full seven minutes. It is questionable whether meta-cognitive strategies come into play in those first few seconds. If not, there may be little benefit in teaching children strategies to enable them to delay gratification. It seems less a problem of developing strategies for gratification, and more one of basic impulse control. To increase future achievement, broader behaviour and cognitive change might be required.

Teacher expectations and self-fulfilling prophecies

I first came across the idea of teacher expectations turning into self-fulfilling prophecies more than a decade ago, in Stephen Covey’s The 7 Habits of Highly Effective People:

One of the classic stories in the field of self-fulfilling prophecies is of a computer in England that was accidentally programmed incorrectly. In academic terms, it labeled a class of “bright” kids “dumb” kids and a class of supposedly “dumb” kids “bright.” And that computer report was the primary criterion that created the teachers’ paradigms about their students at the beginning of the year.

When the administration finally discovered the mistake five and a half months later, they decided to test the kids again without telling anyone what had happened. And the results were amazing. The “bright” kids had gone down significantly in IQ test points. They had been seen and treated as mentally limited, uncooperative, and difficult to teach. The teachers’ paradigms had become a self-fulfilling prophecy.

But scores in the supposedly “dumb” group had gone up. The teachers had treated them as though they were bright, and their energy, their hope, their optimism, their excitement had reflected high individual expectations and worth for those kids.

These teachers were asked what it was like during the first few weeks of the term. “For some reason, our methods weren’t working,” they replied. “So we had to change our methods.” The information showed that the kids were bright. If things weren’t working well, they figured it had to be the teaching methods. So they worked on methods. They were proactive; they worked in their Circle of Influence. Apparent learner disability was nothing more or less than teacher inflexibility.

I tried to find the source for this story, and failed. But what I did find was a similar concept called the Pygmalion effect, and assumed that Covey’s story was a mangled or somewhat made-up telling of that research.

What is the Pygmalion effect? It has appeared in my blog feed twice in the past two weeks. Here’s a slice from the first, by Shane Parrish at Farnam Street, describing the effect and the most famous study in the area:

The Pygmalion effect is a psychological phenomenon wherein high expectations lead to improved performance in a given area. Its name comes from the story of Pygmalion, a mythical Greek sculptor. Pygmalion carved a statue of a woman and then became enamored with it. Unable to love a human, Pygmalion appealed to Aphrodite, the goddess of love. She took pity and brought the statue to life. The couple married and went on to have a daughter, Paphos.

Research by Robert Rosenthal and Lenore Jacobson examined the influence of teachers’ expectations on students’ performance. Their subsequent paper is one of the most cited and discussed psychological studies ever conducted.

Rosenthal and Jacobson began by testing the IQ of elementary school students. Teachers were told that the IQ test showed around one-fifth of their students to be unusually intelligent. For ethical reasons, they did not label an alternate group as unintelligent and instead used unlabeled classmates as the control group. It will doubtless come as no surprise that the “gifted” students were chosen at random. They should not have had a significant statistical advantage over their peers. As the study period ended, all students had their IQs retested. Both groups showed an improvement. Yet those who were described as intelligent experienced much greater gains in their IQ points. Rosenthal and Jacobson attributed this result to the Pygmalion effect. Teachers paid more attention to “gifted” students, offering more support and encouragement than they would otherwise. Picked at random, those children ended up excelling. Sadly, no follow-up studies were ever conducted, so we do not know the long-term impact on the children involved.

The increases in IQ were 8 IQ points for the control group, and 12 points for those who were “growth spurters”. (The papers describing the study – from 1966 (pdf) and 1968 (pdf) – are somewhat thin on the experimental methodology, but it seems the description used in the study was “growth spurters” or high scorers in a “test for intellectual blooming”).

I always took the Pygmalion effect with a grain of salt. Most educational interventions have little to zero effect – particularly over the long-run – even when they involve far more than giving a label.

As it turns out, the story is not as clean as Parrish and others typically tell it. There have been battles over the Pygmalion effect since the original paper, with failed replications, duelling meta-analyses and debates about what the Pygmalion effect actually is.

Bob C-J discusses this at The Introduction to the New Statistics (HT: Slate Star Codex – the second appearance of the Pygmalion effect in my feed). Here is a cut of Bob C-J’s summary of these battles:

The original study was shrewdly popularized and had an enormous impact on policy well before sufficient data had been collected to demonstrate it is a reliable and robust result.

Critics raged about poor measurement, flexible statistical analysis, and cherry-picking of data.

That criticism was shrugged off.

Replications were conducted.

The point of replication studies was disputed.

Direct replications that showed no effect were discounted for a variety of post-hoc reasons.

Any shred of remotely supportive evidence was claimed as a supportive replication.  This stretched the Pygmalion effect from something specific (an impact on actual IQ) to basically any type of expectancy effect in any situation…. which makes it trivially true but not really what was originally claimed.  Rosenthal didn’t seem to notice or mind as he elided the details with constant promotion of the effect. …

Multiple rounds of meta-analysis were conducted to try to ferret out the real effect; though these were always contested by those on opposing sides of this issue.  …

Even though the best evidence suggests that expectation effects are small and cannot impact IQ directly, the Pygmalion Effect continues to be taught and cited uncritically.  The criticisms and failed replications are largely forgotten.

The truth seems to be that there *are* expectancy effects–but:

  • that there are important boundary conditions (like not producing real effects on IQ)
  • they are often small
  • and there are important moderators (Jussim & Harber, 2005).

The Jussim and Harber paper (pdf) Bob C-J references provides a great discussion of the controversy. (Bob C-J also recommends a book by Jussim). Here’s a section of the abstract:

This article shows that 35 years of empirical research on teacher expectations justifies the following conclusions: (a) Self-fulfilling prophecies in the classroom do occur, but these effects are typically small, they do not accumulate greatly across perceivers or over time, and they may be more likely to dissipate than accumulate; (b) powerful self-fulfilling prophecies may selectively occur among students from stigmatized social groups; (c) whether self-fulfilling prophecies affect intelligence, and whether they in general do more harm than good, remains unclear, and (d) teacher expectations may predict student outcomes more because these expectations are accurate than because they are self-fulfilling.

That paper contains some amusing facts about the original Rosenthal and Jacobson study. Some students had pre-test IQ scores near zero, others near 200, yet “the children were neither vegetables nor geniuses.” Exclude scores outside of the range 60 to 160, and the effect disappears. Five of the “bloomers” had increases of over 90 IQ points. Again, exclude these five and the effect disappears. The original study is basically worthless. While there is something to the effect of teacher expectations on students, the gap between the story telling and reality is rather large.

Bankers are more honest than the rest of us

Well, probably not. But that’s one interpretation you could take from the oft-quoted and cited Nature paper by Cohn and colleagues, Business culture and dishonesty in the banking industry. That bankers are more honest is as plausible as the interpretation of the experiment provided by the authors.

As background to this paper, here’s an extract from the abstract:

[W]e show that employees of a large, international bank behave, on average, honestly in a control condition. However, when their professional identity as bank employees is rendered salient, a significant proportion of them become dishonest. … Our results thus suggest that the prevailing business culture in the banking industry weakens and undermines the honesty norm, implying that measures to re-establish an honest culture are very important.

I’ve known of this paper since it was first published (plenty of media and tweets), but have always placed it in the basket of likely not true and unlikely to be replicated. Show me some pre-registered replications and I would pay attention. As a result, I didn’t investigate any further.

But recently Koen Smets pointed me toward a working paper from Jean-Michel Hupé that critiqued the statistical analysis. That paper in turn pointed to a critique by Vranka and Houdek, Many faces of bankers’ identity: how (not) to study dishonesty.

These critiques caused me to go back to the Nature paper – and importantly, to the supplementary materials – and read it in detail. It has a host of problems besides being unlikely to replicate. The most interesting of these could lead us to ask whether bankers are actually more honest.

The experiment

Cohn and friends recruited 128 bank employees and randomly split them into two groups, the treatment and control. Before undertaking the experimental task, the treatment group was “primed” with a series of questions that reminded them that they were a bank employee (e.g. At which bank are you presently employed?). The control group were asked questions unrelated to their professional identity.

The experimenters then asked each member of these two groups to flip a coin 10 times, reporting the result via a computer. No-one else could see what they had flipped. For each flip that came up the right way, the experimenters paid them (approximately) $20 (or more precisely, they would be paid $20 per flip if they equalled or outperformed a randomly selected colleague). Ten correct flips and you could have $200 coming your way.

So how can we know if any particular person is telling the truth? You can’t. But across a decent-sized group, you know the distribution of results you would expect (a binomial distribution with a success probability of 0.5 for each flip). You would expect, on average, half of the flips to be reported as successful. Someone reporting 10 successful flips out of 10 is roughly a 1 in a thousand event. By comparing the distribution of the reported results to what you would expect, you can infer the level of cheating.
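As a rough sketch of that inference (the group size of 64 is just the 128 recruits split in two; the paper’s own statistical tests are more involved than this simple benchmark):

```python
from scipy.stats import binom, binomtest

n_people, flips_each = 64, 10        # roughly one arm of the 128 recruits
total_flips = n_people * flips_each

# A single person reporting 10 successes out of 10: about a 1-in-1,000 event if honest.
print(binom.pmf(10, 10, 0.5))

# Compare a group-level reported success rate (58.2% for the treated bankers)
# against the expectation under honest reporting.
reported = round(0.582 * total_flips)
print(binomtest(reported, total_flips, p=0.5, alternative="greater").pvalue)
```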

So, how did the bankers go? In the control group, 51.6% of coin flips were successful. It’s slightly more than 50%, but within the realms of chance for a group of honest coin flippers. The bankers primed with their professional identity reported 58.2% successful flips, 6.6 percentage points more than the control group. The dishonest bandits.

But how do we know that this result is particular to bankers? What if we primed other professionals with their profession? What if we took a group with no connection to the banking industry and primed them with banking concepts?

Cohn and friends answered these questions directly. When they primed a group of non-banking professionals with their professional identity, they reported 3 percentage points fewer successful coin flips than those in a control condition. Students primed with banking concepts also reported around 1.5 percentage points fewer successes. These differences weren’t statistically significant and could have happened by chance, with no detectable effect from the primes.

These experimental outcomes are the centrepiece behind the conclusion that the prevailing culture in banking weakens and undermines the honesty norm.

But now let’s go to the supplementary materials and learn a bit more about these non-banking professionals and students.

An alternative interpretation

I have only reported the differences in successful coin flips above – as did the authors in the main paper (in a chart, Figure 3a). So how many successes did these non-banking professionals and students have?

In the control condition, the non-banking professionals reported 59.8% successful flips. This dropped to 55.8% when primed with their professional identity. The students were also dishonest bandits, reporting 57.9% successful flips in the control condition, and 56.4% in the banking prime condition.

So looking across the three groups (bankers, non-banking professionals and students), the only honest group we have come across are the bankers in the control condition.

This raises the question of what the appropriate reference point for this analysis is. Should we be asking if banking primes induce banker dishonesty? Or should we be asking whether the control primes – which were designed to be innocuous – can induce honesty? To accept that the banking prime induces bankers to cheat more, we also need to have a starting point that bankers, on the whole, cheat less.

I don’t see a great deal of value in trying to interpret this result and determine which of these frames is correct, as the result is just noise. It is unlikely to replicate. But once you look at these numbers, the interpretation by Cohn and friends appears to be little more than an overly keen attempt to get the results to fit their “theoretical framework”.

Other problems

I’ve just picked my favourite problem, but the two critiques I linked above argue that there are others. Vranka and Houdek suggest that there are many other ways to interpret the results. I agree with that overarching premise, but am less convinced by some of their suggested alternatives, such as the presence of stereotype or money primes. Those primes seem as robust as this banking prime is likely to be.

Hupé critiques the statistical approach, with which I also have some sympathy, but I haven’t spent enough time thinking about it to agree with his suggested alternative approach.

A quick afterthought

That this experimental result is bunk is not a reason to dismiss the idea that banking culture is poor or that exposure to that culture increases dishonesty. The general problem with the priming literature is that it attempts to elicit differences through primes that are insignificant relative to the actual environments people face.

For example, there is a large difference between answering a few questions about banking and working in a bank. In the latter, you are surrounded by other people, interacting with them daily, seeing what they do. Just because a few questions do not produce an effect doesn’t mean that months of exposure to your work environment won’t change behaviour. Unfortunately, experiments such as this add approximately zero useful information as to whether this is actually the case.

Noise

Daniel Kahneman has a new book in the pipeline called Noise. It is to be co-authored with Cass Sunstein and Olivier Sibony, and will focus on the “chance variability in human judgment”, the “noise” of the book’s title.

I hope the book is more Kahneman than Sunstein. For all Thinking, Fast and Slow’s faults, it is a great book. You can see the thought that went into constructing it.

Sunstein’s recent books feel like research papers pulled together by a university student – which might not be too far from the truth given the fleet of research assistants at Sunstein’s command. Part of the flatness of Sunstein’s books might also come from his writing pace – he writes more than a book a year. (I count over 30 on his Wikipedia page since 2000, and 10 in the last five years.) Hopefully Kahneman will slow things down, although with a planned publication date of 2020, Noise will be a shorter project than Thinking, Fast and Slow.

What is noise?

Kahneman has already written about noise, most prominently with three colleagues in Harvard Business Review. In that article they set out the case for examining noise in decision-making and how to address it.

Part of that article is spent distinguishing noise from bias. Your bathroom scale is biased if it always reads four kilograms too heavy. If it gives you a different reading each time you get on the scale, it is noisy. Decisions can be noisy, biased, or both. A biased, low-noise decision will always be wrong. A biased, high-noise decision will be all over the shop but might occasionally get lucky.
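The scale analogy is easy to simulate. The numbers below are made up; the only point is that bias shows up as a shifted average while noise shows up as spread around that average.

```python
# Toy simulation of the bathroom scale analogy (illustrative numbers only).
import numpy as np

rng = np.random.default_rng(0)
true_weight = 80.0  # kg, hypothetical

scales = {
    "biased, low noise": true_weight + 4.0 + rng.normal(0, 0.05, size=10),
    "unbiased, noisy":   true_weight + rng.normal(0, 3.0, size=10),
    "biased and noisy":  true_weight + 4.0 + rng.normal(0, 3.0, size=10),
}

for name, readings in scales.items():
    error = readings.mean() - true_weight   # bias: shift of the average
    spread = readings.std(ddof=1)           # noise: scatter of the readings
    print(f"{name:18s} mean error {error:+.2f} kg, spread {spread:.2f} kg")
```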

One piece of evidence for noise in decision-making is the degree to which people will contradict their own prior judgments. Pathologists assessing biopsy results had a correlation of 0.63 with their own judgment of severity when shown the same case twice (the HBR article states 0.61, but I read the referenced article as stating 0.63). Software programmers differed by a median of 71% in the estimates for the same project, with a correlation of 0.7 between their first and second effort. The lack of consistency in decision-making only grows once you start looking across people.

I find the concept of noise a useful way of thinking about decision-making. One of the main reasons why simple algorithms are typically superior to human decision makers is not because of bias or systematic errors by the humans, but rather the inconsistency of human judgment. We are often all over the place.

Noise is also a good way of identifying those domains where arguments about the power of human intuition and decision-making (which I often make) fall down. Simple heuristics can make us smart. Developed in the right circumstances, naturalistic decision-making can lead to good decisions. But where human decisions are inconsistent, or noisy, it is often not difficult to identify better alternatives.

Measuring noise

One useful feature of noise is that you can measure it without knowing the correct or best decision. If you don’t know your weight, it is hard to tell whether the scale is biased. But the fact that it gives different readings as you get on, off, and on again points to the noise. If you have a decision for which there is a large lag before you know if it was the right one, this lag is an obstacle to measuring bias, but not to measuring noise.

This ability to measure noise without knowing the right answer also avoids many of the debates about whether the human decisions are actually biased. Two inconsistent decisions cannot both be right.

You can measure noise in an organisation’s decision-making processes by examining pairs of decision makers and calculating the relative deviation of their judgments from each other. If one decision maker recommends, say, a price of $200, and the other $400, the noise is about 67% (they were $200 apart, with the average of the two being $300, and 200/300 ≈ 0.67). You average this relative difference across all possible pairs to give the noise score for that decision.

The noise score has an intuitive meaning. It is the expected relative difference if you picked any two decision makers at random.
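A minimal sketch of that calculation, using the $200/$400 pair above plus a made-up panel of price recommendations:

```python
# Pairwise noise score: for each pair of judgments, take the absolute
# difference divided by the pair's mean, then average over all pairs.
from itertools import combinations

def noise_score(judgments):
    """Average relative difference across all pairs of judgments."""
    pairs = list(combinations(judgments, 2))
    return sum(abs(a - b) / ((a + b) / 2) for a, b in pairs) / len(pairs)

print(f"{noise_score([200, 400]):.0%}")    # the two-person example: ~67%

prices = [200, 250, 320, 400, 280]         # hypothetical panel of prices
print(f"{noise_score(prices):.0%}")
```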

In the HBR article, Kahneman and colleagues report on the noise measurements for ten decisions in two financial services organisations. The noise was between 34% and 62% for the six decisions in organisation A, with an average of 48%. Noise was between 46% and 70% for the four decisions in organisation B, with an average of 60%. This was substantially above the organisations’ expectations. Experience of the decision makers did not appear to reduce noise.

Reducing noise

The main solution proposed by Kahneman and friends to reduce noise is replacing human judgement with algorithms. Because an algorithm returns the same decision every time it is given the same inputs, it is noise free.

Rather than suggesting a complex algorithm, Kahneman and friends propose what they call a “reasoned rule”. Here are the five steps in developing a reasoned rule, with loan application assessment as an example (a rough code sketch follows the list):

  1. Select six to eight variables that are distinct and obviously related to the predicted outcome. Assets and revenues (weighted positively) and liabilities (weighted negatively) would surely be included, along with a few other features of loan applications.
  2. Take the data from your set of cases (all the loan applications from the past year) and compute the mean and standard deviation of each variable in that set.
  3. For every case in the set, compute a “standard score” for each variable: the difference between the value in the case and the mean of the whole set, divided by the standard deviation. With standard scores, all variables are expressed on the same scale and can be compared and averaged.
  4. Compute a “summary score” for each case―the average of its variables’ standard scores. This is the output of the reasoned rule. The same formula will be used for new cases, using the mean and standard deviation of the original set and updating periodically.
  5. Order the cases in the set from high to low summary scores, and determine the appropriate actions for different ranges of scores. With loan applications, for instance, the actions might be “the top 10% of applicants will receive a discount” and “the bottom 30% will be turned down.”
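Here is a rough sketch of those five steps in Python. The column names and figures are hypothetical, not from the article; the weights simply encode the positive and negative directions described in step 1.

```python
# A minimal sketch of a "reasoned rule" for loan applications, assuming a
# table with hypothetical assets, revenue and liabilities columns.
import pandas as pd

VARIABLES = {"assets": 1, "revenue": 1, "liabilities": -1}  # sign = direction of effect

def fit_reasoned_rule(history: pd.DataFrame):
    """Step 2: compute each variable's mean and standard deviation on past cases."""
    cols = list(VARIABLES)
    return history[cols].mean(), history[cols].std()

def summary_scores(cases: pd.DataFrame, means, stds):
    """Steps 3-4: standardise each variable, apply the signs, and average."""
    z = (cases[list(VARIABLES)] - means) / stds
    return (z * pd.Series(VARIABLES)).mean(axis=1)

# Hypothetical past applications (step 2's "set of cases")
history = pd.DataFrame({"assets":      [120, 80, 200, 50],
                        "revenue":     [30, 20, 60, 10],
                        "liabilities": [40, 70, 90, 30]})
means, stds = fit_reasoned_rule(history)

# Step 5: rank cases by summary score and act on the ranges you choose
ranked = summary_scores(history, means, stds).sort_values(ascending=False)
print(ranked)
```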

The reliability of this reasoned rule – it returns the same outcome every time – gives it a large advantage over the human.

I suspect that most lenders are already using more sophisticated models than this, but the strength of a simple approach was shown in Robyn Dawes’s classic article The Robust Beauty of Improper Linear Models in Decision Making (ungated pdf). You typically don’t need a “proper” linear model, such as that produced by regression, to outperform human judgement.

As a bonus, improper linear models, as they are less prone to overfitting, often perform well compared to proper models (as per Simple Heuristics That Make Us Smart). Fear of the expense of developing a complex algorithm is not an excuse to leave the human decisions alone.
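As a toy illustration of that point, the sketch below compares an improper unit-weight model with a regression fitted on a small sample, using made-up data in which every predictor genuinely matters and points in the right direction:

```python
# Toy out-of-sample comparison: unit weights versus fitted regression weights.
# The data-generating process below is an assumption for illustration only.
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, k = 30, 1000, 6
true_w = rng.uniform(0.5, 1.5, size=k)               # every predictor is relevant

def simulate(n):
    X = rng.normal(size=(n, k))                      # predictors already standardised
    y = X @ true_w + rng.normal(scale=3.0, size=n)   # noisy outcome
    return X, y

X_tr, y_tr = simulate(n_train)
X_te, y_te = simulate(n_test)

beta = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]    # "proper" regression weights
predictions = {
    "proper (regression)":     X_te @ beta,
    "improper (unit weights)": X_te.sum(axis=1),
}
for name, pred in predictions.items():
    r = np.corrcoef(pred, y_te)[0, 1]
    print(f"{name:25s} out-of-sample correlation: {r:.3f}")
```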

Ultimately the development of the reasoned rule cannot avoid the question of what the right answer to the problem is. It will take time to determine definitively whether it outperforms. But if the human decision is noisy, there is an excellent chance that the rule will hit closer to the mark, on average, than the scattered human decisions.