People should use their judgment … except they’re often lousy at it

My Behavioral Scientist article, Don’t Touch The Computer, was in part a reaction to Andrew McAfee and Erik Brynjolfsson’s book The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies. In particular, I felt their story of freestyle chess as an illustration of how humans and machines can work together was somewhat optimistic.

I have just read McAfee and Brynjolfsson’s Machine, Platform, Crowd: Harnessing Our Digital Future. Chapter 2, titled The Hardest Thing to Accept About Ourselves, runs a line somewhat closer to mine. Here are some snippets:

[L]et people develop and exercise their intuition and judgment in order to make smart decisions, while the computers take care of the math and record keeping. We’ve heard about and seen this division of labor between minds and machines so often that we call it the “standard partnership.”

The standard partnership is compelling, but sometimes it doesn’t work very well at all. Getting rid of human judgments altogether—even those from highly experienced and credentialed people—and relying solely on numbers plugged into formulas, often yields better results.

Here’s one example:

Sociology professor Chris Snijders used 5,200 computer equipment purchases by Dutch companies to build a mathematical model predicting adherence to budget, timeliness of delivery, and buyer satisfaction with each transaction. He then used this model to predict these outcomes for a different set of transactions taking place across several different industries, and also asked a group of purchasing managers in these sectors to do the same. Snijders’s model beat the managers, even the above-average ones. He also found that veteran managers did no better than newbies, and that, in general, managers did no better looking at transactions within their own industry than at distant ones.

This is a general finding:

A team led by psychologist William Grove went through 50 years of literature looking for published, peer-reviewed examples of head-to-head comparisons of clinical and statistical prediction (that is, between the judgment of experienced, “expert” humans and a 100% data-driven approach) in the areas of psychology and medicine. They found 136 such studies, covering everything from prediction of IQ to diagnosis of heart disease. In 48% of them, there was no significant difference between the two; the experts, in other words, were on average no better than the formulas. A much bigger blow to the notion of human superiority in judgment came from the finding that in 46% of the studies considered, the human experts actually performed significantly worse than the numbers and formulas alone. This means that people were clearly superior in only 6% of cases. And the authors concluded that in almost all of the studies where humans did better, “the clinicians received more data than the mechanical prediction.”

Despite this victory for the algorithms, it still seems a good idea to check their output:

In many cases … it’s a good idea to have a person check the computer’s decisions to make sure they make sense. Thomas Davenport, a longtime scholar of analytics and technology, calls this taking a “look out of the window.” The phrase is not simply an evocative metaphor. It was inspired by an airline pilot he met who described how he relied heavily on the plane’s instrumentation but found it essential to occasionally visually scan the skyline himself.

But …

As companies adopt this approach, though, they will need to be careful. Because we humans are so fond of our judgment, and so overconfident in it, many of us, if not most, will be too quick to override the computers, even when their answer is better. But Chris Snijders, who conducted the research on purchasing managers’ predictions highlighted earlier in the chapter, found that “what you usually see is [that] the judgment of the aided experts is somewhere in between the model and the unaided expert. So the experts get better if you give them the model. But still the model by itself performs better.”

So, measure which is best:

We support having humans in the loop for exactly the reasons that Meehl and Davenport described, but we also advocate that companies “keep score” whenever possible—that they track the accuracy of algorithmic decisions versus human decisions over time. If the human overrides do better than the baseline algorithm, things are working as they should. If not, things need to change, and the first step is to make people aware of their true success rate.
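A minimal sketch of what that scorekeeping could look like in practice (the structure and field names here are my own, not the book’s):

```python
from dataclasses import dataclass

@dataclass
class Decision:
    algorithm_said: bool  # the algorithm's recommendation
    human_said: bool      # the decision actually taken (possibly an override)
    outcome: bool         # what actually happened

def accuracy(predictions, decisions):
    pairs = list(zip(predictions, decisions))
    return sum(p == d.outcome for p, d in pairs) / len(pairs)

def keep_score(decisions):
    """Compare the algorithm's accuracy with the accuracy of decisions as taken,
    and report how the human overrides fared."""
    overrides = [d for d in decisions if d.human_said != d.algorithm_said]
    return {
        "algorithm_accuracy": accuracy([d.algorithm_said for d in decisions], decisions),
        "as_taken_accuracy": accuracy([d.human_said for d in decisions], decisions),
        "override_rate": len(overrides) / len(decisions),
        "override_accuracy": (accuracy([d.human_said for d in overrides], overrides)
                              if overrides else None),
    }

# Toy example: the second decision is an override that turned out to be wrong.
history = [Decision(True, True, True), Decision(True, False, True), Decision(False, False, False)]
print(keep_score(history))
```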

Accept that the result will often be to defer to the algorithm:

Most of us have a lot of faith in human intuition, judgment, and decision-making ability, especially our own …. But the evidence on this subject is so clear as to be overwhelming: data-driven, System 2 decisions are better than those that arise out of our brains’ blend of System 1 and System 2 in the majority of cases where both options exist. It’s not that our decisions and judgment are worthless; it’s that they can be improved on. The broad approaches we’ve seen here—letting algorithms and computer systems make the decisions, sometimes with human judgment as an input, and letting the people override them when appropriate—are ways to do this.

And from the chapter summary:

The evidence is overwhelming that, whenever the option is available, relying on data and algorithms alone usually leads to better decisions and forecasts than relying on the judgment of even experienced and “expert” humans.

Many decisions, judgments, and forecasts now made by humans should be turned over to algorithms. In some cases, people should remain in the loop to provide commonsense checks. In others, they should be taken out of the loop entirely.

In other cases, subjective human judgments should still be used, but in an inversion of the standard partnership: the judgments should be quantified and included in quantitative analyses.

Algorithms are far from perfect. If they are based on inaccurate or biased data, they will make inaccurate or biased decisions. These biases can be subtle and unintended. The criterion to apply is not whether the algorithms are flawless, but whether they outperform the available alternatives on the relevant metrics, and whether they can be improved over time.

As for the remainder of the book, I have mixed views. I enjoyed the chapters on machines. The four chapters on platforms and the first two on crowds were less interesting, and much could have been written five years ago (e.g. the stories on Wikipedia, Linux, two-sided platforms). The closing two chapters on crowds, which discussed decentralisation, complete contracts and the future of the firm, were, however, excellent.


Philip Tetlock on messing with the algorithm

From an 80,000 Hours podcast episode:

Robert Wiblin: Are you a super forecaster yourself?

Philip Tetlock: No. I could tell you a story about that. I actually thought I could be, I would be. So in the second year of the forecasting tournament, by which time I should’ve known enough to know this was a bad idea. I decided I would enter into the forecasting competition and make my own forecasts. If I had simply done what the research literature tells me would’ve been the right thing and looked at the best algorithm that distills the most recent forecast or the best forecast and then extremises as a function of the diversity of the views within, if I had simply followed that, I would’ve been the second best forecaster out of all the super forecasters. I would have been like a super, super forecaster.

However, I insisted … What I did is I struck a kind of compromise. I didn’t have as much time as I needed to research all the questions, so I deferred to the algorithms with moderate frequency. I often tweaked them. I often said they’re not right about that, I’m going to tweak this here, I’m going to tweak this here. The net effect of all my tweaking effort was to move me from being in second place, which I would’ve been if I’d mindlessly adopted the algorithmic prediction, to about 35th place. So that was … I fell 33 positions thanks to the cognitive effort I devoted there.

Tetlock was tweaking an algorithm that is built on human inputs (forecasts), so this isn’t a lesson that we can leave decision-making to an algorithm. The humans are integral to the process. But it is yet another story of humans taking algorithmic outputs and making them worse.
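For illustration, the kind of aggregation Tetlock describes – average the individual forecasts, then “extremise” the result away from 0.5 – might look something like the sketch below. The extremising exponent is a placeholder of my own, not the Good Judgment Project’s actual parameter:

```python
def aggregate_forecasts(probabilities, a=2.0):
    """Combine individual probability forecasts for a binary event.

    Averages the forecasts, then 'extremises' the result: pushes it away
    from 0.5 on the logic that independent forecasters each hold only part
    of the available information. The exponent a is a placeholder; in
    practice it would be tuned (e.g. to the diversity of the forecasts).
    """
    p = sum(probabilities) / len(probabilities)
    return p ** a / (p ** a + (1 - p) ** a)

# Example: five forecasters all lean the same way; the aggregate leans harder.
print(aggregate_forecasts([0.7, 0.65, 0.8, 0.6, 0.75]))  # ~0.84 vs a raw mean of 0.70
```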

The question of where we should simply hand over forecasting decisions to algorithms is being explored in a new IARPA tournament involving human, machine, and human-machine hybrid forecasters. It will create some interesting data on the boundaries of where each performs best – although the algorithm described by Tetlock above and used by the Good Judgment team suggests that even a largely human system will likely need statistical combination of forecasts to succeed.

Robert Wiblin: [F]irst, you have a new crowdsourcing tournament going on now, don’t you, called Hybrid Mind?

Philip Tetlock: Well, I wouldn’t claim that it belongs to me. It belongs to IARPA, the Intelligence Advanced Research Projects Activity, which is the same operation in the US intelligence community that ran the earlier forecasting tournament. The new one is called Hybrid Forecasting Competition, and it, I think, represents a very important new development in forecasting technology. It pits humans against machines against human-machine hybrids, and they’re looking actively for human volunteers.

So that is the place to go if you want to volunteer.

Well, there are a lot of unknowns. It may seem obvious that machines will have an advantage when you’re dealing with complex quantitative problems. It would be very hard for humans to do better than machines when you’re trying to forecast, say, patterns of economic growth in OECD countries where you have very rich, pre-quantified time series, cross-sectional data sets, correlation matrices, lots of macro models. It’s hard to imagine people doing much better than that, but it’s not impossible because the models often overfit.

So far as the better forecasters are aware of turbulence on the horizon and appropriately adjust their forecasts, they could even have an advantage on turf where we might assume machines would be able to do better.

So there’s a domain, I think, of questions where there’s kind of a presumption among many people who observe these things that the machines have an advantage. Then there are questions where people sort of scratch their heads and say how could the machines possibly do questions like this? Here, they have in mind the sorts of questions that were posed, many of the questions that were posed anyway, on the earlier IARPA forecasting tournament, the one that led to the discovery of super forecasters.

These are really hard questions about how long is the Syrian civil war going to last in 2012? Is the war going to last another six months or another 12 months? When the Swiss and French medical authorities do an autopsy on Yasser Arafat, will they discover polonium? It’s hard to imagine machines getting a lot of traction on many of these quite idiosyncratic context-specific questions where it’s very difficult to conjure any kind of meaningful statistical model.

Although, when I say it’s hard to construct those things, it doesn’t mean it’s impossible.

Finally, Robert Wiblin is a great interviewer. I recommend subscribing to the 80,000 Hours podcast.

Michael Lewis’s The Undoing Project: A Friendship That Changed The World

My journey into understanding human decision making started when I read Michael Lewis’s Moneyball in 2005. The punchline – which, as it turns out, has been known across numerous domains since at least the 1950s – is that “expert” judgement is often outperformed by simple statistical analysis.

A couple of years later I read Malcolm Gladwell’s Blink and was diverted into the world of Gary Klein, which then led me to Kahneman and Tversky among others. It was only then that I started to think about what it is that causes the experts to under-perform. (For all Gladwell’s flaws, he is a great gateway to new ideas.)

In the opening to The Undoing Project: A Friendship That Changed The World, Lewis tells of a similar intellectual journey (although obviously with a somewhat closer connection to Moneyball):

[O]nce the dust had settled on the responses to my book [Moneyball], one of them remained more alive and relevant than the others: a review by a pair of academics, then both at the University of Chicago—an economist named Richard Thaler and a law professor named Cass Sunstein. Thaler and Sunstein’s piece, which appeared on August 31, 2003, in the New Republic, managed to be at once both generous and damning. The reviewers agreed that it was interesting that any market for professional athletes might be so screwed-up that a poor team like the Oakland A’s could beat most rich teams simply by exploiting the inefficiencies. But—they went on to say—the author of Moneyball did not seem to realize the deeper reason for the inefficiencies in the market for baseball players: They sprang directly from the inner workings of the human mind. The ways in which some baseball expert might misjudge baseball players—the ways in which any expert’s judgments might be warped by the expert’s own mind—had been described, years ago, by a pair of Israeli psychologists, Daniel Kahneman and Amos Tversky. My book wasn’t original. It was simply an illustration of ideas that had been floating around for decades and had yet to be fully appreciated by, among others, me.

Lewis realised that there was a deeper story to tell, with The Undoing Project the result.

I am increasingly of the view that a biography or autobiography is one of the more effective (although not always balanced) ways to lay out a set of ideas. Between The Undoing Project and Richard Thaler’s Misbehaving, a layperson would struggle to find a more accessible and interesting introduction to behavioural science and behavioural economics.

The first substantive chapter of the Undoing Project focuses on Daryl Morey, General Manager of the Houston Rockets. It felt like a Moneyball style essay for which Lewis hadn’t been able to find another use (although you can read this chapter on Slate). However, it was an interesting illustration of the idea that once you have the statistics in hand, it’s still hard to eliminate the involvement of the human mind. For instance:

If he could never completely remove the human mind from his decision-making process, Daryl Morey had at least to be alive to its vulnerabilities. He now saw these everywhere he turned. One example: Before the draft, the Rockets would bring a player in with other players and put him through his paces on the court. How could you deny yourself the chance to watch him play? But while it was interesting for his talent evaluators to see a player in action, it was also, Morey began to realize, risky. A great shooter might have an off day; a great rebounder might get pushed around. If you were going to let everyone watch and judge, you also had to teach them not to place too much weight on what they were seeing. (Then why were they watching in the first place?) If a guy was a 90 percent free-throw shooter in college, for instance, it really didn’t matter if he missed six free throws in a row during the private workout.

Morey leaned on his staff to pay attention to the workouts but not allow whatever they saw to replace what they knew to be true. Still, a lot of people found it very hard to ignore the evidence of their own eyes. A few found the effort almost painful, as if they were being strapped to the mast to listen to the Sirens’ song. One day a scout came to Morey and said, “Daryl, I’ve done this long enough. I think we should stop having these workouts. Please, just stop doing them.” Morey said, Just try to keep what you are seeing in perspective. Just weight it really low. “And he says, ‘Daryl, I just can’t do it.’ It’s like a guy addicted to crack,” Morey said. “He can’t even get near it without it hurting him.”

I tend to have little interest in personal histories, so I found the following chapters leading up to Kahneman and Tversky’s collaboration less interesting. In part, this is because any attempt to understand someone’s achievements in the context of their upbringing is little more than storytelling.

But once the book hits the development of Kahneman and Tversky’s ideas – their work on the basic heuristics (availability, representativeness, anchoring), the development of prospect theory, their work on happiness – the sequential discussion of how these ideas were developed added some real understanding (for me). You can also see the care that went into developing their work, with a desire to create something that would stand the test of time rather than create a headline through a cute result.

One of the more interesting parts of the book near the close relates to Kahneman and Tversky’s interaction with Gerd Gigerenzer (who I have written about a fair bit). While Lewis’s characterisation of Gigerenzer as an “evolutionary psychologist” is wide of the mark, Lewis captures well the frustration that I imagine Kahneman and Tversky must have felt during some of the exchanges. Lewis writes:

[I]n Danny and Amos’s view he’d ignored the usual rules of intellectual warfare, distorting their work to make them sound even more fatalistic about their fellow man than they were. He also downplayed or ignored most of their evidence, and all of their strongest evidence. He did what critics sometimes do: He described the object of his scorn as he wished it to be rather than as it was. Then he debunked his description.

This debate is interesting enough that I’ll explore it in more detail in a future post.

Angela Duckworth’s Grit: The Power of Passion and Perseverance

In Grit: The Power of Passion and Perseverance, Angela Duckworth argues that outstanding achievement comes from a combination of passion – a focused approach to something you deeply care about – and perseverance – a resilience and desire to work hard. Duckworth calls this combination of passion and perseverance “grit”.

For Duckworth, grit is important because focused effort is required both to build skill and to turn that skill into achievement. Talent combined with effort builds skill. Skill combined with effort produces achievement. Effort appears twice in the equation. If one spreads that effort across too many domains (no focus through lack of passion), the necessary skills will not be developed and those skills won’t be translated into achievement.
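Duckworth’s own formulation is multiplicative, which is why effort counting twice matters so much (the equations are paraphrased from the book; the one-line compression is mine):

```latex
\text{talent} \times \text{effort} = \text{skill}, \qquad
\text{skill} \times \text{effort} = \text{achievement}
\;\;\Rightarrow\;\;
\text{achievement} = \text{talent} \times \text{effort}^2
```

On this formulation, doubling effort quadruples achievement for a given level of talent.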

While sounding almost obvious written this way, Duckworth’s claims go deeper. She argues that in many domains grit is more important than “talent” or intelligence. And she argues that we can increase people’s grit through the way we parent, educate, coach and manage.

Three articles from 2016 (in Slate, The New Yorker and npr) critiquing Grit and the associated research make a lot of the points that I would. But before turning to those articles and my thoughts, I will say that Duckworth appears to be one of the most open recipients of criticism in academia that I have come across. She readily concedes good arguments, and appears caught between her knowledge of the limitations of the research and the need to write or speak in a strong enough manner to sell a book or make a TED talk.

That said, I am sympathetic with the Slate and npr critiques. Grit is not the best predictor of success. To the extent there is a difference between “grit” and the big five trait of conscientiousness, it is minor (making grit largely an old idea rebranded with a funkier name). A meta-analysis (working paper) by Marcus Credé, Michael Tynan and Peter Harms makes this case (and forms the basis of the npr piece).

Also critiqued in the npr article is Duckworth’s example of grittier cadets being more likely to make it through the seven-week West Point training program Beast Barracks, which features in the book’s opening. As she states, “Grit turned out to be an astoundingly reliable predictor of who made it through and who did not.”

The West Point research comes from two papers by Duckworth and colleagues from 2007 (pdf) and 2009 (pdf). The difference in drop-out rates is framed as rather large in the 2009 article:

“Cadets who scored a standard deviation higher than average on the Grit-S were 99% more likely to complete summer training”

But to report the results another way, 95% of all cadets made it through. 98% of the top quartile in grit stayed. As Marcus Credé states in the npr article, there is only a three percentage point difference between the average drop out rate and that of the grittiest cadets. Alternatively, you can consider that 88% of the bottom quartile made it through. That appears a decent success rate for these low grit cadets. (The number reported in the paper references the change in odds, which is not the way most people would interpret that sentence. But on Duckworth being a great recipient of criticism, she concedes in the npr article she should have put it another way.)
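To see how the odds framing flatters the result, here is the arithmetic (a rough illustration using the completion rates quoted above):

```python
# Roughly 95% of all cadets completed Beast Barracks.
baseline_completion = 0.95
baseline_odds = baseline_completion / (1 - baseline_completion)  # 19 to 1

# "99% more likely" refers to odds, so the odds roughly double for cadets
# a standard deviation higher on grit.
gritty_odds = baseline_odds * 1.99          # about 37.8 to 1
gritty_completion = gritty_odds / (1 + gritty_odds)

print(round(gritty_completion, 3))  # about 0.974, a gain of roughly 2.4 percentage points
```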

Having said this, I am sympathetic to the argument that there is something here that West Point could benefit from. If low grit were the underlying cause of cadet drop-outs, reducing the drop-out rate of the least gritty half to that of the top half could cut the overall drop-out rate by more than 50%. If they found a way of doing this (which I am more sceptical about), it could be a worthwhile investment.

One thing that I haven’t been able to determine from the two papers with the West Point analysis is the distribution of grit scores for the West Point cadets. Are they gritty relative to the rest of the population? In Duckworth’s other grit studies, the already high achievers (spelling bee contestants, Stanford students, etc.) look a lot like the rest of us. Why does it take no grit to enter into domains which many people would already consider to be success? Is this the same for West Point?

Possibly the biggest question I have about the West Point study is why people drop out. As Duckworth talks about later in the book (repeatedly), there is a need to engage in search to find the thing you are passionate about. Detours are to be expected. When setting top-level goals, don’t be afraid to erase an answer that isn’t working out. Finishing what you begin could be a way to miss opportunities. Be consistent over time, but first find a thing to be consistent with. If your mid-level goals are not aligned with your top level objective, abandon them. And so on. Many of the “grit paragons” that Duckworth interviewed for her book explored many different avenues before settling on the one that consumes them.

So, are the West Point drop-outs leaving because of low grit, or are they shifting to the next phase of their search? If we find them later in their life (at a point of success), will they then score higher on grit as they have found something they are passionate about that they wish to persevere with? How much of the high grit score of the paragons is because they have succeeded in their search? To what extent is grit simply a reflection of current circumstances?

One of the more interesting sections of the book addresses whether there are limits to what we can achieve due to talent. Duckworth’s major point is that we are so far from whatever limits we have that they are irrelevant.

On the one hand, that is clearly right – in almost every domain people could improve through persistent effort (and deliberate practice). But another consideration is where a person’s limits lie relative to the degree of skill required to achieve their goals. I am a long way from my limits as a tennis player, but my limits are well short of what is required to ever make a living from it.

Following from this, Duckworth is of the view that people should follow their passion and argues against the common advice that following your passion is the path to poverty. I’m with Cal Newport on this one, and think that “follow your passion” is horrible advice. If you don’t have anything of value to offer related to your passion, you likely won’t succeed.

The evidence behind Duckworth’s argument is mixed. She notes that people are more satisfied with jobs when they follow a personal interest, but this is not evidence that people who set out to find a job matching their interest end up more satisfied. Where are those who failed? Duckworth also notes that these people perform better, but again, what is the aggregate outcome of all the people who started out with this goal?

One chapter concerns parenting. Duckworth concedes that the evidence here is thin and incomplete, and that there are no randomised controlled trials. But she then suggests that she doesn’t have time to wait for the data to come in (which I suppose you don’t if you are already raising children).

She cites research on supportive versus demanding parenting, derived from measures such as surveys of students. These demonstrate that students with more demanding parents have higher grades. Similarly, research on world-class performers shows that their parents are models of work ethic. The next chapter reports on the positive relationship between extracurricular activities while at school and job outcomes, particularly where they stick with the same activity for two or more years (i.e. consistent parents).

But Duckworth does not address the typical problem of studies in this domain – they all ignore biology. Do the students receive higher grades because their parents are more demanding, or because they are the genetic descendants of two demanding people? Are they world-class performers because their parents model a work ethic, or because they have inherited a work ethic? Are they consistent with their extracurricular activities because their parents consistently keep them at it, or because they are the type of people likely to be consistent?

These questions might appear to be speculation in themselves, but the large catalogue of twin, adoption and now genetic studies points to the answers. To the degree children resemble their parents, this is largely genetic. The effect of the shared environment – i.e. parenting – is low (and in many studies zero). That is not to say interventions cannot be developed. But they are not reflected in the variation in parenting that is the subject of these studies.

Duckworth does briefly turn to genetics when making her case for the ability to change someone’s grit. Like a lot of other behavioural traits, the heritability of grit is moderate: 37% for perseverance, 20% for passion (the study referenced is here). Grit is not set in stone, so Duckworth takes this as a case for the effect of environment.

However, a heritability less than one provides little evidence that deliberate changes in environment can change a trait. The same study finding moderate heritability also found no effect of shared environment (e.g. parenting). The evidence of influence is thin.

Finally, Duckworth cites the Flynn effect as evidence of the malleability of IQ – and how similar effects could play out with grit – but she does not reference the extended trail of failed interventions designed to increase IQ (although a recent meta-analysis shows some effect of education). I can understand Duckworth’s aims, but feel that the literature in support of them is somewhat thin.

Other random points or thoughts:

  • As for any book that contains colourful stories of success linked to the recipe it is selling, the stories of the grit paragons smack of survivorship bias. Maybe the coach of the Seattle Seahawks pushes toward a gritty culture, but I’m not sure the other NFL teams go and get ice-cream every time training gets tough. Jamie Dimon, CEO of JP Morgan, is praised for the $5 billion profit JP Morgan gained through the GFC (let’s skate over the $13 billion in fines). How would another CEO have gone?
  • Do those with higher grit display a higher level of sunk cost fallacy, being unwilling to let go?
  • Interesting study – Tsay and Banaji, Naturals and strivers: Preferences and beliefs about sources of achievement. The abstract:

To understand how talent and achievement are perceived, three experiments compared the assessments of “naturals” and “strivers.” Professional musicians learned about two pianists, equal in achievement but who varied in the source of achievement: the “natural” with early evidence of high innate ability, versus the “striver” with early evidence of high motivation and perseverance (Experiment 1). Although musicians reported the strong belief that strivers will achieve over naturals, their preferences and beliefs showed the reverse pattern: they judged the natural performer to be more talented, more likely to succeed, and more hirable than the striver. In Experiment 2, this “naturalness bias” was observed again in experts but not in nonexperts, and replicated in a between-subjects design in Experiment 3. Together, these experiments show a bias favoring naturals over strivers even when the achievement is equal, and a dissociation between stated beliefs about achievement and actual choices in expert decision-makers.”

  • A follow-up study generalised the naturals and strivers research to some other domains.
  • Duckworth reports on the genius research of Catharine Cox, in which Cox looked at 300 eminent people and attempted to determine what it was that made them geniuses. All 300 had an IQ above 100. The average of the top 10 was 146. The average of the bottom 10 was 143. Duckworth points to the trivial link between IQ and ranking within that 300, with the substantive differentiator being level of persistence. But note those average IQ scores…

Dealing with algorithm aversion

Over at Behavioral Scientist is my latest contribution. From the intro:

The first American astronauts were recruited from the ranks of test pilots, largely due to convenience. As Tom Wolfe describes in his incredible book The Right Stuff, radar operators might have been better suited to the passive observation required in the largely automated Mercury space capsules. But the test pilots were readily available, had the required security clearances, and could be ordered to report to duty.

Test pilot Al Shepard, the first American in space, did little during his 15-minute first flight beyond being observed by cameras and a rectal thermometer (more on the “little” he did do later). Pilots rejected by Project Mercury dubbed Shepard “spam in a can.”


Astronaut Ham.

Other pilots were quick to note that “a monkey’s gonna make the first flight.” Well, not quite a monkey. Before Shepard, the first to fly in the Mercury space capsule was a chimpanzee named Ham, only 18 months removed from his West African home. Ham performed with aplomb.

But test pilots are not the type to like relinquishing control. The seven Mercury astronauts felt uncomfortable filling a role that could be performed by a chimp (or spam). Thus started the astronauts’ quest to gain more control over the flight and to make their function more akin to that of a pilot. A battle for decision-making authority—man versus automated decision aid—had begun.

Head on over to Behavioral Scientist to read the rest.

While the article draws quite heavily on Tom Wolfe’s The Right Stuff, the use of the story of the Mercury astronauts was somewhat inspired by Charles Perrow’s Normal Accidents. Perrow looks at the two sides of the problems that emerged during the Mercury missions – the operator error, which formed the opening of my article, and the designer error, which features in the close.

One issue that became apparent to me during drafting was the distinction between an algorithm determining a course of action, and the execution of that action through mechanical, electronic or other means. The example of the first space flights clearly has this issue. Many of the problems were not that the basic calculations (the algorithms) were faulty. Rather, the execution failed. In early drafts of the article I tried to draw this distinction out, but it made the article clunky. I ultimately reduced this point to a mention in the close. It’s something I might explore at a later time, because I suspect “algorithm aversion” when applied to self-driving cars relates to both decision making and execution.

Another issue that became stark was the limit of the superiority of algorithms. In the first draft, I did not return to the Mercury missions for the close. It was easy to talk of bumbling humans in the first space flights and how to guide them toward better use of algorithms. But that story was too neat, particularly given the example I had chosen. During the early flights there were plenty of times where the astronauts had to step in and save themselves. Perhaps if I had used a medical diagnosis or more typical decision scenario in the opening I could have written a cleaner article.

Regardless, the mix of operator and designer error (to use Perrow’s framing) has led me down a path of exploring how to use algorithms when the decision is idiosyncratic or is being made in a less developed system. The early space flights are one example, but strategic business decisions might be another. What is the right balance of algorithms and humans there? At this point, I’m planning for that to be the focus of my next Behavioral Scientist piece.

Dan Ariely’s Payoff: The Hidden Logic That Shapes Our Motivations

If you have read Dan Ariely’s The Upside of Irrationality, there will be few surprises for you in his TED book Payoff: The Hidden Logic That Shapes Our Motivations. TED books are designed to be slightly longer explorations of topics from TED talks, but short enough to be read in one sitting. That makes it an easy, enjoyable, but not particularly deep read, with most of the results covered in The Upside. (Ariely’s TED talk can be viewed at the bottom of this post.)

The focus of Payoff is how we are motivated in the workplace, how easy it is to kill that motivation, and why we value the things we have made ourselves. It also touches on (in a slightly out-of-place and underdeveloped final chapter) how our actions are affected by what people will think about us after death.

As in The Upside of Irrationality, Ariely sways between interesting experimental results and not particularly convincing riffs on their application to the real world. Take the following example (the major experimental result that appears unique to Payoff). Workers in a semiconductor plant in Israel were sent a message on day one of their four-day work stretch offering one of the following incentives if they met their target for the day:

  • A $30 bonus
  • A pizza voucher
  • A thank you text message from the boss
  • No message (the control group)

For people who were offered one of the three incentives, there was a boost to productivity on that day relative to the control: 4.9% for the cash group, 6.7% for the pizza group, and 6.6% for the thank you group.

The more interesting result was over the next three days. On day two, the group that had been incentivised with cash on day one had their productivity drop to 13.2% less than the control group. Absent the cash reward, they took their foot off the gas. On day three productivity was 6.2% worse. And on day four it was 2.9% worse. Over the four days, the productivity of the cash incentive group was 6.5% below that of the control. In contrast, the thank you group had no crash in productivity, with the pizza group somewhere in between. It seems the cash reward on day one, but not the other days, had sent a signal that day one was the only day when production mattered. Or the cash reward displaced some other form of motivation. Which of these it was is unclear.

Ariely turns the result into an attack on the idea that people work for pay and that more compensation will result in greater output. This is where Ariely’s riff and my take on the experimental results part ways.

I agree that there is more to work than merely the exchange of money for labour. Poorly designed incentives can backfire. You can crush motivation despite paying well. The way an incentive is designed can magnify or destroy its effect.

But Ariely sells the cash incentive short by making almost no comment on alternative designs. What if the bonus persisted, rather than being in place for only one day? How would a daily cash incentive perform against a canned thank you every day? What would productivity look like after a year?

I suspect Ariely is over-interpreting a narrow finding. The experiment was designed to demonstrate the poor structure of the existing incentive (the $30 bonus on day one) and to elicit an interesting effect, not to determine the best incentive structure. You only need to look at the overly creative ways people find to meet incentivised sales targets in financial services (e.g. Wells Fargo) to get a sense of how strongly people can be motivated by monetary bonuses. (Whether that is a good thing for the business is another matter. And to be honest, I haven’t actually checked that the Wells Fargo staff weren’t creating these fake accounts to receive more thank yous.)

So yes, think of motivation as being about more than money. Test whatever incentive systems you put in place. Test them over the long-term. But don’t start paying your staff in thank yous just yet.

Of those experiments reported in the Upside of Irrationality and repeated in Payoff, one of the more interesting is the destruction of motivation in a pointless task. People were paid to construct Lego Bionicles at a decreasing pay scale. After constructing one, they were then asked if they would like to construct another at a new lower rate. These people were grouped into two conditions. In one, their recently completed Bionicle would be placed to the side. In the other, the Bionicle would be destroyed in front of them and placed back into the box (the Sisyphic condition).

Those who saw their creations destroyed constructed fewer Bionicles. Most notably, the decline in productivity in the Sisyphic group was strongest among those who liked making Bionicles, reducing their productivity to the level of those who couldn’t care less.

Other random thoughts on the book:

  • Ariely argues that we value our food, gardens and houses less when we get others to take care of them for us, and suggests we should invest more ourselves (related to the IKEA effect). But what would the opportunity cost of this investment be?
  • Ariely takes a number of unfair pokes at Adam Smith and his story of the pin factory. Ariely suggests that specialisation and trade will destroy motivation as the person cannot see the whole (a la Marx), and that Smith’s idea is no longer relevant. I trust he makes his own pins.
  • One scenario where I felt the opposite inclination to Ariely was the following:

Imagine, for example, that you worked for me and I asked you to stay late three times over the next week to help complete a project ahead of deadline. At the end of the week, you will not have seen your family but will have come close to a caffeine overdose. As an expression of my gratitude I present you with one of two rewards. In option one, I tell you how much your extra hard work meant to me. I give you a warm and sincere hug and invite you and your family to dinner. In option two, I tell you that I have calculated your marginal contribution to the company’s bottom line, it totaled $27,800, and I tell you that I will give you a bonus of 5 percent of this amount ($1,390). Which scenario is more likely to maximise your goodwill toward the company and me, not just on that day, but moving forward? Which will inspire you to push extra hard to meet the next deadline?

AI in medicine: Outperforming humans since the 1970s

From an interesting a16z podcast episode Putting AI in Medicine, in Practice (I hope I got the correct names against who is saying what):

Mintu Turakhia (cardiologist at Stanford and Director of the Centre for Digital Health): AI is not new to medicine. Automated systems in healthcare have been described since the 1960s. And they went through various iterations of expert systems and neural networks and called many different things.

Hanne Tidnam: In what way would those show up in the 60s and 70s?

Mintu Turakhia: So at that time there was no high resolution, there weren’t too many sensors, and it was about a synthetic brain that could take what a patient describes as the inputs and what a doctor finds on the exam as the inputs.

Hanne Tidnam: Using verbal descriptions?

Mintu Turakhia: Yeah, basically words. People created, you know, what are called ontologies and classification structures. But you put in the ten things you felt and a computer would spit out the top 10 diagnoses in order of probability and even back then, they were outperforming sort of average physicians. So this is not a new concept.

This point about “average physicians” is interesting. In some circumstances you might be able to find someone who outperforms the AI. The truly extraordinary doctor. But most people are not treated by that star.

They continue:

Brandon Ballinger (CEO and founder of Cardiogram): So an interesting case study is the Mycin system which is from 1978 I believe. And so, this was an expert system trained at Stanford. It would take inputs that were just typed in manually and it would essentially try to predict what a pathologist would show. And it was put to the test against five pathologists. And it beat all five of them.

Hanne Tidnam: And it was already outperforming.

Brandon Ballinger: And it was already outperforming doctors, but when you go to the hospital they don’t use Mycin or anything similar. And I think this illustrates that sometimes the challenge isn’t just the technical aspects or the accuracy. It’s the deployment path, and so some of the issues around there are, OK, is there a convenient way to deploy this to actual physicians. Who takes the risk? What’s the financial model for reimbursement? And so if you look at the way the financial incentives work there are some things that are backwards, right. For example, if you think about kind of a hospital from the CFO’s perspective, misdiagnosis actually earns them more money because when you misdiagnose you do follow up tests, right, and those, and our billing system is fee for service, so every little test that’s done is billed for.

Hanne Tidnam: But nobody wants to be giving out wrong diagnoses. So where is the incentive? The incentive is just in the system, the money that results from it.

Brandon Ballinger: No-one wants to give incorrect diagnosis. On the other hand there’s no budget to invest in making better diagnosis. And so I think that’s been part of the problem. And things like fee for value are interesting because now you’re paying people for, say, an accurate diagnosis, or for a reduction in hospitalisations, depending on the exact system, so I think that’s a case where accuracy is rewarded with greater payment, which sets up the incentives so that AI can actually win in this circumstance.

Vijay Pande (a16z General Partner): Where I think AI has come back at us with a force is it came to healthcare as a hammer looking for a nail. What we’re trying to figure out is where you can implement it easily and safely with not too much friction and with not a lot of physicians going crazy, and where it’s going to be very very hard.

For better diagnoses, I’d be willing to drive a few physicians crazy.

The section on the types of error was also interesting:

Mintu Turakhia: There may be a point that it truly outperforms the cognitive abilities of physicians, and we have seen that with imaging so far. And some of the most promising aspects of the imaging studies and the EKG studies are that the confusion matrices, the way humans misclassify things, is recapitulated by the convolutional neural networks. …

A confusion matrix is a way to graph the errors and which directions they go. And so for rhythms on an EKG, a rhythm that’s truly atrial fibrillation could get classified as normal sinus rhythm, or atrial tachycardia, or super-ventricular tachycardia, the names are not important. What’s important is that the algorithms are making the same type of mistakes that humans are doing. It’s not that it’s making a mistake that’s necessarily more lethal, and just nonsensical so to speak. It recapitulates humans. And to me that’s the core thesis of AI in medicine, because if you can show that you are recapitulating human error, you’re not going to make it perfect, but that tells you that, in check and with control, you can allow this to scale safely since it’s liable to do what humans do. ….

Hanne Tidnam: And so you’re just saying it doesn’t have to be better. It just has to be making the same kinds of mistakes to feel that you can trust the decision maker.

Mintu Turakhia: Right. And you dip your toe in the water by having it be assistive. And then at some point we as a society will decide if it can go fully auto, right, fully autonomous without a doctor in the loop. That’s a societal issue. That’s not a technical hurdle at this point.

Certainly a heavy bias to the status quo. I’d prefer something with better net performance even if some of the mistakes are different.
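For readers who haven’t met the term, a confusion matrix for the EKG example might look something like this (the rhythm labels come from the quote above; the counts are invented for illustration):

```python
import numpy as np

# Rows: true rhythm; columns: predicted rhythm. Counts are illustrative only.
labels = ["atrial fibrillation", "sinus rhythm", "atrial tachycardia", "SVT"]
confusion = np.array([
    [88,  5,  4,  3],   # true AF mostly called AF, occasionally sinus/AT/SVT
    [ 6, 90,  2,  2],
    [ 5,  3, 85,  7],
    [ 4,  2,  9, 85],
])

# Turakhia's point: compare this matrix for the model with the same matrix for
# human readers. If the off-diagonal errors fall in the same cells, the model
# is making the same *kinds* of mistakes that humans make.
per_class_accuracy = confusion.diagonal() / confusion.sum(axis=1)
print(dict(zip(labels, per_class_accuracy.round(2))))
```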

Is there a “backfire effect”?

I saw the answer hinted at in a paper released mid last year (covered on WNYC), but Daniel Engber has now put together a more persuasive case:

Ten years ago last fall, Washington Post science writer Shankar Vedantam published an alarming scoop: The truth was useless.

His story started with a flyer issued by the Centers for Disease Control and Prevention to counter lies about the flu vaccine. The flyer listed half a dozen statements labeled either “true” or “false”—“Not everyone can take flu vaccine,” for example, or “The side effects are worse than the flu” —along with a paragraph of facts corresponding to each one. Vedantam warned the flyer’s message might be working in reverse. When social psychologists had asked people to read it in a lab, they found the statements bled together in their minds. Yes, the side effects are worse than the flu, they told the scientists half an hour later. That one was true—I saw it on the flyer.

This wasn’t just a problem with vaccines. According to Vedantam, a bunch of peer-reviewed experiments had revealed a somber truth about the human mind: Our brains are biased to believe in faulty information, and corrections only make that bias worse.

These ideas, and the buzzwords that came with them—filter bubbles, selective exposure, and the backfire effect—would be cited, again and again, as seismic forces pushing us to rival islands of belief.

Fast forward a few years:

When others tried to reproduce the research [Ian Skurnik’s vaccine research], though, they didn’t always get the same result. Kenzie Cameron, a public health researcher and communications scholar at Northwestern’s Feinberg School of Medicine, tried a somewhat similar experiment in 2009. … “We found no evidence that presenting both facts and myths is counterproductive,” Cameron concluded in her paper, which got little notice when it was published in 2013.

There have been other failed attempts to reproduce the Skurnik, Yoon, and Schwarz finding. For a study that came out last June, Briony Swire, Ullrich Ecker, and “Debunking Handbook” co-author Stephan Lewandowsky showed college undergrads several dozen statements of ambiguous veracity (e.g. “Humans can regrow the tips of fingers and toes after they have been amputated”).  … But the new study found no sign of this effect.

And on science done right (well done Brendan Nyhan and Jason Reifler):

Brendan Nyhan and Jason Reifler described their study, called “When Corrections Fail,” as “the first to directly measure the effectiveness of corrections in a realistic context.” Its results were grim: When the researchers presented conservative-leaning subjects with evidence that cut against their prior points of view—that there were no stockpiled weapons in Iraq just before the U.S. invasion, for example—the information sometimes made them double-down on their pre-existing beliefs. …

He [Tom Wood] and [Ethan] Porter decided to do a blow-out survey of the topic. Instead of limiting their analysis to just a handful of issues—like Iraqi WMDs, the safety of vaccines, or the science of global warming—they tried to find backfire effects across 52 contentious issues. … They also increased the sample size from the Nyhan-Reifler study more than thirtyfold, recruiting more than 10,000 subjects for their five experiments.

In spite of all this effort, and to the surprise of Wood and Porter, the massive replication effort came up with nothing. That’s not to say that Wood and Porter’s subjects were altogether free of motivated reasoning.

The people in the study did give a bit more credence to corrections that fit with their beliefs; in those situations, the new information led them to update their positions more emphatically. But they never showed the effect that made the Nyhan-Reifler paper famous: People’s views did not appear to boomerang against the facts. Among the topics tested in the new research—including whether Saddam had been hiding WMDs—not one produced a backfire.

Nyhan and Reifler, in particular, were open to the news that their original work on the subject had failed to replicate. They ended up working with Wood and Porter on a collaborative research project, which came out last summer, and again found no sign of backfire from correcting misinformation. (Wood describes them as “the heroes of this story.”) Meanwhile, Nyhan and Reifler have found some better evidence of the effect, or something like it, in other settings. And another pair of scholars, Brian Schaffner and Cameron Roche, showed something that looks a bit like backfire in a recent, very large study of how Republicans and Democrats responded to a promising monthly jobs report in 2012. But when Nyhan looks at all the evidence together, he concedes that both the prevalence and magnitude of backfire effects could have been overstated and that it will take careful work to figure out exactly when and how they come into play.

Read Engber’s full article. It covers a lot more territory, including some interesting history on how the idea spread.

I have added this to the growing catalogue of readings on my critical behavioural economics and behavioural science reading list. (Daniel Engber makes a few appearances.)

Benartzi (and Lehrer’s) The Smarter Screen: Surprising Ways to Influence and Improve Online Behavior

The replication crisis has ruined my ability to relax while reading a book built on social psychology foundations. The rolling sequence of interesting but small-sample and possibly not replicable findings leaves me somewhat on edge. Shlomo Benartzi’s (with Jonah Lehrer) The Smarter Screen: Surprising Ways to Influence and Improve Online Behavior (2015) is one such case.

Sure, I accept there is a non-zero probability that a 30 millisecond exposure to the Apple logo could make someone more creative than exposure to the IBM logo. Closing a menu after making my choice might make me more satisfied by giving me closure. Reading something in Comic Sans might lead me to think about it in a different way. But on net, most of these interesting results won’t hold up. Which? I don’t know.

That said, like a Malcolm Gladwell book, The Smarter Screen does have some interesting points and directed me to plenty of interesting material elsewhere. Just don’t bet your house on the parade of results being right.

The central thesis in The Smarter Screen is that since so many of our decisions are now made on screens, we should invest more time in designing these screens for better decision making. Agreed.

I saw Benartzi present about screen decision-making a few years ago, when he highlighted how some biases play out differently on screens compared to other mediums. For example, he suggested that defaults were less sticky on screens (we are quick to un-check the pre-checked box). While that particular example didn’t appear in The Smarter Screen, other examples followed a similar theme.

As a start, we read much faster on screens. Benartzi gives the example of a test with a written instruction at the front telling people not to answer the following questions. Experimental subjects suffered double the rate of failure when on a computer – up from around 20% to 46% – skipping over the instruction and answering questions they should not have answered.

People are also more truthful on screens. For instance, people report more health problems and drug use to screens. Men report fewer sexual partners, women more. We order pizza closer to our preferences (no embarrassment about those idiosyncratic tastes).

Screens can also exacerbate biases as the digital format allows for more extreme environments, such as massive ranges of products. The thousands of each type of pen on Amazon, or the maze of healthcare plans offered online, are typically not seen in stores or in hard copy.

The choice overload experienced on screens is a theme through the book, with many of Benartzi’s suggestions focused on making the choice manageable. Use categories to break up the choice. Use tournaments where small sets of comparisons are presented and the winners face off against each other (do you need to assume transitivity of preferences for this to work?). All sound suggestions worth trying.
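A choice tournament of the kind Benartzi suggests is straightforward to sketch (the group size and the stand-in chooser below are my own assumptions, not his):

```python
import random

def tournament(options, choose, group_size=4, seed=None):
    """Narrow a long list of options by repeated small comparisons.

    `choose` is whatever picks a favourite from a small set (in practice,
    the user). Winners of each small group face off in the next round until
    one option remains. Note this implicitly assumes reasonably transitive
    preferences, the question raised above.
    """
    rng = random.Random(seed)
    remaining = list(options)
    while len(remaining) > 1:
        rng.shuffle(remaining)
        remaining = [choose(remaining[i:i + group_size])
                     for i in range(0, len(remaining), group_size)]
    return remaining[0]

# Example: a stand-in chooser that always picks the cheapest pen.
pens = [{"name": f"pen {i}", "price": p}
        for i, p in enumerate([3.5, 2.0, 4.1, 1.8, 2.9, 3.2, 5.0, 2.4])]
print(tournament(pens, choose=lambda group: min(group, key=lambda o: o["price"])))
```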

One interesting complaint of Benartzi’s is about Amazon’s massive range. They have over 1,000 black roller-ball pens! An academic critiquing one of the world’s largest companies, built on offering massive choice (and with a reputation for A/B testing), should perhaps be treated with some circumspection. Maybe Amazon could be even bigger? (Interestingly, after critiquing Amazon for not allowing “closure” and reducing satisfaction by suggesting similar products after purchase, Benartzi notes that Amazon is already aware of this issue.)

The material on choice overload reflects Benartzi’s habit through the book of giving a relatively uncritical discussion of his preferred underlying literature. Common examples such as the jam experiment are trotted out, with no mention of the failed replications or the meta-analysis showing a mean effect of zero from changing the number of choices. Benartzi’s message that we need to test these ideas covers him to a degree, but a more sceptical reporting of the literature would have been helpful.

Some other sections have a similar shallowness. The material on subliminal advertising ignores the debates around it. Some of the cited studies have all the hallmarks of a spurious result, with multiple comparisons and effects only under specific conditions. For example, people are more likely to buy Mountain Dew if the Mountain Dew ad played at 10 times speed is preceded by an ad for a dissimilar product like a Honda. There is no effect when an ad for a (similar) Hummer is played first. Really?

Or take disfluency and the study by Adam Alter and friends. Forty students were exposed to two versions of the cognitive reflection task. A typical question in the cognitive reflection task is the following:

A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?
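(The intuitive answer of ten cents is wrong; writing out the arithmetic shows why:)

```latex
b + (b + 1.00) = 1.10 \;\Rightarrow\; 2b = 0.10 \;\Rightarrow\; b = 0.05
```

The ball costs five cents and the bat $1.05; the reflexive answer of ten cents would make the total $1.20.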

The two versions differed in that one used a small, light grey font that made the questions hard to read. Those exposed to the harder-to-read questions achieved higher scores. Exciting stuff.

But 16 replications involving a total of around 7,000 people found nothing (Terry Burnham discusses these replications in more detail here). Here’s how Benartzi deals with the replications:

It’s worth pointing out, however, that not every study looking at disfluent fonts gets similar results. For reasons that remain unclear, many experiments have found little to no effect when counterintuitive math problems, such as those in the CRT, are printed in hard-to-read letters. While people take longer to answer the questions, this extra time doesn’t lead to higher scores. Clearly, more research is needed.

What is Benartzi’s benchmark for accepting that a cute experimental result hasn’t stood up to further examination and that we can move on to more prospective research? Sixteen studies involving 7,000 people in total show no effect; one study with 40 people shows a result. The jury is still out?

One feeling I had at the end of the book was that the proposed solutions were “small”. Behavioural scientists are often criticised for proposing small solutions, which is generally unfair given the low cost of many of the interventions. The return on investment can be massive. But the absence of new big ideas at the close of the book raised the question (at least for me) of where the next big result will come from.

Benartzi was, of course, at the centre of one of the greatest triumphs in the application of behavioural science – the Save More Tomorrow plan he developed with Richard Thaler. Many of the other large successful applications of behavioural science rely on the same mechanism, defaults.

So when Benartzi’s closing idea is to create an app for smartphones to increase retirement saving, it feels slightly underwhelming. The app would digitally alter portraits of the user to make them look old and help relate them to their future self. The app would make saving effortless through pre-filled information and the like. Just click a button. But you first have to get people to download it. What is the marginal effect on these people already motivated enough to download the app? (Although here is some tentative evidence that at least among certain cohorts this effect is above zero.)

Other random thoughts:

  • One important thread through the book is the gap between identifying behaviours we want to change and changing them. Feedback is simply not enough. Think of a bathroom scale. It is cheap, available, accurate, and most people have a good idea of their weight. Bathroom scales haven’t stopped the increase in obesity.
  • Benartzi discusses the potential of query theory, which proposes that people arrive at decisions by asking themselves a series of internal questions. How can we shape decisions by posing the questions externally?
  • Benartzi references a study in which 255 students received an annual corporate report. One report was aesthetically pleasing, the other less attractive. Despite both reports containing the same information, the students gave a higher valuation for the company with the attractive report (more than double). Benartzi suggests the valuations should have been the same, but I am not sure. In the same way that wasteful advertising can be a signal that the brand has money and will stick around, the attractive report provides a signal about the company. If a company doesn’t have the resources to make its report look decent, how much should you trust the data and claims in it?
  • Does The Smarter Screen capture a short period where screens have their current level of importance? Think of ordering a pizza. Ten years ago we might have phoned, been given an estimated time of delivery and then waited. Today we can order our pizza on our smartphone, then watch it move through the process of construction, cooking and delivery. Shortly (if you’re not already doing this), you’ll simply order your pizza through your Alexa.
  • Benartzi discusses how we could test people through a series of gambles to determine their loss aversion score (a sketch of what such an elicitation might look like follows this list). When people later face decisions, an app with knowledge of their level of loss aversion could help guide their decision. I have a lot of doubt about the ability to get a specific, stable and useful measure of loss aversion for a particular person, and am a fan of the approach of Greg Davies to the bigger question of how we should consider attitudes to risk and short-term behavioural responses.
  • In the pointers at the end of one of the chapters, Benartzi asks “Are you trusting my advice too much? While there is a lot of research to back up my recommendations, it is equally important to test the actual user experience and choice quality and adjust the design accordingly.” Fair point!
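On the loss aversion point above, one common elicitation is to titrate 50/50 gambles: find the smallest gain at which someone accepts a coin flip against a fixed loss, and take the ratio. A minimal sketch, with the gamble sizes and the simulated respondent as my own assumptions:

```python
def loss_aversion_lambda(accepts, loss=100, gains=range(100, 401, 25)):
    """Estimate a loss aversion coefficient from 50/50 accept/reject choices.

    `accepts(gain, loss)` says whether the person accepts a coin flip that
    wins `gain` or loses `loss`. Lambda is approximated as the smallest
    acceptable gain divided by the loss.
    """
    for gain in gains:
        if accepts(gain, loss):
            return gain / loss
    return None  # never accepted within the tested range

# A simulated respondent who needs gains of at least twice the potential loss.
print(loss_aversion_lambda(lambda gain, loss: gain >= 2 * loss))  # 2.0
```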