People should use their judgment … except they’re often lousy at it
My Behavioral Scientist article, Don’t Touch The Computer, was in part a reaction to Andrew McAfee and Erik Brynjolfsson’s book The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies. In particular, I felt their story of freestyle chess as an illustration of how humans and machines can work together was somewhat optimistic.
I have just read McAfee and Brynjolfsson’s Machine, Platform, Crowd: Harnessing Our Digital Future. Chapter 2, titled The Hardest Thing to Accept About Ourselves, runs a line somewhat closer to mine. Here are some snippets:
[L]et people develop and exercise their intuition and judgment in order to make smart decisions, while the computers take care of the math and record keeping. We’ve heard about and seen this division of labor between minds and machines so often that we call it the “standard partnership.”
The standard partnership is compelling, but sometimes it doesn’t work very well at all. Getting rid of human judgments altogether—even those from highly experienced and credentialed people—and relying solely on numbers plugged into formulas, often yields better results.
Here’s one example:
Sociology professor Chris Snijders used 5,200 computer equipment purchases by Dutch companies to build a mathematical model predicting adherence to budget, timeliness of delivery, and buyer satisfaction with each transaction. He then used this model to predict these outcomes for a different set of transactions taking place across several different industries, and also asked a group of purchasing managers in these sectors to do the same. Snijders’s model beat the managers, even the above-average ones. He also found that veteran managers did no better than newbies, and that, in general, managers did no better looking at transactions within their own industry than at distant ones.
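To make the clinical-versus-statistical comparison concrete, here is a minimal sketch of the general approach: fit a simple model on past transactions, then compare its out-of-sample error against expert forecasts on new transactions. This is not Snijders’s actual model; the features, the linear specification and the simulated “expert” noise are all assumptions made purely for illustration.

```python
# Illustrative sketch of statistical vs. clinical prediction.
# All data is simulated; features and noise levels are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Simulated historical transactions: crude features (e.g. order size,
# supplier track record, contract complexity) and an outcome such as
# days of delivery delay.
X_past = rng.normal(size=(5200, 3))
delay_past = 2.0 + X_past @ np.array([1.5, -2.0, 0.8]) + rng.normal(scale=3.0, size=5200)

model = LinearRegression().fit(X_past, delay_past)

# New transactions, plus hypothetical expert forecasts that track the truth
# but with more noise than the model.
X_new = rng.normal(size=(500, 3))
delay_new = 2.0 + X_new @ np.array([1.5, -2.0, 0.8]) + rng.normal(scale=3.0, size=500)
expert_forecast = delay_new + rng.normal(scale=6.0, size=500)

print("model MAE: ", mean_absolute_error(delay_new, model.predict(X_new)))
print("expert MAE:", mean_absolute_error(delay_new, expert_forecast))
```

Under these assumptions the simple formula wins, which is the point of the studies McAfee and Brynjolfsson cite: a crude but consistent model is hard for noisy human judgment to beat.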
This is a general finding:
A team led by psychologist William Grove went through 50 years of literature looking for published, peer-reviewed examples of head-to-head comparisons of clinical and statistical prediction (that is, between the judgment of experienced, “expert” humans and a 100% data-driven approach) in the areas of psychology and medicine. They found 136 such studies, covering everything from prediction of IQ to diagnosis of heart disease. In 48% of them, there was no significant difference between the two; the experts, in other words, were on average no better than the formulas. A much bigger blow to the notion of human superiority in judgment came from the finding that in 46% of the studies considered, the human experts actually performed significantly worse than the numbers and formulas alone. This means that people were clearly superior in only 6% of cases. And the authors concluded that in almost all of the studies where humans did better, “the clinicians received more data than the mechanical prediction.”
Despite this victory for the algorithms, it still seems a good idea to check their output.
In many cases … it’s a good idea to have a person check the computer’s decisions to make sure they make sense. Thomas Davenport, a longtime scholar of analytics and technology, calls this taking a “look out of the window.” The phrase is not simply an evocative metaphor. It was inspired by an airline pilot he met who described how he relied heavily on the plane’s instrumentation but found it essential to occasionally visually scan the skyline himself.
But …
As companies adopt this approach, though, they will need to be careful. Because we humans are so fond of our judgment, and so overconfident in it, many of us, if not most, will be too quick to override the computers, even when their answer is better. But Chris Snijders, who conducted the research on purchasing managers’ predictions highlighted earlier in the chapter, found that “what you usually see is [that] the judgment of the aided experts is somewhere in between the model and the unaided expert. So the experts get better if you give them the model. But still the model by itself performs better.”
So, measure which is best:
We support having humans in the loop for exactly the reasons that Meehl and Davenport described, but we also advocate that companies “keep score” whenever possible—that they track the accuracy of algorithmic decisions versus human decisions over time. If the human overrides do better than the baseline algorithm, things are working as they should. If not, things need to change, and the first step is to make people aware of their true success rate.
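One way to picture this “keep score” idea is a log of decisions recording what the algorithm recommended, what was actually done after any human override, and what turned out (with hindsight) to be the right call. The sketch below is mine, not the book’s, and the field names and scoring rule are illustrative assumptions.

```python
# A minimal "keep score" sketch: compare the algorithm's baseline accuracy
# with the accuracy of the final, possibly overridden, decisions.
# Field names and the toy records are illustrative only.
from dataclasses import dataclass

@dataclass
class DecisionRecord:
    algorithm_choice: str   # what the model recommended
    final_choice: str       # what was actually done, after any human override
    best_choice: str        # judged with hindsight from the realised outcome

def score(records: list[DecisionRecord]) -> None:
    overrides = [r for r in records if r.final_choice != r.algorithm_choice]
    algo_hits = sum(r.algorithm_choice == r.best_choice for r in records)
    final_hits = sum(r.final_choice == r.best_choice for r in records)
    print(f"algorithm alone correct: {algo_hits}/{len(records)}")
    print(f"with human overrides:    {final_hits}/{len(records)}")
    print(f"overrides made:          {len(overrides)}")
    # If final_hits < algo_hits, the overrides are hurting, and the first
    # step is to make people aware of their true success rate.

score([
    DecisionRecord("approve", "approve", "approve"),
    DecisionRecord("reject", "approve", "reject"),   # an override that backfired
    DecisionRecord("approve", "approve", "reject"),
])
```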
Accept that the result will often be to defer to the algorithm:
Most of us have a lot of faith in human intuition, judgment, and decision-making ability, especially our own …. But the evidence on this subject is so clear as to be overwhelming: data-driven, System 2 decisions are better than those that arise out of our brains’ blend of System 1 and System 2 in the majority of cases where both options exist. It’s not that our decisions and judgment are worthless; it’s that they can be improved on. The broad approaches we’ve seen here—letting algorithms and computer systems make the decisions, sometimes with human judgment as an input, and letting the people override them when appropriate—are ways to do this.
And from the chapter summary:
The evidence is overwhelming that, whenever the option is available, relying on data and algorithms alone usually leads to better decisions and forecasts than relying on the judgment of even experienced and “expert” humans.
Many decisions, judgments, and forecasts now made by humans should be turned over to algorithms. In some cases, people should remain in the loop to provide commonsense checks. In others, they should be taken out of the loop entirely.
In other cases, subjective human judgments should still be used, but in an inversion of the standard partnership: the judgments should be quantified and included in quantitative analyses.
…
Algorithms are far from perfect. If they are based on inaccurate or biased data, they will make inaccurate or biased decisions. These biases can be subtle and unintended. The criterion to apply is not whether the algorithms are flawless, but whether they outperform the available alternatives on the relevant metrics, and whether they can be improved over time.
As for the remainder of the book, I have mixed views. I enjoyed the chapters on machines. The four chapters on platforms and the first two on crowds were less interesting, and much could have been written five years ago (e.g. the stories on Wikipedia, Linux and two-sided platforms). The closing two chapters on crowds, which discussed decentralisation, complete contracts and the future of the firm, were, however, excellent.