Human-AI collaboration: is it better when the human is asleep at the wheel?

Author

Jason Collins

Published

December 5, 2024

In his book Co-Intelligence: Living and Working with AI, Ethan Mollick describes an experiment by Fabrizio Dell’Acqua:

In a different paper, Fabrizio Dell’Acqua shows why relying too much on AI can backfire. He found that recruiters who used high-quality AI became lazy, careless, and less skilled in their own judgment. They missed out on some brilliant applicants and made worse decisions than recruiters who used low-quality AI or no AI at all.

He hired 181 professional recruiters and gave them a tricky task: to evaluate 44 job applications based on their math ability. The data came from an international test of adult skills, so the math scores were not obvious from the résumés. Recruiters were given different levels of AI assistance: some had good or bad AI support, and some had none. He measured how accurate, how fast, how hardworking, and how confident they were.

Recruiters with higher-quality AI were worse than recruiters with lower-quality AI. They spent less time and effort on each résumé, and blindly followed the AI recommendations. They also did not improve over time. On the other hand, recruiters with lower-quality AI were more alert, more critical, and more independent. They improved their interaction with the AI and their own skills. Dell’Acqua developed a mathematical model to explain the trade-off between AI quality and human effort. When the AI is very good, humans have no reason to work hard and pay attention. They let the AI take over instead of using it as a tool, which can hurt human learning, skill development, and productivity. He called this “falling asleep at the wheel.”

What are the costs of falling asleep at the wheel? We can get a better sense from the as-yet-unpublished working paper. (I expect it will be published somewhere reasonably prestigious before too long.)

The “bad AI” had an accuracy of 75%. The “good AI” had an accuracy of 85%. What was the accuracy of the recruiters? My reading of Table 4 of the paper is that, without any AI support, the recruiters had 72.3% accuracy. With the bad AI it was 75.4%, and with the good AI it was 74.4%. So on the raw outcome, accuracy was about the same across all three treatments. (I’ve based these calculations on the assumption that the results in columns (1) and (2) come from a linear regression. The author could have used a logistic regression given the binary dependent variable, but the magnitude of the coefficients suggests that isn’t the case.)
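To make the arithmetic behind that reading explicit, here is a minimal sketch of how I back the group accuracies out, assuming the columns are linear probability models in which the constant is the no-AI mean and the treatment coefficients are the shifts implied by the accuracies I quote above (the coefficient values below are those implied figures, not numbers copied from the paper):

```python
# Back-of-the-envelope reading of Table 4, assuming columns (1) and (2)
# are linear probability models (accuracy regressed on treatment dummies).
intercept = 0.723       # no-AI (control) mean accuracy
coef_bad_ai = 0.031     # implied treatment effect of the "bad" (75%-accurate) AI
coef_good_ai = 0.021    # implied treatment effect of the "good" (85%-accurate) AI

accuracy = {
    "no AI": intercept,
    "bad AI (75%)": intercept + coef_bad_ai,
    "good AI (85%)": intercept + coef_good_ai,
}

for condition, acc in accuracy.items():
    print(f"{condition:>14}: {acc:.1%}")
```

In a linear probability model the constant plus a treatment coefficient is simply that treatment group’s mean, so the implied recruiter accuracies sit in the low-to-mid 70s for all three groups. Had the columns been logistic regressions, the coefficients would be log-odds and much larger in magnitude for effects of this size, which is why I read them as linear.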

But think about what we could achieve if we just eliminated the recruiters. The recruiters with the bad AI add nothing: their deviations from the AI recommendations are a wash. The recruiters with the good AI intervene enough to degrade performance from 85% to 74%. If anything, the recruiters with the good AI aren’t asleep enough! If they didn’t do a thing, they’d get 85% accuracy. Instead of thinking about how to get recruiters to pay attention, we should remove them from simple prediction tasks like this.

The underperformance of human-AI combinations is a common theme across the literature on statistical versus human prediction. Combine a human with a good algorithm and you will improve the human’s performance. However, the combination’s performance will still fall below that of the algorithm alone. (I’ve written on this previously in Behavioural Scientist.)

Before moving on, I should point out that the measure of decision quality highlighted in the preregistration was not the decision to interview the candidate, but rather a measure of confidence. (I can’t check this as the preregistration is embargoed.) For each decision, recruiters were asked to rate their confidence on a 1 to 5 scale. On this confidence measure, the recruiters with the bad AI performed significantly better than those with the good AI (scraping under the 0.05 threshold once some additional controls are added). I can’t confirm from the paper whether the good AI model would outperform the recruiters on this measure, or even whether it generates measures of confidence, but I’d be very surprised if a model wouldn’t.

Understanding AI quality

Having said the above, it’s unclear what is driving the headline result and whether it would replicate. The recruiters were given one of the following descriptions. For the good AI:

The AI tool that will support you has been performing very well in prior analysis and we have been very pleased with the candidates selected. However, it made a few mistakes for candidates that were close calls.

We reviewed the algorithm’s recommendations using performance data, and we found that the vast majority of AI’s recommendations about whether to interview a candidate or not were correct (about 85% of cases).

Those who were given the bad AI read:

The AI tool that will support you has been performing well in prior analysis and we have been pleased with the candidates selected. However, it made some mistakes for candidates that were close calls.

We reviewed the algorithm’s recommendations using performance data, and we found that the large majority of AI’s recommendations about whether to interview a candidate or not were correct (about 75% of cases).

The differences: the AI performed “very well” versus “well”, the developers were “very pleased” versus “pleased”, the AI made “a few” versus “some” mistakes, the “vast majority” versus the “large majority” of recommendations were correct, and the stated accuracy was “85%” versus “75%”.

Which change or changes are driving the effect? We can’t tell. I lean toward the accuracy numbers having no effect. People tend not to react differently to numbers in those ranges. The words might be doing something, but which ones? The framing of the mistakes? How pleased the developers are?

Further, is the problem identified in the experiment giving people a good AI, or describing it poorly? We could easily have used the bad AI description (except for the number, which I predict has no effect) for the good AI. Would we still see an effect if we did? If not, we’ve solved the asleep-at-the-wheel problem!

I’d love to see a replication of this experiment, mixing the combinations of words to better understand what people are hooking into; one possible design is sketched below. My hunch is that the differences in accuracy wouldn’t replicate. This paper has some of the classic signs: a weak treatment and a p-value scraping under 0.05. (I’m more optimistic about the time and effort results.)
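As a rough illustration of what “mixing the combinations of words” could look like, here is a sketch that crosses the five wording differences between the two descriptions into a full factorial design. The factor names are my own labels, not anything taken from the paper or its preregistration:

```python
# Cross the five wording differences between the "good AI" and "bad AI"
# descriptions to see which elements (if any) drive the effect.
from itertools import product

factors = {
    "performance": ["very well", "well"],
    "satisfaction": ["very pleased", "pleased"],
    "mistakes": ["a few mistakes", "some mistakes"],
    "majority": ["vast majority", "large majority"],
    "stated_accuracy": ["85%", "75%"],
}

# The full crossing gives 2^5 = 32 description variants.
conditions = [dict(zip(factors, combo)) for combo in product(*factors.values())]

print(f"{len(conditions)} description variants")
print(conditions[0])   # the original "good AI" wording
print(conditions[-1])  # the original "bad AI" wording
```

A full crossing needs 32 cells, so in practice a replication might vary only the elements of most interest (say, the accuracy number versus the evaluative wording), but the point is the same: separate the number from the words to see what, if anything, recruiters are actually responding to.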