AI in medicine: Outperforming humans since the 1970s

From an interesting a16z podcast episode, Putting AI in Medicine, in Practice (I hope I have attributed each quote to the correct speaker):

Mintu Turakhia (cardiologist at Stanford and Director of the Center for Digital Health): AI is not new to medicine. Automated systems in healthcare have been described since the 1960s. And they went through various iterations of expert systems and neural networks and were called many different things.

Hanne Tidnam: In what way would those show up in the 60s and 70s?

Mintu Turakhia: So at that time there was no high resolution, there weren’t too many sensors, and it was about a synthetic brain that could take what a patient describes as the inputs and what a doctor finds on the exam as the inputs.

Hanne Tidnam: Using verbal descriptions?

Mintu Turakhia: Yeah, basically words. People created, you know, what are called ontologies and classification structures. But you put in the ten things you felt and a computer would spit out the top 10 diagnoses in order of probability and even back then, they were outperforming sort of average physicians. So this is not a new concept.

This point about “average physicians” is interesting. In some circumstances you might be able to find someone who outperforms the AI. The truly extraordinary doctor. But most people are not treated by that star.

They continue:

Brandon Ballinger (CEO and founder of Cardiogram): So an interesting case study is the Mycin system which is from 1978 I believe. And so, this was an expert system trained at Stanford. It would take inputs that were just typed in manually and it would essentially try to predict what a pathologist would show. And it was put to the test against five pathologists. And it beat all five of them.

Hanne Tidnam: And it was already outperforming.

Brandon Ballinger: And it was already outperforming doctors, but when you go to the hospital they don’t use Mycin or anything similar. And I think this illustrates that sometimes the challenge isn’t just the technical aspects or the accuracy. It’s the deployment path, and so some of the issues around there are, OK, is there a convenient way to deploy this to actual physicians? Who takes the risk? What’s the financial model for reimbursement? And so if you look at the way the financial incentives work there are some things that are backwards, right. For example, if you think about kind of a hospital from the CFO’s perspective, misdiagnosis actually earns them more money, because when you misdiagnose you do follow-up tests, right, and our billing system is fee for service, so every little test that’s done is billed for.

Hanne Tidnam: But nobody wants to be giving out wrong diagnoses. So where is the incentive? The incentive is just in the system, the money that results from it.

Brandon Ballinger: No one wants to give an incorrect diagnosis. On the other hand, there’s no budget to invest in making better diagnoses. And so I think that’s been part of the problem. And things like fee for value are interesting because now you’re paying people for, say, an accurate diagnosis, or for a reduction in hospitalisations, depending on the exact system. So I think that’s a case where accuracy is rewarded with greater payment, which sets up the incentives so that AI can actually win in this circumstance.

Vijay Pande (a16z General Partner): Where I think AI has come back at us with force is that it came to healthcare as a hammer looking for a nail. What we’re trying to figure out is where you can implement it easily and safely, with not too much friction and with not a lot of physicians going crazy, and where it’s going to be very, very hard.

For better diagnoses, I’d be willing to drive a few physicians crazy.

The section on the types of error was also interesting:

Mintu Turakhia: There may be a point that it truly outperforms the cognitive abilities of physicians, and we have seen that with imaging so far. And some of the most promising aspects of the imaging studies and the EKG studies are that the confusion matrices, the way humans misclassify things, is recapitulated by the convolutional neural networks. …

A confusion matrix is a way to graph the errors and which directions they go. And so for rhythms on an EKG, a rhythm that’s truly atrial fibrillation could get classified as normal sinus rhythm, or atrial tachycardia, or supraventricular tachycardia; the names are not important. What’s important is that the algorithms are making the same types of mistakes that humans make. It’s not that it’s making a mistake that’s necessarily more lethal, or just nonsensical so to speak. It recapitulates humans. And to me that’s the core thesis of AI in medicine, because if you can show that you are recapitulating human error, you’re not going to make it perfect, but that tells you that, in check and with control, you can allow this to scale safely since it’s liable to do what humans do. …

Hanne Tidnam: And so you’re just saying it doesn’t have to be better. It just has to be making the same kinds of mistakes to feel that you can trust the decision maker.

Mintu Turakhia: Right. And you dip your toe in the water by having it be assistive. And then at some point we as a society will decide if it can go fully auto, right, fully autonomous without a doctor in the loop. That’s a societal issue. That’s not a technical hurdle at this point.

That is a heavy bias toward the status quo. I’d prefer something with better net performance even if some of the mistakes it makes are different.
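
To make the confusion matrix point concrete, here is a minimal sketch (my illustration, not anything from the podcast) of how misclassification counts are tallied for a handful of EKG rhythm labels. The labels and predictions are made up; the point is that the off-diagonal cells show which direction each error goes, which is what you would compare against the pattern of human misreads.

```python
# A minimal, hypothetical sketch of a confusion matrix for EKG rhythm labels.
from collections import Counter

RHYTHMS = ["atrial fibrillation", "sinus rhythm", "atrial tachycardia", "SVT"]

# Hypothetical (true label, predicted label) pairs from a classifier.
pairs = [
    ("atrial fibrillation", "atrial fibrillation"),
    ("atrial fibrillation", "atrial tachycardia"),  # a plausible, human-like confusion
    ("atrial fibrillation", "sinus rhythm"),
    ("sinus rhythm", "sinus rhythm"),
    ("SVT", "atrial tachycardia"),
]

counts = Counter(pairs)

# Rows are the true rhythm, columns the predicted rhythm. The off-diagonal
# cells show which direction each error goes - the pattern you compare with
# how human readers misclassify the same tracings.
print("true \\ predicted".ljust(24) + "".join(r[:14].ljust(16) for r in RHYTHMS))
for true_label in RHYTHMS:
    row = true_label[:22].ljust(24)
    row += "".join(str(counts[(true_label, pred)]).ljust(16) for pred in RHYTHMS)
    print(row)
```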

Is there a “backfire effect”?

I saw the answer hinted at in a paper released in the middle of last year (covered on WNYC), but Daniel Engber has now put together a more persuasive case:

Ten years ago last fall, Washington Post science writer Shankar Vedantam published an alarming scoop: The truth was useless.

His story started with a flyer issued by the Centers for Disease Control and Prevention to counter lies about the flu vaccine. The flyer listed half a dozen statements labeled either “true” or “false”—“Not everyone can take flu vaccine,” for example, or “The side effects are worse than the flu” —along with a paragraph of facts corresponding to each one. Vedantam warned the flyer’s message might be working in reverse. When social psychologists had asked people to read it in a lab, they found the statements bled together in their minds. Yes, the side effects are worse than the flu, they told the scientists half an hour later. That one was true—I saw it on the flyer.

This wasn’t just a problem with vaccines. According to Vedantam, a bunch of peer-reviewed experiments had revealed a somber truth about the human mind: Our brains are biased to believe in faulty information, and corrections only make that bias worse.

These ideas, and the buzzwords that came with them—filter bubbles, selective exposure, and the backfire effect—would be cited, again and again, as seismic forces pushing us to rival islands of belief.

Fast forward a few years:

When others tried to reproduce the research [Ian Skurnik’s vaccine research], though, they didn’t always get the same result. Kenzie Cameron, a public health researcher and communications scholar at Northwestern’s Feinberg School of Medicine, tried a somewhat similar experiment in 2009. … “We found no evidence that presenting both facts and myths is counterproductive,” Cameron concluded in her paper, which got little notice when it was published in 2013.

There have been other failed attempts to reproduce the Skurnik, Yoon, and Schwarz finding. For a study that came out last June, Briony Swire, Ullrich Ecker, and “Debunking Handbook” co-author Stephan Lewandowsky showed college undergrads several dozen statements of ambiguous veracity (e.g. “Humans can regrow the tips of fingers and toes after they have been amputated”).  … But the new study found no sign of this effect.

And on science done right (well done Brendan Nyhan and Jason Reifler):

Brendan Nyhan and Jason Reifler described their study, called “When Corrections Fail,” as “the first to directly measure the effectiveness of corrections in a realistic context.” Its results were grim: When the researchers presented conservative-leaning subjects with evidence that cut against their prior points of view—that there were no stockpiled weapons in Iraq just before the U.S. invasion, for example—the information sometimes made them double-down on their pre-existing beliefs. …

He [Tom Wood] and [Ethan] Porter decided to do a blow-out survey of the topic. Instead of limiting their analysis to just a handful of issues—like Iraqi WMDs, the safety of vaccines, or the science of global warming—they tried to find backfire effects across 52 contentious issues. … They also increased the sample size from the Nyhan-Reifler study more than thirtyfold, recruiting more than 10,000 subjects for their five experiments.

In spite of all this effort, and to the surprise of Wood and Porter, the massive replication effort came up with nothing. That’s not to say that Wood and Porter’s subjects were altogether free of motivated reasoning.

The people in the study did give a bit more credence to corrections that fit with their beliefs; in those situations, the new information led them to update their positions more emphatically. But they never showed the effect that made the Nyhan-Reifler paper famous: People’s views did not appear to boomerang against the facts. Among the topics tested in the new research—including whether Saddam had been hiding WMDs—not one produced a backfire.

Nyhan and Reifler, in particular, were open to the news that their original work on the subject had failed to replicate. They ended up working with Wood and Porter on a collaborative research project, which came out last summer, and again found no sign of backfire from correcting misinformation. (Wood describes them as “the heroes of this story.”) Meanwhile, Nyhan and Reifler have found some better evidence of the effect, or something like it, in other settings. And another pair of scholars, Brian Schaffner and Cameron Roche, showed something that looks a bit like backfire in a recent, very large study of how Republicans and Democrats responded to a promising monthly jobs report in 2012. But when Nyhan looks at all the evidence together, he concedes that both the prevalence and magnitude of backfire effects could have been overstated and that it will take careful work to figure out exactly when and how they come into play.

Read Engber’s full article. It covers a lot more territory, including some interesting history on how the idea spread.

I have added this to the growing catalogue of readings on my critical behavioural economics and behavioural science reading list. (Daniel Engber makes a few appearances.)

Benartzi’s (and Lehrer’s) The Smarter Screen: Surprising Ways to Influence and Improve Online Behavior

The replication crisis has ruined my ability to relax while reading a book built on social psychology foundations. The rolling sequence of interesting but small-sample and possibly non-replicable findings leaves me somewhat on edge. Shlomo Benartzi’s (with Jonah Lehrer) The Smarter Screen: Surprising Ways to Influence and Improve Online Behavior (2015) is one such case.

Sure, I accept there is a non-zero probability that a 30 millisecond exposure to the Apple logo could make someone more creative than exposure to the IBM logo. Closing a menu after making my choice might make me more satisfied by giving me closure. Reading something in Comic Sans might lead me to think about it in a different way. But on net, most of these interesting results won’t hold up. Which? I don’t know.

That said, like a Malcolm Gladwell book, The Smarter Screen does have some interesting points and directed me to plenty of interesting material elsewhere. Just don’t bet your house on the parade of results being right.

The central thesis in The Smarter Screen is that since so many of our decisions are now made on screens, we should invest more time in designing these screens for better decision making. Agreed.

I saw Benartzi present about screen decision-making a few years ago, when he highlighted how some biases play out differently on screens compared to other mediums. For example, he suggested that defaults were less sticky on screens (we are quick to un-check the pre-checked box). While that particular example didn’t appear in The Smarter Screen, other examples followed a similar theme.

As a start, we read much faster on screens. Benartzi gives the example of a test with a written instruction at the front telling participants not to answer the following questions. Experimental subjects on a computer failed at double the rate – up from around 20% to 46% – skipping over the instruction and answering questions they should not have.

People are also more truthful on screens. For instance, people report more health problems and drug use to screens. Men report fewer sexual partners, women more. We order pizza closer to our preferences (no embarrassment about those idiosyncratic tastes).

Screens can also exacerbate biases as the digital format allows for more extreme environments, such as massive ranges of products. The thousands of each type of pen on Amazon or the maze of healthcare plans on HealthCare.gov are typically not seen in stores or in hard copy.

The choice overload experienced on screens is a theme through the book, with many of Benartzi’s suggestions focused on making the choice manageable. Use categories to break up the choice. Use tournaments where small sets of comparisons are presented and the winners face off against each other (do you need to assume transitivity of preferences for this to work?). All sound suggestions worth trying.
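
As a rough sketch of the tournament idea (my illustration, not a design from the book), the following shows options presented in small batches, with the pick from each batch advancing to the next round. Note that it quietly relies on preferences being reasonably transitive; the choose() function is a stand-in for a real user's selection.

```python
# A rough, hypothetical sketch of a choice tournament: options are shown in
# small batches, the pick from each batch advances, and rounds repeat until
# one option remains.
import random

def choose(batch):
    # Stand-in for the user picking a favourite from a small set of options.
    return max(batch, key=lambda option: option["appeal"])

def tournament(options, batch_size=4):
    remaining = list(options)
    while len(remaining) > 1:
        random.shuffle(remaining)
        winners = []
        for i in range(0, len(remaining), batch_size):
            winners.append(choose(remaining[i:i + batch_size]))
        remaining = winners
    return remaining[0]

# Hypothetical use: whittle 32 health plans down via small comparisons.
plans = [{"name": f"plan {i}", "appeal": random.random()} for i in range(32)]
print(tournament(plans)["name"])
```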

One interesting complaint of Benartzi’s is about Amazon’s massive range. They have over 1,000 black roller-ball pens! I am somewhat circumspect about an academic critiquing one of the world’s largest companies, built on offering massive choice (and with a reputation for A/B testing). Maybe Amazon could be even bigger? (Interestingly, after critiquing Amazon for not allowing “closure” and reducing satisfaction by suggesting similar products after purchase, Benartzi suggests Amazon already knows about this issue.)

The material on choice overload reflects Benartzi’s habit throughout the book of giving a relatively uncritical discussion of his preferred underlying literature. Common examples such as the jam experiment are trotted out, with no mention of the failed replications or of the meta-analysis showing that the mean effect of changing the number of choices is zero. Benartzi’s message that we need to test these ideas covers him to a degree, but a more sceptical reporting of the literature would have been helpful.

Some other sections have a similar shallowness. The material on subliminal advertising ignores the debates around it. Some of the cited studies have all the hallmarks of a spurious result, with multiple comparisons and effects only under specific conditions. For example, people are more likely to buy Mountain Dew if the Mountain Dew ad played at 10 times speed is preceded by an ad for a dissimilar product like a Honda. There is no effect when an ad for a (similar) Hummer is played first. Really?

Or take disfluency and the study by Adam Alter and friends. Forty students were exposed to two versions of the cognitive reflection task. A typical question in the cognitive reflection task is the following:

A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?
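
For anyone who hasn't seen the problem before, the intuitive answer is 10 cents, but working it through (my addition, not the book's) gives 5 cents:

```latex
b + (b + 1.00) = 1.10 \quad\Rightarrow\quad 2b = 0.10 \quad\Rightarrow\quad b = 0.05
```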

The two versions differed in that one used a small, light grey font that made the questions hard to read. Those exposed to the harder-to-read questions achieved higher scores. Exciting stuff!

But 16 replications involving a total of around 7,000 people found nothing (Terry Burnham discusses these replications in more detail here). Here’s how Benartzi deals with the replications:

It’s worth pointing out, however, that not every study looking at disfluent fonts gets similar results. For reasons that remain unclear, many experiments have found little to no effect when counterintuitive math problems, such as those in the CRT, are printed in hard-to-read letters. While people take longer to answer the questions, this extra time doesn’t lead to higher scores. Clearly, more research is needed.

What is Benartzi’s benchmark for accepting that a cute experimental result hasn’t stood up to further examination and that we can move on to more prospective research? Sixteen studies involving 7,000 people in total showing no effect, one study with 40 people showing a result. The jury is still out?

One feeling I had at the end of the book was that the proposed solutions were “small”. Behavioural scientists are often criticised for proposing small solutions, which is generally unfair given the low cost of many of the interventions. The return on investment can be massive. But the absence of new big ideas at the close of the book raised the question (at least for me) of where the next big result will come from.

Benartzi was, of course, at the centre of one of the greatest triumphs in the application of behavioural science – the Save More Tomorrow plan he developed with Richard Thaler. Many of the other large successful applications of behavioural science rely on the same mechanism, defaults.

So when Benartzi’s closing idea is to create an app for smartphones to increase retirement saving, it feels slightly underwhelming. The app would digitally alter portraits of the user to make them look old and help relate them to their future self. The app would make saving effortless through pre-filled information and the like. Just click a button. But you first have to get people to download it. What is the marginal effect on these people already motivated enough to download the app? (Although here is some tentative evidence that at least among certain cohorts this effect is above zero.)

Other random thoughts:

  • One important thread through the book is the gap between identifying behaviours we want to change and changing them. Feedback is simply not enough. Think of a bathroom scale. It is cheap, available, accurate, and most people have a good idea of their weight. Bathroom scales haven’t stopped the increase in obesity.
  • Benartzi discusses the potential of query theory, which proposes that people arrive at decisions by asking themselves a series of internal questions. How can we shape decisions by posing the questions externally?
  • Benartzi references a study in which 255 students received an annual corporate report. One report was aesthetically pleasing, the other less attractive. Despite both reports containing the same information, the students gave a higher valuation for the company with the attractive report (more than double). Benartzi suggests the valuations should have been the same, but I am not sure. In the same way that wasteful advertising can be a signal that the brand has money and will stick around, the attractive report provides a signal about the company. If a company doesn’t have the resources to make its report look decent, how much should you trust the data and claims in it?
  • Does The Smarter Screen capture a short period where screens have their current level of importance? Think of ordering a pizza. Ten years ago we might have phoned, been given an estimated time of delivery and then waited. Today we can order our pizza on our smartphone, then watch it move through the process of construction, cooking and delivery. Shortly (if you’re not already doing this), you’ll simply order your pizza through your Alexa.
  • Benartzi discusses how we could test people through a series of gambles to determine their loss aversion score (a rough sketch of what such a score might look like follows this list). When people later face decisions, an app with knowledge of their level of loss aversion could help guide their decision. I have a lot of doubt about the ability to get a specific, stable and useful measure of loss aversion for a particular person, and am a fan of the approach of Greg Davies to the bigger question of how we should consider attitudes to risk and short-term behavioural responses.
  • In the pointers at the end of one of the chapters, Benartzi asks “Are you trusting my advice too much? While there is a lot of research to back up my recommendations, it is equally important to test the actual user experience and choice quality and adjust the design accordingly.” Fair point!
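
Here is the rough sketch promised above (my illustration, not the book's method) of how a loss aversion score might be derived from a series of 50/50 gambles. The function name and example answers are hypothetical.

```python
# A minimal, hypothetical sketch of scoring loss aversion from 50/50 gambles.
# Each gamble offers a gain G or a loss L with equal probability; under a
# simple linear model someone accepts when 0.5*G - 0.5*lambda*L > 0, i.e.
# when G/L exceeds their lambda. The smallest gain-to-loss ratio they accept
# is a rough estimate of lambda.

def loss_aversion_score(responses):
    """responses: list of (gain, loss, accepted) tuples, with loss as a positive number."""
    accepted_ratios = [gain / loss for gain, loss, accepted in responses if accepted]
    if not accepted_ratios:
        return None  # rejected every gamble offered; lambda is above the tested range
    return min(accepted_ratios)

# Hypothetical answers: this person only accepts once the gain is ~2x the loss.
answers = [(100, 100, False), (150, 100, False), (200, 100, True), (250, 100, True)]
print(loss_aversion_score(answers))  # 2.0
```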

Best books I read in 2017

The best books I read in 2017 – generally released in other years – are below (in no particular order). Where I have reviewed a book, the link leads to that review.

Don Norman’s The Design of Everyday Things (2013): In a world where so much attention is on technology, a great discussion of the need to consider the psychology of the users.
David Epstein’s The Sports Gene: Inside the Science of Extraordinary Athletic Performance (2013): The best examination of nature versus nurture as it relates to performance that I have read. I will write about The Sports Gene some time in 2018.
Cathy O’Neil’s Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (2016) – Although O’Neil is too quick to turn back to all-too-flawed humans as the solution to problematic algorithms, her critique has bite.
Kasparov’s Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins (2017) – Deep Thinking does not contain much deep analysis of human versus machine intelligence, but the story of Kasparov’s battle against Deep Blue is worth reading.
Gerd Gigerenzer, Peter Todd and the ABC Research Group’s Simple Heuristics That Make Us Smart (1999) – A re-read for me (and now a touch dated), but a book worth revisiting.
Pedro Domingos’ The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World (2015) – On the list for the five excellent chapters on the various “tribes” of machine learning. The rest is either techno-Panglossianism or beyond my domain of expertise to assess.
Christian and Griffiths’s Algorithms to Live By: The Computer Science of Human Decisions (2016) – An excellent analysis of decision making, with the benchmark the solutions of computer science. As they say, “the best algorithms are all about doing what makes the most sense in the least amount of time, which by no means involves giving careful consideration to every factor and pursuing every computation to the end.”
William Finnegan’s Barbarian Days: A Surfing Life – Simply awesome, although I suspect it is of more interest to surfers (that said, it did win a Pulitzer). I also read a lot of great fiction during the year. Fahrenheit 451 and The Dice Man were among those I enjoyed the most.

Psychology as a knight in shining armour, and other thoughts by Paul Ormerod on Thaler’s Misbehaving

I have been meaning to write some notes on Richard Thaler’s Misbehaving: The Making of Behavioral Economics for some time, but having now come across a review by Paul Ormerod (ungated pdf) – together with his perspective on the position of behavioural economics in the discipline – I feel somewhat less need. Below are some interesting sections of Ormerod’s review.

First, on the incorporation of psychology into economics:

With a few notable exceptions, psychologists themselves have not engaged with the area. ‘Behavioral economics has turned out to be primarily a field in which economists read the work of psychologists and then go about their business of doing research independently’ (p. 179). One reason for this which Thaler gives is that few psychologists have any attachment to the rational choice model, so studying deviations from it is not interesting. Another is that ‘the study of “applied” problems in psychology has traditionally been considered a low status activity’ (p. 180).

It is fashionable in many social science circles to deride economics, and to imagine that if only these obstinate and ideological economists would import social science theories into the discipline, all would be well. All manner of things would be well, for somehow these theories would not only be scientifically superior, but their policy implications would lead to the disappearance of all sorts of evils, such as austerity and even neo-liberalism itself. This previous sentence deliberately invokes a caricature, but one which will be all too recognisable to economists in Anglo-Saxon universities who have dealings with their colleagues in the wider social sciences.

A recent article in Science (Open Science Collaboration 2015) certainly calls into question whether psychology can perform this role of knight in shining armour. A team of no fewer than 270 co-authors attempted to replicate the results of 100 experiments published in leading psychology journals. … [O]nly 36 per cent of the attempted replications led to results which were statistically significant. Further, the average size of the effects found in the replicated studies was only half that reported in the original studies. …

Either the original or the replication work could be flawed, or crucial differences between the two might be unappreciated. … So the strategy adopted by behavioural economists of choosing for themselves which bits of psychology to use seems eminently sensible.

On generalising behavioural economics:

The empirical results obtained in behavioural economics are very interesting and some, at least, seem to be well established. But the inherent indeterminacy discussed above is the main reason for unease with the area within mainstream economics. Alongside Misbehaving, any economist interested in behavioural economics should read the symposium on bounded rationality in the June 2013 edition of the Journal of Economic Literature. …

In a paper titled ‘Bounded-Rationality Models: Tasks to Become Intellectually Competitive’, Harstad and Selten make a key point that although models have been elaborated which incorporate insights of boundedly rational behaviour, ‘the collection of alternative models has made little headway supplanting the dominant paradigm’ (2013, p. 496). Crawford’s symposium paper notes that ‘in most settings, there is an enormous number of logically possible models… that deviate from neoclassical models. In attempting to improve upon neoclassical models, it is essential to have some principled way of choosing among alternatives’ (2013, p. 524). He continues further on the same page ‘to improve on a neoclassical model, one must identify systematic deviations; otherwise one would do better to stick with a noisier neoclassical model’.

Rabin is possibly the most sympathetic of the symposium authors, noting for example that ‘many of the ways humans are less than fully rational are not because the right answers are so complex. They are instead because the wrong answers are so enticing’ (2013, p. 529). Rabin does go on, however, to state that ‘care should be taken to investigate whether the new models improve insight on average… in my view, many new models and explanations for experimental findings look artificially good and artificially insightful in the very limited domain to which they are applied’ (2013, p. 536). …

… Misbehaving does not deal nearly as well with the arguments that in many situations agents will learn to be rational. The arguments in the Journal of Economic Literature symposium both encompass and generalise this problem for behavioural economics. The authors accept without question that in many circumstances deviations from rationality are observed. However, no guidelines, no heuristics, are offered as to the circumstances in which systematic deviations might be expected, and circumstances where the rational model is still appropriate. Further, the theoretical models developed to explain some of the empirical findings in behavioural economics are very particular to the area of investigation, and do not readily permit generalisation.

On applying behavioural economics to policy:

In the final part (Part VIII) he discusses a modest number of examples where the insights of behavioural economics seem to have helped policymakers. He is at pains to point out that he is not trying to ‘replace markets with bureaucrats’ (p. 307). He discusses at some length the term he coined with Sunstein, ‘libertarian paternalism’. …

We might perhaps reflect on why it is necessary to invent this term at all. The aim of any democratic government is to improve the lot of the citizens who have elected it to power. A government may attempt to make life better for everyone, for the interest groups who voted for it, for the young, for the old, or for whatever division of the electorate which we care to name. But to do so, it has to implement policies that will lead to outcomes which are different from those which would otherwise have happened. They may succeed, they may fail. They may have unintended consequences, for good or for ill. By definition, government acts in paternalist ways. By the use of the word ‘libertarian’, Thaler could be seen as trying to distance himself from the world of the central planner.

… And yet the suspicion remains that the central planning mind set lurks beneath the surface. On page 324, for example, Thaler writes that ‘in our increasingly complicated world, people cannot be expected to have the experience to make anything close to the optimal decisions in all the domains in which they are forced to choose’. The implication is that behavioural economics both knows what is optimal for people and can help them get closer to the optimum.

Further, we read that ‘[a] big picture question that begs for more thorough behavioral analysis is the best way to encourage people to start new businesses (especially those which might be successful)’ (p. 351). It is the phrase in brackets which is of interest. Very few people, we can readily conjecture, start new businesses in order for them to fail. But most new firms do exactly that. Failure rates are very high, especially in the first two or three years of life. How exactly would we know whether a start-up was likely to be successful? There is indeed a point from the so-called ‘Gauntlet’ of orthodox economics which is valid in this particular context. Anyone who had a good insight into which start-ups were likely to be successful would surely be extremely rich.

Unchanging humans

One interesting thread in Don Norman’s excellent The Design of Everyday Things is the idea that while our tools and technologies are subject to constant change, humans stay the same. The fundamental psychology of humans is a relative constant.

Evolutionary change to people is always taking place, but the pace of human evolutionary change is measured in thousands of years. Human cultures change somewhat more rapidly over periods measured in decades or centuries. Microcultures, such as the way by which teenagers differ from adults, can change in a generation. What this means is that although technology is continually introducing new means of doing things, people are resistant to changes in the way they do things.

I feel this is generally the right perspective to think about human interaction with technology. There are certainly biological changes to humans based on their life experience. Take the larger hippocampus of London taxi drivers, increasing height through industrialisation, or the Flynn effect. But the basic building blocks are relatively constant. The humans of today and twenty years ago are close to being the same.

Every time I hear arguments about changing humans (or any discussion of millennials, generation X and the like), I recall the following quote from Bill Bernbach (I think first pointed out to me by Rory Sutherland):

It took millions of years for man’s instincts to develop. It will take millions more for them to even vary. It is fashionable to talk about changing man. A communicator must be concerned with unchanging man, with his obsessive drive to survive, to be admired, to succeed, to love, to take care of his own.

(If I were making a similar statement, I’d use a shorter time period than “millions”, but I think Bernbach’s point still stands.)

But for how long will this hold? Don Norman again:

For many millennia, even though technology has undergone radical change, people have remained the same. Will this hold true in the future? What happens as we add more and more enhancements inside the human body? People with prosthetic limbs will be faster, stronger, and better runners or sports players than normal players. Implanted hearing devices and artificial lenses and corneas are already in use. Implanted memory and communication devices will mean that some people will have permanently enhanced reality, never lacking for information. Implanted computational devices could enhance thinking, problem-solving, and decision-making. People might become cyborgs: part biology, part artificial technology. In turn, machines will become more like people, with neural-like computational abilities and humanlike behavior. Moreover, new developments in biology might add to the list of artificial supplements, with genetic modification of people and biological processors and devices for machines.

I suspect much of this, at least in the short term, will only relate to some humans. The masses will experience these changes with some lag.

(See also my last post on the human-machine mix.)

Getting the right human-machine mix

Much of the storytelling about the future of humans and machines runs with the theme that machines will not replace us, but that we will work with machines to create a combination greater than either alone. If you have heard the freestyle chess example, which now seems to be everywhere, you will understand the idea. (See my article in Behavioral Scientist if you haven’t.)

An interesting angle to this relationship is just how unsuited some of our existing human-machine combinations are to the unique skills a human brings. As Don Norman writes in his excellent The Design of Everyday Things:

People are flexible, versatile, and creative. Machines are rigid, precise, and relatively fixed in their operations. There is a mismatch between the two, one that can lead to enhanced capability if used properly. Think of an electronic calculator. It doesn’t do mathematics like a person, but can solve problems people can’t. Moreover, calculators do not make errors. So the human plus calculator is a perfect collaboration: we humans figure out what the important problems are and how to state them. Then we use calculators to compute the solutions.

Difficulties arise when we do not think of people and machines as collaborative systems, but assign whatever tasks can be automated to the machines and leave the rest to people. This ends up requiring people to behave in machine like fashion, in ways that differ from human capabilities. We expect people to monitor machines, which means keeping alert for long periods, something we are bad at. We require people to do repeated operations with the extreme precision and accuracy required by machines, again something we are not good at. When we divide up the machine and human components of a task in this way, we fail to take advantage of human strengths and capabilities but instead rely upon areas where we are genetically, biologically unsuited.

The result is that at the moments when we expect the humans to act, we have set them up for failure:

We design equipment that requires people to be fully alert and attentive for hours, or to remember archaic, confusing procedures even if they are only used infrequently, sometimes only once in a lifetime. We put people in boring environments with nothing to do for hours on end, until suddenly they must respond quickly and accurately. Or we subject them to complex, high-workload environments, where they are continually interrupted while having to do multiple tasks simultaneously. Then we wonder why there is failure.

And:

Automation keeps getting more and more capable. Automatic systems can take over tasks that used to be done by people, whether it is maintaining the proper temperature, automatically keeping an automobile within its assigned lane at the correct distance from the car in front, enabling airplanes to fly by themselves from takeoff to landing, or allowing ships to navigate by themselves. When the automation works, the tasks are usually done as well as or better than by people. Moreover, it saves people from the dull, dreary routine tasks, allowing more useful, productive use of time, reducing fatigue and error. But when the task gets too complex, automation tends to give up. This, of course, is precisely when it is needed the most. The paradox is that automation can take over the dull, dreary tasks, but fail with the complex ones.

When automation fails, it often does so without warning. … When the failure occurs, the human is “out of the loop.” This means that the person has not been paying much attention to the operation, and it takes time for the failure to be noticed and evaluated, and then to decide how to respond.

There is an increasing catalogue of these types of failures. Air France flight 447, which crashed into the Atlantic in 2009, is a classic case. Due to an airspeed indicator problem, the autopilot suddenly handed the pilots an otherwise well-functioning plane, and disaster followed. But perhaps this new type of failure is an acceptable price for the overall improvement in system safety and performance.

This human-machine mismatch is also a theme in Charles Perrow’s Normal Accidents. Perrow notes that many systems are poorly suited to human psychology, with long periods of inactivity interspersed by bunched workload. The humans are often pulled into the loop just at the moments things are starting to go wrong. The question is not how much work humans can safely do, but how little.

Coursera’s Data Science Specialisation: A Review

As I mentioned in my comments on Coursera’s Executive Data Science specialisation, I have looked at a lot of online data science and statistics courses to find useful training material, understand the skills of people who have done these online courses, plus learn a bit myself.

One of the best known sets of courses is Coursera’s Data Science Specialisation, created by Johns Hopkins University. It is a ten-course program that covers the data science process from data collection to the production of data science products. It focuses on implementing the data science process in R.

This specialisation is a signal that someone is familiar with data analysis in R – and the units are not bad if learning R is your goal. But neither this specialisation nor any other similar-length course I have reviewed to date offers a shortcut to the statistical knowledge necessary for good data science. A few university-length units seem to be the minimum, and even they need to be paired with experience and self-directed study (not to mention some scepticism about what we can determine).

The specialisation assessments are such that you can often pass the courses without understanding what you have been taught. Points for some courses are awarded for “effort” (see Statistical Inference below). While capped at three attempts per eight hours, the multiple-choice quizzes have effectively unlimited attempts. I don’t have a great deal of faith in university assessment processes either – particularly in Australia, where no-one wants to disrupt the flood of fees from international students by failing someone – but the assessment in these specialisations requires even less knowledge or effort. They’re not much of a signal of anything.

If you are wondering whether you should audit or pay for the specialisation, you can’t submit the assignments under the audit option. But the quizzes are basic and you can find plenty of assignment submissions on GitHub or RPubs against which you can check your work.

Here are some notes on each course. I looked through each of these over a year or so, so there might be some updates to the earlier courses (although a quick revisit suggests my comments still apply).

  1. The Data Scientist’s Toolbox: Little more than an exercise in installing R and git, together with an overview of the other courses in the specialisation. If you are familiar with R and git, skip it.
  2. R Programming: In some ways the specialisation could have been called R Programming. This unit is one of the better of the ten, and gives a basic grounding in R.
  3. Getting and Cleaning Data: Not bad for getting a grasp of the various ways of extracting data into R, but watching video after video of imports of different formats makes for less-than-exciting viewing. The principles on tidy data are important – the unit is worth doing for this alone (a small illustration follows this list).
  4. Exploratory Data Analysis: Really a course in charting in R, but a decent one at that. There is some material on principal components analysis and clustering that will likely go over most people’s heads – too much material in too little time.
  5. Reproducible Research: The subject of this unit – literate (statistical) programming – is one of the more important subjects covered in the specialisation. However, this unit seemed cobbled together – lectures repeated points and didn’t seem to follow a logical structure. The last lecture is a conference video (albeit one worth watching). If you compare this unit to the (outstanding) production effort that has gone into the Applied Data Science with Python specialisation, it compares poorly.
  6. Statistical Inference: Likely too basic for someone with a decent stats background, but confusing for someone without. This unit hits home how it isn’t possible to build a stats background in a couple of hours a week over four weeks. The peer assessment caters to this through criteria such as “Here’s your opportunity to give this project +1 for effort.”, with the option “Yes, this was a nice attempt (regardless of correctness)”.
  7. Regression Models: As per Statistical Inference, but possibly even more confusing for those without a stats background.
  8. Practical Machine Learning: Not a bad course for getting across implementing a few machine learning models in R, but there are better background courses. Start with Andrew Ng’s Machine Learning, and then work through Stanford’s Statistical Learning (which also has great R materials). Then return to this unit for a slightly different perspective. As with many of the other specialisation units, it is pitched at a level too high for someone with no background. For instance, at no point do they actually describe what machine learning is.
  9. Developing Data Products: This course is quite good, covering some of the major publishing tools, such as Shiny, R Markdown and Plotly (although skip the videos on Swirl). The strength of this specialisation is training in R, and that is what this unit focuses on.
  10. Data Science Capstone: This course is best thought of as a commitment device that will force you to learn a certain amount about natural language processing in R (the topic of the project). You are given a task with a set of milestones, and you’re left to figure it out for yourself. Unless you already know something about natural language processing, you will have to review other courses and materials and spend a lot of time on the discussion boards to get yourself across the line. Skip it and do a natural language processing course such as Coursera’s Applied Text Mining in Python (although this assumes a fair bit of skill in Python). Besides, you can only access the capstone if you have paid for and completed the other nine units in the specialisation.
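
As the small illustration promised above for the tidy data idea (my example, in Python rather than the course's R): each variable gets its own column and each observation its own row, so a "wide" table with one column per year is reshaped into long form.

```python
# A small, hypothetical illustration of tidy data: a wide table with one
# column per year is melted into long (tidy) form, one observation per row.
import pandas as pd

wide = pd.DataFrame({
    "country": ["AU", "NZ"],
    "2016": [100, 40],
    "2017": [110, 42],
})

tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
print(tidy)
```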

Charles Perrow’s Normal Accidents: Living with High-Risk Technologies

A typical story in Charles Perrow’s Normal Accidents: Living with High-Risk Technologies runs like this.

We start with a plant, airplane, ship, biology laboratory, or other setting with a lot of components (parts, procedures, operators). Then we need two or more failures among components that interact in some unexpected way. No one dreamed that when X failed, Y would also be out of order and the two failures would interact so as to both start a fire and silence the fire alarm. Furthermore, no one can figure out the interaction at the time and thus know what to do. The problem is just something that never occurred to the designers. Next time they will put in an extra alarm system and a fire suppressor, but who knows, that might just allow three more unexpected interactions among inevitable failures. This interacting tendency is a characteristic of a system, not of a part or an operator; we will call it the “interactive complexity” of the system.

For some systems that have this kind of complexity, … the accident will not spread and be serious because there is a lot of slack available, and time to spare, and other ways to get things done. But suppose the system is also “tightly coupled,” that is, processes happen very fast and can’t be turned off, the failed parts cannot be isolated from other parts, or there is no other way to keep the production going safely. Then recovery from the initial disturbance is not possible; it will spread quickly and irretrievably for at least some time. Indeed, operator action or the safety systems may make it worse, since for a time it is not known what the problem really is.

Take this example:

A commercial airplane … was flying at 35,000 feet over Iowa at night when a cabin fire broke out. It was caused by chafing on a bundle of wire. Normally this would cause nothing worse than a short between two wires whose insulations rubbed off, and there are fuses to take care of that. But it just so happened that the chafing took place where the wire bundle passed behind a coffee maker, in the service area in which the attendants have meals and drinks stored. One of the wires shorted to the coffee maker, introducing a much larger current into the system, enough to burn the material that wrapped the whole bundle of wires, burning the insulation off several of the wires. Multiple shorts occurred in the wires. This should have triggered a remote-control circuit breaker in the aft luggage compartment, where some of these wires terminated. However, the circuit breaker inexplicably did not operate, even though in subsequent tests it was found to be functional. … The wiring contained communication wiring and “accessory distribution wiring” that went to the cockpit.

As a result:

Warning lights did not come on, and no circuit breaker opened. The fire was extinguished but reignited twice during the descent and landing. Because fuel could not be dumped, an overweight (21,000 pounds), night, emergency landing was accomplished. Landing flaps and thrust reversing were unavailable, the antiskid was inoperative, and because heavy braking was used, the brakes caught fire and subsequently failed. As a result, the aircraft overran the runway and stopped beyond the end where the passengers and crew disembarked.

As Perrow notes, there is nothing complicated in putting a coffee maker on a commercial aircraft. But in a complex interactive system, simple additions can have large consequences.

Accidents of this type in complex, tightly coupled systems are what Perrow calls a “normal accident”. When Perrow uses the word “normal”, he does not mean these accidents are expected or predictable. Many of these accidents are baffling. Rather, it is an inherent property of the system to experience an interaction of this kind from time to time.

While it is fashionable to talk of culture as a solution to organisational failures, in complex and tightly coupled systems even the best culture is not enough. There is no improvement to culture, organisation or management that will eliminate the risk. That we continue to have accidents in industries with mature processes, good management and decent incentives not to blow up suggests there might be something intrinsic about the system behind these accidents.

Perrow’s message on how we should deal with systems prone to normal accidents is that we should stop trying to fix them in ways that only make them riskier. Adding more complexity is unlikely to work. We should focus instead on reducing the potential for catastrophe when there is failure.

In some cases, Perrow argues that the potential scale of the catastrophe is such that the systems should be banned. He argues nuclear weapons and nuclear energy are both out on this count. In other systems, the benefit is such that we should continue tinkering to reduce the chance of accidents, but accept they will occur despite our best efforts.

One possible approach to complex, tightly coupled systems is to reduce the coupling, although Perrow does not dwell deeply on this. He suggests that the aviation industry has done this to an extent through measures such as corridors that exclude certain types of flights. But in most of the systems he examines, decoupling appears difficult.

Despite Perrow’s thesis being that accidents are normal in some systems, and that no organisational improvement will eliminate them, he dedicates a considerable effort to critiquing management error, production pressures and general incompetence. The book could have been half the length with a more focused approach, but it does suggest that despite the inability to eliminate normal accidents, many complex, tightly coupled systems could be made safer through better incentives, competent management and the like.

Other interesting threads:

  • Normal Accidents was published in 1984, but the edition I read had an afterword written in 1999 in which Perrow examined new domains to which normal accident theory might be applied. Foreshadowing how I first came across the concept, he points to financial markets as a new domain for application. I first heard of “normal accidents” in Tim Harford’s discussion of financial markets in Adapt. Perrow’s analysis of the then-upcoming Y2K bug under his framework seems slightly overblown in hindsight.
  • The maritime accident chapter introduced (to me) the concepts of radar-assisted collisions and non-collision-course collisions. Radar-assisted collisions are a great example of the Peltzman effect, whereby vessels that would once have remained stationary or crawled through fog now speed through it. The first vessels with radar were comforted that they could see all the stationary or slow-moving obstacles as dots on their radar screen. But as the number of vessels with radar increased and those other dots also started moving with speed, we got more radar-assisted collisions. On non-collision-course collisions, Perrow notes that most collisions involve two (or more) ships that were not on a collision course, but on becoming aware of each other managed to change course to effect a collision. Coordination failures are rife.
  • Perrow argues that nuclear weapon systems are so complex and prone to failure that there is inherent protection against catastrophic accident. Not enough pieces are likely to work to give us the catastrophe. Of course, this gives reason for concern about whether they will work when we actually need them (again, maybe a positive). Perrow even asks if complexity and coupling can be so problematic that the system ceases to exist.
  • Perrow spends some time critiquing hindsight bias in assessing accidents. He gives one example of a Union Carbide plant that received a glowing report from a US government department. Following an accidental gas release some months later, that same government department described the plant as an accident waiting to happen. I recommend Phil Rosenzweig’s The Halo Effect for a great analysis of this problem in assessing the factors behind business performance after the fact.

The benefit of doing nothing

From Tim Harford:

[I]n many areas of life we demand action when inaction would serve us better.

The most obvious example is in finance, where too many retail investors trade far too often. One study, by Brad Barber and Terrance Odean, found that the more retail investors traded, the further behind the market they lagged: active traders underperformed by more than 6 percentage points (a third of total returns) while the laziest investors enjoyed the best performance.

This is because dormant investors not only save on trading costs but avoid ill-timed moves. Another study, by Ilia Dichev, noted a distinct tendency for retail investors to pile in when stocks were riding high and to sell out at low points. …

The same can be said of medicine. It is a little unfair on doctors to point out that when they go on strike, the death rate falls. Nevertheless it is true. It is also true that we often encourage doctors to act when they should not. In the US, doctors tend to be financially rewarded for hyperactivity; everywhere, pressure comes from anxious patients. Wiser doctors resist the temptation to intervene when there is little to be gained from doing so — but it would be better if the temptation was not there. …

Harford also reflects on the competition between humans and computers, covering similar territory to that in my Behavioral Scientist article Don’t Touch the Computer (even referencing the same joke).

The argument for passivity has been strengthened by the rise of computers, which are now better than us at making all sorts of decisions. We have been resisting this conclusion for 63 years, since the psychologist Paul Meehl published Clinical vs. Statistical Prediction. Meehl later dubbed it “my disturbing little book”: it was an investigation of whether the informal judgments of experts could outperform straightforward statistical predictions on matters such as whether a felon would violate parole.

The experts almost always lost, and the algorithms are a lot cleverer these days than in 1954. It is unnerving how often we are better off without humans in charge. (Cue the old joke about the ideal co-pilot: a dog whose job is to bite the pilot if he touches the controls.)

The full article is here.