Tetlock’s Expert Political Judgment: How Good Is It? How Can We Know?

Author

Jason Collins

Published

August 25, 2016

A common summary of Philip Tetlock’s Expert Political Judgment: How Good Is It? How Can We Know? is that “experts” are terrible forecasters. There is some truth in that summary, but I took a few different lessons from the book. While experts are bad, others are worse. Simple algorithms and more complex models outperform experts. And importantly, forecasting itself is not a completely pointless task.

Tetlock’s book reports on what must be one of the grander undertakings in social science. Cushioned by his recently gained tenure, Tetlock asked a range of experts to predict future events. With the need to see how the forecasts panned out, the project ran for almost 20 years.

The basic methodology was to ask each participant to rate how likely each of three possible outcomes for a political or economic event was, on a scale of 0 to 10 (with, assuming some basic mathematical literacy, the ratings across the three options summing to 10). An example question might be whether a government will retain, lose or strengthen its position after the next election. Or whether GDP growth will be below 1.75 per cent, between 1.75 per cent and 3.25 per cent, or above 3.25 per cent.

Once the results were in, Tetlock scored the participants on two dimensions - calibration and discrimination. To get a high calibration score, the frequency with which events are predicted needs to correspond with their actual frequency. For instance, events predicted to occur with a 10 per cent probability need to occur around 10 per cent of the time, and so on. Given experts made many judgments, these types of calculations could be made.

To score highly on discrimination, the participant needs to assign the maximum probability to things that happen and the minimum to things that don't. The closer predictions sit to the ends of the scale, the higher the discrimination score. It is possible to be perfectly calibrated but a poor discriminator (a fence sitter hovering around the base rate), or a strong discriminator (using only the extreme values, and using them correctly).
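
A rough sketch of how such scores might be computed. This follows the standard calibration/discrimination decomposition of the Brier score; Tetlock's exact indices may differ in detail, and the example data below are invented.

```python
import numpy as np

def calibration_discrimination(forecasts, outcomes, bins=np.arange(0, 1.1, 0.1)):
    """Decompose forecast quality: lower calibration score is better,
    higher discrimination score is better."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)  # 1 if the event happened, else 0
    base_rate = outcomes.mean()
    calibration = 0.0
    discrimination = 0.0
    # Group forecasts into probability bins (e.g. all the "around 10 per cent" calls)
    bin_ids = np.digitize(forecasts, bins[1:-1])
    for b in np.unique(bin_ids):
        mask = bin_ids == b
        n_k = mask.sum()
        mean_forecast = forecasts[mask].mean()
        observed_freq = outcomes[mask].mean()
        # Calibration: gap between predicted and observed frequencies in each bin
        calibration += n_k * (mean_forecast - observed_freq) ** 2
        # Discrimination: how far each bin's outcome rate sits from the base rate
        discrimination += n_k * (observed_freq - base_rate) ** 2
    return calibration / len(forecasts), discrimination / len(forecasts)

# A fence sitter who always says 50 per cent on a 50/50 question set is well
# calibrated but discriminates nothing; using only 0 and 1 correctly maximises both.
print(calibration_discrimination([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # (0.0, 0.0)
print(calibration_discrimination([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0]))  # (0.0, 0.25)
```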

From Tetlock’s analysis of these scores come the headline findings of the book. I take them as: the experts perform poorly, although better than the unsophisticated; simple extrapolation algorithms and more complex statistical models outperform the experts; and among the experts, the “foxes”, who draw on many ideas and sources of information, outperform the “hedgehogs”, who view the world through the lens of one big idea.

As Bryan Caplan argues, Tetlock gives experts a harder time than they might deserve. The “chimps” - the benchmark of random guessing - are helped by a combination of hard questions and constrained answer options. There is no option to predict one million per cent growth in GDP next year. We might expect experts to shine more if there were “dumb” questions. Further, the horrible performance of the Berkeley undergrads, the proxy for unsophisticated forecasters, gets only rare mention. On the flipside, the baseline for assessment should not be the chimp or these undergrads, but the simple extrapolation algorithms - and there the experts measure up poorly.

The expected behaviour of the experts may provide a partial defence. They were filling out a survey, and were unlikely to generate a model for every question. Many judgements were likely made off the top of the head, with no serious stakes (including no public shaming). This does, however, raise the question of why they were so hopeless in their own fields of expertise, where they might already have had some of these models available.

So what is it about foxes and hedgehogs that leads to differences in performance?

As a start, the approach of foxes lines up with the existing literature on forecasting. This literature shows that the average prediction of a group of forecasters is generally more accurate than the majority of the forecasters from whom the average is computed, that trimming outliers further enhances accuracy, and that there is opportunity for further improvement through the Delphi technique. In line with this, Tetlock suggests foxes factor conflicting considerations into their judgements in a flexible, weighted-averaging fashion.
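
As an illustration of the aggregation result (with invented forecasts, not Tetlock's data), a simple versus trimmed average might look like this:

```python
import numpy as np
from scipy import stats

# Hypothetical probability forecasts from five experts for the same event
forecasts = np.array([0.15, 0.20, 0.25, 0.30, 0.90])  # one extreme outlier

simple_average = forecasts.mean()
# Trimmed mean: drop the most extreme 20 per cent from each tail before averaging
trimmed_average = stats.trim_mean(forecasts, proportiontocut=0.2)

print(simple_average)   # 0.36, pulled up by the outlier
print(trimmed_average)  # 0.25, closer to the bulk of opinion
```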

Next, foxes are better Bayesians, in that they update their beliefs in response to new evidence and in proportion to the extremity of the odds they placed on possible outcomes. They weren’t perfect Bayesians, however - when surprised by a result, Tetlock calculated that foxes moved around 59 per cent of the prescribed amount, compared to 19 per cent for hedgehogs. In some of the exercises, hedgehogs moved in the opposite direction.
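
A toy illustration of what “the prescribed amount” means (my own numbers, not Tetlock's reputational-bet calculations): Bayes' rule fixes how far a forecaster should move after a surprise, and the actual move can be expressed as a fraction of that prescribed shift.

```python
def bayes_posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Prescribed posterior belief in a hypothesis after seeing the evidence."""
    numerator = prior * p_evidence_if_true
    return numerator / (numerator + (1 - prior) * p_evidence_if_false)

# Hypothetical hedgehog: very confident in a worldview that made the surprise unlikely
prior = 0.9
prescribed = bayes_posterior(prior, p_evidence_if_true=0.2, p_evidence_if_false=0.8)

# Fraction of the prescribed belief change actually made (Tetlock reports roughly
# 59 per cent for foxes and 19 per cent for hedgehogs; these beliefs are made up)
reported_new_belief = 0.85
fraction_moved = (prior - reported_new_belief) / (prior - prescribed)
print(round(prescribed, 3), round(fraction_moved, 3))  # 0.692 0.241
```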

There was a lot of evidence that both foxes and hedgehogs were more egocentric than natural Bayesians. A natural Bayesian would consider the probability of the event occurring if their view of the world is correct (a view which itself has a probability attached to it) and the probability of the event occurring if their understanding of the world is wrong. But few spontaneously factored other views into their assessment of probabilities. When Tetlock broke down his experts’ predictions, the odds were almost always calculated based on their interpretation of the world being correct.
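
In code, the contrast looks something like the sketch below (invented numbers): the egocentric forecast conditions only on the forecaster's own view being right, while the fuller calculation averages over the possibility of being wrong via the law of total probability.

```python
# Invented numbers for illustration
p_my_view_correct = 0.8    # confidence that my model of the world is right
p_event_if_correct = 0.7   # probability of the event if my view is right
p_event_if_wrong = 0.2     # probability of the event if a rival view is right

# Egocentric forecast: behave as if my worldview is certainly correct
egocentric = p_event_if_correct

# Law of total probability: weight each worldview by how likely it is to be right
full_bayesian = (p_my_view_correct * p_event_if_correct
                 + (1 - p_my_view_correct) * p_event_if_wrong)

print(egocentric)                # 0.7
print(round(full_bayesian, 2))   # 0.6
```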

Foxes were also less prone to hindsight effects. Many experts recalled assigning higher probabilities to outcomes that materialised than they actually had. As Tetlock notes, it is hard to tell someone they got it wrong if they think they got it right. (Is hindsight bias, as suggested by one hedgehog, an adaptive mechanism that unclutters the mind?)

The chapter of the book where the hedgehogs wheel out the defences against their poor performance is somewhat amusing. As Tetlock points out, forecasters who thought they were good at the beginning sounded like radical skeptics about the value of forecasting by the end.

The experts commonly pointed out that their prediction was a near miss, so the result shouldn’t be held against them. But almost no one argued that the non-occurrence of an event shouldn’t be held against others who had predicted it.

They also tended to claim that “I made the right mistake”, as it is better to be safe than sorry. But all of Tetlock’s attempts to adjust the scoring to help hedgehogs in these cases failed to close the gap.

Some hedgehogs claimed that the questions did not cover a long enough time period. There are irreversible trends at work in the world today, and while specific events might be hard to predict, the shape of the world in the long term is clear. But the problem is that the hedgehogs were ideologically diverse, and only a few could be right about any long-term trends that exist.

One thing that might be said in favour of the hedgehogs is that the accuracy of the average of hedgehog forecasts was similar to the average of fox forecasts. The average fox forecast beats about 70 per cent of foxes, but the average hedgehog forecast beats 95 per cent of hedgehogs. The hedgehogs benefit in that their more extreme mistakes are balanced out. The result is that a team of hedgehogs might curtail each other’s excesses.

A better angle of defence is that the real goal of forecasting is political impact or reputation, where only the confident survive. Hedgehogs are also good at avoiding distraction in high noise environments, which becomes apparent when examining the major weakness of foxes.

Tetlock put some of his experts through a scenario exercise. In this exercise, the high-level forecasts were branched into a large number of sub-scenarios, with probabilities to be allocated to each. For example, when given the question of whether Canada would break up (this was around the time of the Quebec separatist referendum), combinations of outcomes involving separatist party success at elections, referendum results, economic downturns and levels of acrimony were presented, rather than the simple question of whether Quebec would secede or not.

As has been shown in the behavioural literature, when this type of task is undertaken, the probabilities assigned to the components often sum to more than one. For the Quebec question, the initial probabilities added up to 1.0 for the basic question - as expected - but to an average of 1.58 for the branched scenarios. Foxes suffered the most in this exercise, producing estimates that summed to 2.09.

To constrain this problem, it is common to end the branching exercise with a requirement to adjust the probabilities such that they add to one. But the foxes tended not to end up where they started for the simple question, with the branching followed by adjustment reducing their forecasting accuracy down to the level of hedgehogs.
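
A sketch of the mechanics (with invented numbers in the flavour of the Quebec example): unpacking a question into sub-scenarios tends to inflate the summed probability, and forcing the total back to one need not return the forecaster to their original answer.

```python
# Probabilities assigned to sub-scenarios after branching (invented numbers;
# in Tetlock's data the branched totals averaged well above 1)
secede_branches = {"separatists win election, acrimonious split": 0.30,
                   "separatists win election, negotiated split": 0.25}
stay_branches = {"referendum fails, economy stable": 0.45,
                 "referendum fails, economic downturn": 0.35}

total = sum(secede_branches.values()) + sum(stay_branches.values())
print(round(total, 2))  # 1.35 -- unpacking has inflated the total above 1

# Forced renormalisation so the probabilities sum to one
p_secede = sum(secede_branches.values()) / total
print(round(p_secede, 3))  # 0.407 -- need not match the original unbranched answer
```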

Given the net result of the scenario exercise was to confuse the foxes and fail to open the minds of the hedgehogs, it could be judged a low-value exercise. For people advocating scenario development, pre-mortems and red teaming, the possibly deleterious effects on some forecasters need to be considered.

In sum, it’s a grand book. There are some points where deeper analysis would have been handy - such as when he suggests there is disagreement from “psychologists who subscribe to the argument that fast-and-frugal heuristics - simple rules of thumb - perform as well as, or better than, more complex, effort demanding algorithms” without actually examining whether they are at odds with his finding of the forecasting superiority of foxes. But that’s a small niggle in a fine piece of work.