Tetlock and Gardner’s Superforecasting: The Art and Science of Prediction
Philip Tetlock and Dan Gardner’s Superforecasting: The Art and Science of Prediction doesn’t quite measure up to Tetlock’s superb Expert Political Judgment (read EPJ first), but it contains more than enough interesting material to make it worth the read.
The book emerged from a tournament conducted by the Intelligence Advanced Research Projects Activity (IARPA), designed to pit teams of forecasters against each other in predicting political and economic events. These teams included Tetlock’s Good Judgment Project (also run by Barbara Mellers and Don Moore), a team from George Mason University (in which I was a limited participant), and teams from MIT and the University of Michigan.
The result of the tournament was such a decisive victory for the Good Judgment Project over the first two years that IARPA dropped the other teams from later years. (It wasn’t a completely open fight - the prediction markets could not use real money. Still, Tetlock concedes that the money-free prediction markets did pretty well, and there is scope to test them further in the future.)
Tetlock’s formula for a successful team is fairly simple. Get lots of forecasts, calculate the average of the forecasts, and give extra weight to the top forecasters - a version of the wisdom of the crowds. Then extremise the forecast: if it is a 70% probability, bump it up to 85%; if 30%, cut it to 15%.
The idea behind extremising is quite clever. No one in the group has access to all the dispersed information. If everyone had all the available information, this would tend to raise their confidence, which would result in a more extreme forecast. Since we can’t give everyone all the information, extremising is an attempt to simulate what would happen if we did. Getting the benefits of extremising, however, requires diversity: if everyone holds the same information, there is no sharing of information to be simulated.
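As a rough sketch of this aggregate-then-extremise recipe - the book does not spell out the weights or the strength of extremising the Good Judgment Project used, so the parameters below are illustrative assumptions only - a weighted average followed by a push away from 50% on the odds scale looks something like this:

```python
import numpy as np

def aggregate_and_extremise(probs, weights=None, a=2.0):
    """Pool individual probability forecasts, then push the result away from 0.5.

    probs   : individual forecasts in (0, 1)
    weights : optional weights (e.g. heavier for the top forecasters)
    a       : extremising exponent; a > 1 pushes the pooled forecast towards
              0 or 1, a = 1 leaves it unchanged (illustrative value only)
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.ones_like(probs) if weights is None else np.asarray(weights, dtype=float)
    weights = weights / weights.sum()

    # Weighted average of the individual forecasts ("wisdom of the crowds")
    p = float(np.dot(weights, probs))

    # Extremise on the odds scale: raise the odds to the power a
    odds = (p / (1 - p)) ** a
    return odds / (1 + odds)

# A pooled forecast of 0.7 is pushed up towards 0.85
print(aggregate_and_extremise([0.6, 0.7, 0.8]))
```

With a = 2, a pooled 70% becomes roughly 84% and a pooled 30% drops to roughly 16%, close to the 70%-to-85% example above.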
But the book is not so much about why the Good Judgment Project was superior to the other teams. Mostly it is about the characteristics of the top 2% of the Good Judgment Project forecasters - a group that Tetlock calls superforecasters.
Importantly, “superforecaster” is not a label earned through blind luck. The correlation in forecasting accuracy for Good Judgment Project members between one year and the next was around 0.65, and about 70% of superforecasters stayed in the top 2% the following year.
Some of the characteristics of superforecasters are to be expected. Whereas the average Good Judgment participant scored better than 70% of the population on IQ, superforecasters were better than about 80%. They were smarter, but not markedly so.
Tetlock argues that much of the difference lies in technique, and this is where he focuses. When faced with a complex question, superforecasters tended to first break it into manageable pieces. For the question of whether French or Swiss inquiries would find elevated levels of polonium in Yasser Arafat’s body (had he been poisoned?), they might ask whether polonium (which decays) could still be detected in a man dead for years, how polonium could have made its way into his body, and so on. They don’t jump straight to the implied question of whether Israel poisoned Arafat (which the question was technically not about).
Superforecasters also tended to take the outside view on each of these sub-questions: what is the base rate for this kind of event? (Not so easy for the Arafat question.) Only then did they take the “inside view”, looking for information idiosyncratic to that particular question.
The most surprising finding (to me) was that superforecasters were highly granular in their probability forecasts, and that granularity predicts accuracy. People who stick to tens (10%, 20%, 30% and so on) are less accurate than those who stick to fives (5%, 10%, 15% and so on), who are in turn less accurate than those who use ones (35%, 36%, 37% and so on). Rounding superforecaster estimates reduces their accuracy, although rounding has little effect on regular forecasters. A superforecaster will distinguish between 63% and 65%, and this makes them more accurate.
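Tetlock measures accuracy with Brier scores - the mean squared difference between the probability given and the 0/1 outcome, where lower is better - and the rounding exercise is easy to mimic. The data below is simulated purely for illustration; it is not his result, just the same calculation run on made-up forecasts:

```python
import numpy as np

rng = np.random.default_rng(0)

def brier(probs, outcomes):
    # Mean squared difference between forecasts and 0/1 outcomes (lower is better)
    return float(np.mean((np.asarray(probs) - np.asarray(outcomes)) ** 2))

def round_to(probs, step):
    # Round each forecast to the nearest multiple of step (0.10, 0.05, 0.01, ...)
    return np.clip(np.round(np.asarray(probs) / step) * step, 0.0, 1.0)

# A simulated, well-calibrated forecaster: stated probabilities sit close to
# the true probabilities that generate the outcomes
true_p = rng.uniform(0, 1, 100_000)
outcomes = rng.binomial(1, true_p)
forecasts = np.clip(true_p + rng.normal(0, 0.02, true_p.size), 0, 1)

for step in (0.10, 0.05, 0.01):
    print(f"rounded to nearest {step:.2f}: Brier = {brier(round_to(forecasts, step), outcomes):.4f}")
```

For a forecaster whose probabilities carry fine-grained information, each coarsening step throws a little of that information away and the Brier score worsens slightly - the direction of the effect Tetlock reports.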
Partly this granularity is reflected in the updates they make when new information arrives (although they are also more accurate in their initial estimates). Being a superforecaster requires monitoring the news and reacting by the right amount. There are occasional big updates - which Tetlock suggests superforecasters can make because they are not wedded to their forecasts the way a professional pundit is - but most of the time the tweaks represent an iteration toward an answer.
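The book frames this updating in a broadly Bayesian spirit. As a loose illustration (not a procedure Tetlock prescribes), an update on the odds scale shows why mildly informative news warrants a tweak while strongly diagnostic news warrants a jump:

```python
def bayes_update(prior, likelihood_ratio):
    """Update a probability in the light of new evidence.

    likelihood_ratio = P(evidence | event) / P(evidence | no event).
    A ratio near 1 barely moves the forecast; a large ratio moves it a lot.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Mildly informative news: a small tweak
print(bayes_update(0.60, 1.2))   # ~0.64
# Strongly diagnostic news: a big jump
print(bayes_update(0.60, 6.0))   # 0.90
```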
Tetlock suggests that such fine-grained distinctions would not come to people naturally, as making them would not have been evolutionarily favourable. If there is a lion in the grass, there are three likely responses - yes, no, maybe - not a hundred shades of grey. But the reality is that there needs to be a threshold for each response, and evolution can act on fine distinctions: a gene that leads people to decide to run with 1% greater accuracy will, over many generations, spread.
Superforecasters also suffer less from scope insensitivity. People will pay roughly the same amount to save 2,000 or 200,000 migrating birds. Similarly, when asked whether an event would occur in the next 6 or 12 months, regular forecasters predicted approximately the same probability for each window. Superforecasters, by contrast, tended to spot the difference in timeframes and adjust their probabilities accordingly, although they did not exhibit perfect scope sensitivity. I expect an explicit examination of base rates would help reduce that remaining scope insensitivity, as a base rate tends to relate to a specific timeframe.
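On the timeframe point, one simple way to see how much a probability should change with the window is a constant-hazard assumption (my illustration, not the book’s): convert the base rate over one period into a hazard rate and rescale it to the new period.

```python
import math

def prob_within(months, annual_prob):
    """P(event occurs within `months` months), assuming a constant hazard rate
    implied by the annual base rate (a simplifying, Poisson-style assumption)."""
    hazard = -math.log(1 - annual_prob)        # annual hazard from annual probability
    return 1 - math.exp(-hazard * months / 12)

annual_prob = 0.20                      # hypothetical base rate: 20% chance per year
print(prob_within(6, annual_prob))      # ~0.11 - roughly half, not the same 20%
print(prob_within(12, annual_prob))     # 0.20
```

A scope-sensitive forecaster should land somewhere near this kind of scaling; giving the same number for 6 and 12 months ignores the window entirely.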
A couple of the characteristics Tetlock gives to the superforecasters seem a bit fluffy. Tetlock describes them as having a “growth mindset”, although the evidence presented simply suggests that they work hard and try to improve.
Similarly, Tetlock labels the superforecasters as having “grit”. I’ll just call them conscientious.
Beyond the characteristics of superforecasters, Tetlock revisits a couple of themes from Expert Political Judgment. For a start, there is a need to attach numbers to forecasts, or else they are fluff. Tetlock relates the story of Sherman Kent asking intelligence officers what they took the words “serious possibility” in a National Intelligence Estimate to mean (the wording related to the possibility of a Soviet invasion of Yugoslavia in 1951). The answers turned out to range anywhere from a 20% to an 80% probability.
Then there is the need to score forecasts against appropriate benchmarks - such as no change or the base rate. As Tetlock points out, lauding Nate Silver for picking 50 of 50 states in the 2012 Presidential election is a “tad overwrought” when compared with a naive no-change prediction, which would have picked 48.
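A sketch of what scoring against such benchmarks looks like: compare the forecaster’s Brier score with those of naive strategies like always predicting the base rate or a 50/50 coin flip. The questions, outcomes, and probabilities below are invented purely for illustration.

```python
import numpy as np

def brier(probs, outcomes):
    # Mean squared difference between forecasts and 0/1 outcomes (lower is better)
    return float(np.mean((np.asarray(probs) - np.asarray(outcomes)) ** 2))

# Hypothetical yes/no questions: realised outcomes and one forecaster's probabilities
outcomes   = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
forecaster = np.array([0.8, 0.2, 0.4, 0.7, 0.1, 0.3, 0.2, 0.6, 0.3, 0.1])

# Benchmark 1: always predict the base rate of "yes" (estimated here in-sample
# for simplicity; in practice it would come from historical data)
base_rate = np.full_like(forecaster, outcomes.mean())
# Benchmark 2: a maximally uninformative 50/50 forecast
coin_flip = np.full_like(forecaster, 0.5)

print("forecaster:", brier(forecaster, outcomes))
print("base rate :", brier(base_rate, outcomes))
print("coin flip :", brier(coin_flip, outcomes))
```

Beating the coin flip is easy; beating a sensible base-rate or no-change benchmark is the test that matters.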
One contrast with the private Expert Political Judgment project was that forecasters in the public IARPA tournament were better calibrated. While the nature of the questions may have been a factor - the tournament questions related to shorter timeframes to allow the tournament to deliver results in a useful time - Tetlock suggests that publicity creates a form of accountability. There was also less difference between foxes and hedgehogs in the public environment.
One interesting point buried in the notes is Tetlock’s acknowledgement of the various schools of thought on how accurate people are, such as the work by Gerd Gigerenzer and friends on the accuracy of our gut instincts and simple heuristics. Without going into much detail, Tetlock declares the “heuristics and biases” program the best approach for bringing forecasting error rates down. The short training guidelines - contained in the appendix to the book and targeted at typical biases - improved accuracy by 10%. While Tetlock doesn’t really put this claim to the test by comparing the approaches head to head (what would a Gigerenzer-led team do?), the success of the Good Judgment team makes it hard, at least for the moment, to argue with.