DECEMBER 21, 2012
The Signal and the Noise: Why So Many Predictions Fail—But Some Don’t
By Nate Silver
(Penguin, 534 pp., $27.95)
IN THE HISTORY OF election forecasting, 2012 was 1936 all over again, with the roles updated. In 1936, a trio of new forecasters—Elmo Roper, Archibald Crossley, and George Gallup—used statistical sampling methods and predicted that Franklin D. Roosevelt would win re-election, contrary to the best-known “straw poll” of the time, conducted by The Literary Digest. The Digest mailed ballots to millions of automobile owners and telephone subscribers, groups that in the midst of the Depression drastically over-represented Republicans. The sheer number of ballots it received, the Digest assumed, would guarantee an accurate forecast. In fact, its straw poll turned out to overestimate the vote for Alfred Landon, the Republican candidate for president, by nineteen points. That the new statistical forecasters could make a more accurate prediction on the basis of a small sample of voters came as a revelation to the public. Ever since, the sample survey has been the dominant means of measuring public opinion.
In 2012, a new generation of quants—mathematically minded polling aggregators—had their moment of public vindication. During the campaign’s final weeks, the national polls individually indicated a tight race, with Gallup showing a significant margin for Mitt Romney among “likely” voters. Political reporters in the major news media treated the election as a toss-up, while conservatives such as Karl Rove claimed that Romney had momentum and would emerge with a victory. The quants—including Nate Silver, the founder of FiveThirtyEight, a blog at The New York Times, and Sam Wang, a professor of neuroscience at Princeton, who runs a site called the Princeton Election Consortium—were having none of it. On the basis of data from a large number of surveys (especially state polls), they insisted that Barack Obama was an overwhelming favorite, a conclusion that outraged some conservatives who singled out Silver for denunciation and turned the election into a methodological showdown. The aggregators were only estimating probabilities, and even if Romney had won, their models could have been correct. But Obama’s victory, like FDR’s in 1936, became a validating event for a more sophisticated method of making election forecasts.
With his prominence at the Times, Silver embodies the latest triumph of the nerds, but his new book on prediction is more skeptical, nuanced, and interesting than that image suggests. Anyone who reads Silver’s political analysis knows that his gift is as much with words as with numbers; he writes about complex analytical problems with exceptional clarity. He made his reputation originally with a statistical model for predicting baseball players’ performances. The surprise in Silver’s book is how widely he ranges beyond sports and politics, artfully drawing on his own journalism and personal experience as well as the lessons of statistical modeling to tell a story that is as entertaining as it is instructive about the state of human powers of prediction.
The true state of those powers, according to Silver, is far less advanced than many of the practitioners of the dark arts of forecasting acknowledge. In the mid-twentieth century, the advent of new analytical techniques in the social and natural sciences led to high expectations for progress in prediction: by drawing on increasingly plentiful sources of data, forecasters would presumably become more accurate in projecting the business cycle, changes in population, and rates of technological advance. More recently, the vast rivers of information created by electronic communication and commerce have fed another round of rising expectations of more accurate forecasting. Thanks to the arrival of Big Data, according to Chris Anderson, the editor of Wired, waterfalls of information will answer our needs, and we’ll be able to dispense with scientific theory.
Silver rejects this view. In his book he set out to survey “data-driven predictions” in a variety of disciplines, but discovered that “prediction in the era of Big Data was not going very well.” Contrary to Anderson, he insists that theory is vital: “We think we want information when we really want knowledge.” The moral of his book is not that the statistical models are always right. The story that Silver tells is that we often get better predictions by supplementing the models with judgments that incorporate more information. So don’t check the lessons of your experience at the door. And beware of geeks bearing models that have no underlying causal logic.
FEW READERS will be familiar with all the stops on Silver’s tour of the prediction disciplines. He visits specialized fields where experts wrestle with forecasting the weather, earthquakes, epidemics, and the economy, and moves from the world of gambling and games (sports betting, poker, and chess) to realms where prediction is truly high stakes (the financial crisis, climate change, and terrorism). But in all these explorations Silver has a consistent theme. In making predictions, people need to think probabilistically, drawing on a wide range of evidence, continually asking hard questions, revising their estimates, recognizing the value of aggregated forecasts, and probing to see whether they may have mistakenly ruled out possibilities that were unfamiliar or arbitrarily left out of their research.
The book is partly a catalogue of the sources of forecasting overreach. Some of these sources, Silver suggests, lie in human psychology. “Our brains, wired to detect patterns, are always looking for a signal,” he writes, “when instead we should appreciate how noisy the data is [sic].” This bias leads us to see patterns where there may be none and contributes to overconfidence in predictions. Silver suggests that rather than being immune to these problems, some experts are especially prone to them. In a twenty-year study, for example, Philip Tetlock, a professor of psychology at the University of Pennsylvania, found that political experts did only slightly better than chance in predicting political events. And among those experts, “hedgehogs,” who had one big idea, did significantly worse than “foxes,” who used multiple approaches.
While the hedgehogs’ mistakes were owed to overly encompassing theories, there is a related error, known as “overfitting,” which Silver calls “the most important scientific problem you’ve never heard of.” Working from historical data, forecasters may develop an elaborate model that seems to account for every wiggle in a curve. But that model may be, as Silver says, an “overly specific solution to a general problem,” with little or no value in making predictions.
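Overfitting is easy to reproduce on a small scale. The sketch below is an illustration only, not an example from Silver's book: the five data points are invented, the underlying relationship is assumed to be y = x with noise of plus or minus 0.5, and the helper names `fit_line` and `fit_interpolant` are hypothetical. It compares a simple straight-line fit with a curve that passes through every historical point exactly.

```python
# Invented "historical" data: the true relationship is assumed to be y = x,
# and each observation is off by +0.5 or -0.5.
xs = [0, 1, 2, 3, 4]
ys = [0.5, 0.5, 2.5, 2.5, 4.5]


def fit_line(xs, ys):
    """Ordinary least-squares straight line: a simple, general model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every point: it explains every wiggle."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


line = fit_line(xs, ys)
wiggly = fit_interpolant(xs, ys)

# On the historical data the wiggly curve looks perfect (zero error).
wiggly_history_error = max(abs(wiggly(x) - y) for x, y in zip(xs, ys))

# One step beyond the data, where the assumed true value is 5:
line_forecast = line(5)      # about 5.1
wiggly_forecast = wiggly(5)  # about 20.5
```

The curve that fits the past perfectly misses the next value by more than fifteen, while the cruder line misses by about a tenth: an overly specific solution to a general problem.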
The absence of incentives for accuracy may also explain some persistent errors. Every week, the panelists on TV’s long-running shouting match, The McLaughlin Group, make their predictions. Silver evaluated nearly a thousand of those predictions, and found that the pundits might just as well have been tossing a coin: they got things right just about as often as they got them wrong. But television shows survive on the basis of their ratings, not the accuracy of their forecasts. Skewed incentives have also contributed to more serious failures of prediction. One of the critical steps leading to the financial crisis of 2008 was the failure of ratings agencies to predict the risk of mortgage-backed securities. The agencies later pleaded that it was impossible to know the risks, but others at the time identified them. The trouble was that the agencies had no incentive to recognize the risks; they would have made less money if they did.
The financial crisis, as Silver conceives it, was a chain of prediction failures, from homeowners making mistaken predictions about rising housing prices to the economists in the Obama administration initially making mistaken predictions about the depth and the duration of the recession. Though there were some exceptions, economic forecasters generally failed to see the crisis coming. Following the lead of one forecaster who got it right, Jan Hatzius of Goldman Sachs, Silver focuses on the limits in understanding causal relations in the economy, the difficulties in grasping economic change, and the poor quality of much of the data that go into forecasts.
Another source of overreach is that a discipline may simply not have developed to the point where it can provide the basis for accurate predictions. The history of earthquake prediction is littered with failures. Although scientists can make forecasts about the probability that a quake will hit an area over a period of years or centuries, they do not have the ability to predict exactly when and where an earthquake will strike. Nor is there any sign that the field is close to that goal.
Weather forecasting offers Silver an instructive contrasting case. Although the weather system is extraordinarily complex, the basic science is well understood, and the accuracy of forecasts has improved significantly in recent decades as a result of increased computer power and more sophisticated models. Weather forecasting also illustrates how human judgment can either improve or worsen the accuracy of forecasts, depending on an organization’s incentives. At the National Weather Service, one of the least appreciated services that the federal government provides, meteorologists make use of both computational power and their own judgment to arrive at forecasts. The meteorologists’ judgments, according to data from the agency, result in precipitation forecasts that are about 25 percent more accurate, and temperature forecasts that are about 10 percent more accurate, than the computer results alone. Improved forecasts have real human benefit; the National Hurricane Center can now do what it was unable to do twenty-five years ago: make predictions three days in advance of where a hurricane will make landfall that are accurate enough to enable people to evacuate in time.
But some weather forecasts are less accurate because of judgments by forecasters who have other motivations. Most Americans do not get weather forecasts directly from the National Weather Service; they receive them instead from commercial services such as AccuWeather and the Weather Channel, and also from local TV stations, which tailor the government data to their own needs. According to Silver, the forecasts of the National Weather Service are well-calibrated: when it forecasts a 40 percent chance of rain, it actually does rain 40 percent of the time. The Weather Channel, in contrast, has a slight “wet bias.” When there is a 5 percent chance of rain, it says that the odds are 20 percent because people are angry at a forecaster when they leave home without an umbrella and get soaked, whereas they are delighted when they take an umbrella and the sun shines. But it’s at the local level that forecasts get seriously distorted. According to a study of Kansas City stations, the local weathermen provide much worse forecasts than the National Weather Service and are unashamed about it. (“Accuracy is not a big deal,” one of them said.) Again, television does not reward accuracy because ratings come first.
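Calibration in this sense can be checked with simple bookkeeping: group the days by the probability the forecaster stated, then see how often it actually rained. The sketch below is a minimal illustration; the function name `calibration_table` is hypothetical and both forecast records are invented to mirror the figures in the paragraph above, not drawn from real data.

```python
from collections import defaultdict


def calibration_table(stated_probs, it_rained):
    """Map each stated probability to the observed frequency of rain."""
    tally = defaultdict(lambda: [0, 0])  # stated probability -> [rainy days, all days]
    for p, rained in zip(stated_probs, it_rained):
        tally[p][0] += int(rained)
        tally[p][1] += 1
    return {p: rainy / days for p, (rainy, days) in tally.items()}


# Invented record: ten days forecast at 40 percent, rain on four of them.
calibrated = calibration_table([0.4] * 10, [True] * 4 + [False] * 6)

# Invented record with a "wet bias": twenty days forecast at 20 percent,
# rain on only one of them (5 percent).
wet_bias = calibration_table([0.2] * 20, [True] * 1 + [False] * 19)
```

A well-calibrated forecaster's table maps each stated probability to roughly the same observed frequency; a biased one shows a persistent gap, here 20 percent stated against 5 percent observed.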
The role of judgment in forecasts also comes up in Silver’s discussion of the field where he initially made his reputation. Professional baseball scouts, he now concludes, are more like the National Weather Service’s meteorologists than the local weatherman: the scouts’ judgments contribute to more accurate evaluations of baseball players. A decade ago, when Michael Lewis wrote Moneyball, statheads and scouts eyed each other with suspicion verging on contempt. That was the world of baseball that Silver entered a few years out of college when he turned a childhood interest (“as an annoying little math prodigy, I was attracted to all the numbers in the game”) into a professional career. The statistical model that he developed, called PECOTA, was not the first to project players’ performances, but he says it had its advantages for a while, until others caught up with its innovations. Revisiting his old predictions of how well minor-league prospects would do in the major leagues, he finds that his system did not perform as well as the more traditional list produced by Baseball America. The old Moneyball battle is over. Supplement a statistical model with trained judgment, and the result is improved predictions.
Silver’s acceptance of judgment in baseball and weather forecasts reflects his more general outlook. The hero of his book is Thomas Bayes, the eighteenth-century English author of a classic theorem showing how to revise a prior estimate of the probability of an event on the basis of additional information. The difficulty that some statisticians have with a Bayesian approach is that it demands an exercise of judgment in establishing “priors,” whereas the dominant “frequentist” approach in statistics seems more objective because judgment plays no role in it. Silver’s discussion of these issues is rather one-sided, and he veers into a personal attack on R.A. Fisher, one of the central figures in the frequentist tradition. But he is right to argue that the demand for judgment as a point of departure in Bayesian thinking has a real merit in forcing the analyst to take background and context into account.
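The revision that Bayes's theorem prescribes takes only a few lines to write down. The sketch below is a generic illustration with invented numbers, not an example taken from the book; the function name `bayes_update` and its parameter names are hypothetical. The prior is the analyst's judgment before the new evidence arrives, and the two likelihoods say how probable that evidence would be if the hypothesis were true or false.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return the probability of a hypothesis after seeing one piece of evidence."""
    weight_if_true = prior * p_evidence_if_true
    weight_if_false = (1.0 - prior) * p_evidence_if_false
    return weight_if_true / (weight_if_true + weight_if_false)


# Invented numbers: start from an even-odds prior, then observe evidence
# that is twice as likely if the hypothesis is true (0.8) as if it is false (0.4).
posterior = bayes_update(prior=0.5, p_evidence_if_true=0.8, p_evidence_if_false=0.4)
# posterior is about 0.667

# A skeptical prior of 10 percent moves much less on the same evidence.
skeptical_posterior = bayes_update(prior=0.1, p_evidence_if_true=0.8, p_evidence_if_false=0.4)
```

The same evidence yields different conclusions from different starting points, which is why the choice of prior calls for judgment about background and context.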
The Bayesian approach, Silver writes, “encourages us to hold a large number of hypotheses in our head at once, to think about them probabilistically, and to update them frequently when we come across new information that might be more or less consistent with them.” The foxes in Tetlock’s study do this on their own; aggregating the judgments of different forecasters is another way of getting at the same objective. Despite all the problems with economic forecasts, for example, aggregated judgments are more accurate than individual ones. According to Silver, the Survey of Professional Forecasters—a quarterly poll produced by the Federal Reserve Bank of Philadelphia—is “about 20 percent more accurate than the typical individual’s forecast at predicting GDP, 10 percent better at predicting unemployment, and 30 percent better at predicting inflation.”
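A small arithmetic sketch suggests why an aggregate tends to beat the typical individual: errors on opposite sides of the truth partly cancel when forecasts are averaged. The numbers below are invented for illustration (they are not from the Survey of Professional Forecasters), and the helper name `absolute_error` is hypothetical.

```python
from statistics import mean


def absolute_error(forecast, actual):
    """Size of a miss, ignoring its direction."""
    return abs(forecast - actual)


# Invented example: four forecasts of GDP growth (percent) and the eventual outcome.
actual_growth = 2.5
individual_forecasts = [2.0, 3.5, 1.0, 3.0]

# How far off the typical individual is, on average.
typical_individual_error = mean(
    [absolute_error(f, actual_growth) for f in individual_forecasts]
)  # 0.875

# How far off the averaged (consensus) forecast is.
consensus_forecast = mean(individual_forecasts)  # 2.375
consensus_error = absolute_error(consensus_forecast, actual_growth)  # 0.125
```

The averaged forecast can never have a larger absolute error than the average individual error, though it helps only to the extent that the forecasters' mistakes are not all in the same direction.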
When Silver turned to election forecasting, he made use of that insight. Aggregating the results of election polls is a way not just of increasing the sample size but also of aggregating the judgments that went into the polling. Yet not every field can profit from that strategy; aggregating earthquake predictions would not do any good. Fortunately, predicting elections is more like predicting the weather than predicting earthquakes. The basic science seems reasonably well understood. Although elections do not require models that are nearly as complex as those used for the weather, these are two of the fields where forecasting has made substantial progress.
But as Silver’s book highlights, that is not true generally of the forecasting fields, where overconfidence and overreach are the more common pattern. And there lies much of our problem with the uses of forecasting. Many predictions carry more weight than they deserve. In the struggle over the federal budget, we are officially bound to specific—and often arbitrary—numbers produced by economists and statisticians at the Congressional Budget Office. The markets respond nervously to the day’s economic forecasts. One advantage in reading Silver is that if the latest numbers are keeping you up at night, his work may calm your nerves. Your prior judgment may have more value than you realize.
Paul Starr is professor of sociology and public affairs at Princeton University and the author, most recently, of Remedy and Reaction: The Peculiar American Struggle over Health Care Reform (Yale University Press). This article appeared in the December 31, 2012 issue of the magazine under the headline “Tomorrow Today.”