In contrast to less-is-more claims, ignoring information is rarely, if ever, optimal

Author: Jason Collins

Published: December 20, 2018

From the abstract of an interesting paper, Heuristics as Bayesian inference under extreme priors, by Paula Parpart and colleagues:

Simple heuristics are often regarded as tractable decision strategies because they ignore a great deal of information in the input data. One puzzle is why heuristics can outperform full-information models, such as linear regression, which make full use of the available information. These “less-is-more” effects, in which a relatively simpler model outperforms a more complex model, are prevalent throughout cognitive science, and are frequently argued to demonstrate an inherent advantage of simplifying computation or ignoring information. In contrast, we show at the computational level (where algorithmic restrictions are set aside) that it is never optimal to discard information. Through a formal Bayesian analysis, we prove that popular heuristics, such as tallying and take-the-best, are formally equivalent to Bayesian inference under the limit of infinitely strong priors. Varying the strength of the prior yields a continuum of Bayesian models with the heuristics at one end and ordinary regression at the other. Critically, intermediate models perform better across all our simulations, suggesting that down-weighting information with the appropriate prior is preferable to entirely ignoring it. Rather than because of their simplicity, our analyses suggest heuristics perform well because they implement strong priors that approximate the actual structure of the environment.

The following excerpts from the paper (minus references) help give more context to this argument. First, what is meant by a simple heuristic as opposed to a full-information model?

Many real-world prediction problems involve binary classification based on available information, such as predicting whether Germany or England will win a soccer match based on the teams’ statistics. A relatively simple decision procedure would use a rule to combine available information (i.e., cues), such as the teams’ league position, the result of the last game between Germany and England, which team has scored more goals recently, and which team is home versus away. One such decision procedure, the tallying heuristic, simply checks which team is better on each cue and chooses the team that has more cues in its favor, ignoring any possible differences among cues in magnitude or predictive value. … Another algorithm, take-the-best (TTB), would base the decision on the best single cue that differentiates the two options. TTB works by ranking the cues according to their cue validity (i.e., predictive value), then sequentially proceeding from the most valid to least valid until a cue is found that favors one team over the other. Thus TTB terminates at the first discriminative cue, discarding all remaining cues.

In contrast to these heuristic algorithms, a full-information model such as linear regression would make use of all the cues, their magnitudes, their predictive values, and observed covariation among them. For example, league position and number of goals scored are highly correlated, and this correlation influences the weights obtained from a regression model.
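To make the two heuristics concrete, here is a minimal sketch in Python. The coding scheme, function names and example values are my own illustration rather than anything from the paper: each cue is scored +1 if it favours option A, -1 if it favours option B, and 0 if it does not discriminate.

```python
def tallying(cue_signs):
    """Choose the option favoured by more cues, ignoring magnitudes and validities."""
    total = sum(cue_signs)
    if total > 0:
        return "A"
    if total < 0:
        return "B"
    return "tie"  # in practice, guess or fall back to another rule


def take_the_best(cue_signs, validities):
    """Decide on the single most valid cue that discriminates between the options."""
    # Walk through the cues from highest to lowest validity and stop
    # at the first one that favours either option.
    for _, sign in sorted(zip(validities, cue_signs), reverse=True):
        if sign != 0:
            return "A" if sign > 0 else "B"
    return "tie"  # no cue discriminates


# Four illustrative cues: league position, last head-to-head result,
# recent goals scored, home advantage.
cues = [+1, -1, +1, +1]                  # three cues favour A, one favours B
validities = [0.9, 0.8, 0.7, 0.6]        # assumed predictive values

print(tallying(cues))                    # -> A (3 cues vs 1)
print(take_the_best(cues, validities))   # -> A (the most valid cue favours A)
```

The contrast with regression is then easy to see: take-the-best discards every cue after the first discriminating one, tallying discards magnitudes and validities, and a full-information model uses all of these plus the covariation among cues.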

So why might less be more?

Heuristics have a long history of study in cognitive science, where they are often viewed as more psychologically plausible than full-information models, because ignoring data makes the calculation easier and thus may be more compatible with inherent cognitive limitations. This view suggests that heuristics should underperform full-information models, with the loss in performance compensated by reduced computational cost. This prediction is challenged by observations of less-is-more effects, wherein heuristics sometimes outperform full-information models, such as linear regression, in real-world prediction tasks. These findings have been used to argue that ignoring information can actually improve performance, even in the absence of processing limitations. … Gigerenzer and Brighton (2009) conclude, “A less-is-more effect … means that minds would not gain anything from relying on complex strategies, even if direct costs and opportunity costs were zero”.

Less-is-more arguments also arise in other domains of cognitive science, such as in claims that learning is more successful when processing capacity is (at least initially) restricted.

The current explanation for less-is-more effects in the heuristics literature is based on the bias-variance dilemma. … From a statistical perspective, every model, including heuristics, has an inductive bias, which makes it best-suited to certain learning problems. A model’s bias and the training data are responsible for what the model learns. In addition to differing in bias, models can also differ in how sensitive they are to sampling variability in the training data, which is reflected in the variance of the model’s parameters after training (i.e., across different training samples).

A core tool in machine learning and psychology for evaluating the performance of learning models, cross-validation, assesses how well a model can apply what it has learned from past experiences (i.e., the training data) to novel test cases. From a psychological standpoint, a model’s cross-validation performance can be understood as its ability to generalize from past experience to guide future behavior. How well a model classifies test cases in cross-validation is jointly determined by its bias and variance. Higher flexibility can in fact hurt performance because it makes the model more sensitive to the idiosyncrasies of the training sample. This phenomenon, commonly referred to as overfitting, is characterized by high performance on experienced cases from the training sample but poor performance on novel test items. …

Bias and variance tend to trade off with one another such that models with low bias suffer from high variance and vice versa. With small training samples, more flexible (i.e., less biased) models will overfit and can be bested by simpler (i.e., more biased) models such as heuristics. As the size of the training sample increases, variance becomes less influential and the advantage shifts to the complex models.
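This trade-off is easy to reproduce in a small simulation. Everything below is my own illustration (the data-generating process, noise level and sample sizes are assumptions, not the paper's simulations): an equal-weights "tallying-style" rule, which has high bias but zero variance, is pitted against ordinary least squares, which has low bias but high variance when training data are scarce.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cues, n_test, n_reps = 5, 1000, 200

def out_of_sample_mse(n_train):
    """Average test MSE for OLS and an equal-weights rule over many samples."""
    ols_err, tally_err = [], []
    for _ in range(n_reps):
        # True cue weights are all positive but unequal.
        true_w = rng.uniform(0.5, 1.5, n_cues)
        X_tr = rng.normal(size=(n_train, n_cues))
        y_tr = X_tr @ true_w + rng.normal(scale=2.0, size=n_train)
        X_te = rng.normal(size=(n_test, n_cues))
        y_te = X_te @ true_w + rng.normal(scale=2.0, size=n_test)

        w_ols = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]  # low bias, high variance
        w_tally = np.ones(n_cues)           # equal weights: high bias, zero variance

        ols_err.append(np.mean((X_te @ w_ols - y_te) ** 2))
        tally_err.append(np.mean((X_te @ w_tally - y_te) ** 2))
    return np.mean(ols_err), np.mean(tally_err)

for n_train in (10, 50, 500):
    ols, tally = out_of_sample_mse(n_train)
    print(f"n_train={n_train:>3}: OLS MSE={ols:.2f}, equal-weights MSE={tally:.2f}")
```

With around ten training observations the equal-weights rule typically achieves lower test error; by a few hundred, the advantage shifts to regression, reproducing the crossover described above.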

So what is an alternative explanation to the performance of heuristics?

The Bayesian framework offers a different perspective on the bias-variance dilemma. Provided a Bayesian model is correctly specified, it always integrates new data optimally, striking the perfect balance between prior and data. Thus using more information can only improve performance. From the Bayesian standpoint, a less-is-more effect can arise only if a model uses the data incorrectly, for example by weighting it too heavily relative to prior knowledge (e.g., with ordinary linear regression, where there effectively is no prior). In that case, the data might indeed increase estimation variance to the point that ignoring some of the information could improve performance. However, that can never be the best solution. One can always obtain superior predictive performance by using all of the information but tempering it with the appropriate prior.

Heuristics may work well in practice because they correspond to infinitely strong priors that make them oblivious to aspects of the training data, but they will usually be outperformed by a prior of finite strength that leaves room for learning from experience. That is, the strong form of less-is-more, that one can do better with heuristics by throwing out information rather than using it, is false. The optimal solution always uses all relevant information, but it combines that information with the appropriate prior. In contrast, no amount of data can overcome the heuristics’ inductive biases.
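The continuum of models can be sketched with ridge regression, which is Bayesian linear regression with a zero-mean Gaussian prior on the weights: the posterior mean is the ridge estimator (XᵀX + λI)⁻¹Xᵀy, where λ is the noise variance divided by the prior variance. λ = 0 recovers ordinary regression (no prior), while λ → ∞ is an infinitely strong prior, the regime the paper maps formally onto heuristics via a more careful construction than this one. The simulation setup below is again my own assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
n_cues, n_train, n_test, n_reps = 5, 20, 1000, 200

def ridge(X, y, lam):
    """Posterior mean under a zero-mean Gaussian prior; lam = noise var / prior var."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0, 1e6]
errors = {lam: [] for lam in lambdas}

for _ in range(n_reps):
    true_w = rng.uniform(0.5, 1.5, n_cues)
    X_tr = rng.normal(size=(n_train, n_cues))
    y_tr = X_tr @ true_w + rng.normal(scale=2.0, size=n_train)
    X_te = rng.normal(size=(n_test, n_cues))
    y_te = X_te @ true_w + rng.normal(scale=2.0, size=n_test)
    for lam in lambdas:
        w = ridge(X_tr, y_tr, lam)
        errors[lam].append(np.mean((X_te @ w - y_te) ** 2))

for lam in lambdas:
    print(f"lambda={lam:>9}: test MSE={np.mean(errors[lam]):.2f}")
```

Neither extreme wins: λ = 0 overfits the twenty training observations, λ = 10⁶ all but ignores them, and an intermediate λ produces the lowest test error, which is exactly the paper's point that down-weighting information with an appropriate prior beats discarding it.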

So why have heuristics proven to be so useful? According to this Bayesian argument, it is due not to a “computational advantage of simplicity per se, but rather to the fact that simpler models can approximate strong priors that are well-suited to the true structure of the environment.”

An interesting question from this work is whether our minds actually use heuristics because they are efficient approximations of intractable complex models, or whether heuristics merely appear to fit behaviour because they approximate the more complex processes the mind actually uses. The authors write:

Although the current contribution is formal in nature, it nevertheless has implications for psychology. In the psychological literature, heuristics have been repeatedly pitted against full-information algorithms that differentially weight the available information or are sensitive to covariation among cues. The current work indicates that the best-performing model will usually lie between the extremes of ordinary linear regression and fast-and-frugal heuristics, i.e., at a prior of intermediate strength. Between these extremes lie a host of models with different sensitivity to cue-outcome correlations in the environment.

One question for future research is whether heuristics give an accurate characterization of psychological processing, or whether actual psychological processing is more akin to these more complex intermediate models. On the one hand, it could be that implementing the intermediate models is computationally intractable, and thus the brain uses heuristics because they efficiently approximate these more optimal models. This case would coincide with the view from the heuristics-and-biases tradition of heuristics as a tradeoff of accuracy for efficiency. On the other hand, it could be that the brain has tractable means for implementing the intermediate models (i.e., for using all available information but down-weighting it appropriately). This case would be congruent with the view from ecological rationality where the brain’s inferential mechanisms are adapted to the statistical structure of the environment. However, this possibility suggests a reinterpretation of the empirical evidence used to support heuristics: heuristics might fit behavioral data well only because they closely mimic a more sophisticated strategy used by the mind.

There have been various recent approaches looking at the compatibility between psychologically plausible processes and probabilistic models of cognition. These investigations are interlinked with our own, and while most of that work has focused on finding algorithms that approximate Bayesian models, we have taken the opposite approach. This contribution reiterates the importance of applying fundamental machine learning concepts to psychological findings. In doing so, we provide a formal understanding of why heuristics can outperform full-information models by placing all models in a common probabilistic inference framework, where heuristics correspond to extreme priors that will usually be outperformed by intermediate models that use all available information.

The (open access) paper contains a lot more detail, including the maths, and I recommend reading it.