From Eliezer Yudkowsky on Less Wrong (a few years old, but worth revisiting in the light of my recent Gigerenzer v Kahneman and Tversky post):

When a single experiment seems to show that subjects are guilty of some horrifying sinful bias - such as thinking that the proposition “Bill is an accountant who plays jazz” has a higher probability than “Bill is an accountant” - people may try to dismiss (not defy) the experimental data. Most commonly, by questioning whether the subjects interpreted the experimental instructions in some unexpected fashion - perhaps they misunderstood what you meant by “more probable”.

Experiments are not beyond questioning; on the other hand, there should always exist some mountain of evidence which suffices to convince you.

Here is (probably) the single most questioned experiment in the literature of heuristics and biases, which I reproduce here exactly as it appears in Tversky and Kahneman (1982):

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

Please rank the following statements by their probability, using 1 for the most probable and 8 for the least probable:

(5.2)  Linda is a teacher in elementary school. (3.3)  Linda works in a bookstore and takes Yoga classes. (2.1)  Linda is active in the feminist movement. (F) (3.1)  Linda is a psychiatric social worker. (5.4)  Linda is a member of the League of Women Voters. (6.2)  Linda is a bank teller. (T) (6.4)  Linda is an insurance salesperson. (4.1)  Linda is a bank teller and is active in the feminist movement. (T & F)

(The numbers at the start of each line are the mean ranks of each proposition, lower being more probable.)

How do you know that subjects did not interpret “Linda is a bank teller” to mean “Linda is a bank teller and is not active in the feminist movement”? For one thing, dear readers, I offer the observation that most bank tellers, even the ones who participated in anti-nuclear demonstrations in college, are probably not active in the feminist movement. So, even so, Teller should rank above Teller & Feminist.  …  But the researchers did not stop with this observation; instead, in Tversky and Kahneman (1983), they created a between-subjects experiment in which either the conjunction or the two conjuncts were deleted. Thus, in the between-subjects version of the experiment, each subject saw either (T&F), or (T), but not both. With a total of five propositions ranked, the mean rank of (T&F) was 3.3 and the mean rank of (T) was 4.4, N=86. Thus, the fallacy is not due solely to interpreting “Linda is a bank teller” to mean “Linda is a bank teller and not active in the feminist movement.”

Another way of knowing whether subjects have misinterpreted an experiment is to ask the subjects directly. Also in Tversky and Kahneman (1983), a total of 103 medical internists … were given problems like the following:

A 55-year-old woman had pulmonary embolism documented angiographically 10 days after a cholecstectomy. Please rank order the following in terms of the probability that they will be among the conditions experienced by the patient (use 1 for the most likely and 6 for the least likely). Naturally, the patient could experience more than one of these conditions.

• Dyspnea and hemiparesis

• Calf pain

• Pleuritic chest pain

• Syncope and tachycardia

• Hemiparesis

• Hemoptysis

As Tversky and Kahneman note, “The symptoms listed for each problem included one, denoted B, that was judged by our consulting physicians to be nonrepresentative of the patient’s condition, and the conjunction of B with another highly representative symptom denoted A. In the above example of pulmonary embolism (blood clots in the lung), dyspnea (shortness of breath) is a typical symptom, whereas hemiparesis (partial paralysis) is very atypical.”

In indirect tests, the mean ranks of A&B and B respectively were 2.8 and 4.3; in direct tests, they were 2.7 and 4.6. In direct tests, subjects ranked A&B above B between 73% to 100% of the time, with an average of 91%.

The experiment was designed to eliminate, in four ways, the possibility that subjects were interpreting B to mean “only B (and not A)”. First, carefully wording the instructions:  “…the probability that they will be among the conditions experienced by the patient”, plus an explicit reminder, “the patient could experience more than one of these conditions”. Second, by including indirect tests as a comparison. Third, the researchers afterward administered a questionnaire:

In assessing the probability that the patient described has a particular symptom X, did you assume that (check one): X is the only symptom experienced by the patient? X is among the symptoms experienced by the patient?