Most of us in business leadership roles want to base our decisions on sound science. But what if I were to tell you that the majority of published scientific findings (yes, even those in peer-reviewed journals) are flat-out wrong?

How are we to tell the minority of published studies that yield truly valid results, those that should justifiably cause us to take action, from the ones that present misleading results?

Why are the majority of published scientific findings likely to be wrong? 

Problems related to study design and execution can certainly contribute, but many of these flaws are detected and remedied in the peer-review process.

More troublesome is the improper use of statistics. To make matters worse, there is a documented “publication bias” in favor of studies that report “significant” results: studies that report statistical associations are more likely to be submitted for publication, more likely to be published, and likely to be published more quickly than those that do not.

How can statistics be a problem? 

Scientists rely on statistical tests to guide them in judging whether an experimental or observational result reflects some real effect or is merely a chance (random) occurrence; however, misuse of or an over-reliance on statistics can produce confusing and contradictory results and interpretations.

As an example, most scientists will accept a study finding as likely to be valid if a result that strong would turn up by chance less than one time in twenty when no real effect exists.1 To the average layperson, an error rate of less than five percent may sound like a reasonable benchmark; however, in the course of analyzing a single study, scientists often perform dozens and sometimes hundreds of statistical tests. The cumulative effect is that many scientists accept as likely to be valid several, and perhaps dozens of, findings that are in all likelihood just random associations with no underlying cause-and-effect relationship. Compounding matters, they will often obscure from the reader the actual number of statistical tests performed and focus their report on the few findings that achieved statistical significance.
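
To put rough numbers on this cumulative effect, here is a small Python sketch. It rests on two simplifying assumptions that real studies rarely meet exactly: the tests are independent of one another, and no real effect exists in any of them. The function names are illustrative only.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` tests looks 'significant'
    purely by chance.

    Assumes independent tests and no real effect anywhere (a simplification).
    """
    # Each test has a (1 - alpha) chance of correctly showing nothing, so the
    # chance that all of them show nothing is (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


def expected_false_positives(num_tests: int, alpha: float = 0.05) -> float:
    """Average number of chance 'findings' under the same assumptions."""
    return num_tests * alpha


if __name__ == "__main__":
    for n in (1, 20, 100):
        print(
            f"{n:>3} tests: "
            f"{chance_of_false_positive(n):.1%} chance of at least one false positive, "
            f"about {expected_false_positives(n):.0f} expected by chance"
        )
```

Under these assumptions, a study that runs 20 tests has roughly a 64 percent chance of producing at least one "significant" result by chance alone, and a study that runs 100 tests is all but certain to produce one.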

This is not to say that published scientific findings have no value; however, a single scientific study that purports to demonstrate a potential cause-and-effect relationship should be treated skeptically. Personally, I would like to see scientists practice greater humility when reporting their research findings.

So, how can one avoid being misled?

In 1965 the British epidemiologist and statistician Austin Bradford Hill published nine “viewpoints” (subsequently referred to as criteria) for assessing whether a reported association between some factor and a disease could be a causal one. The following lessons have guided me (and many others) through more than 35 years of evaluating chemical exposures in relation to human and/or environmental effects, but they are broadly applicable to nearly every type of scientific research.

Interestingly enough, Hill ascribed little weight to statistics in establishing his criteria. Although they have been frequently discussed and debated for nearly 50 years, they have stood the test of time and are part of the rigorous teaching in many schools of public health globally. They are:

  1. Strength of association.  The stronger the association between the factor and the disease, the more likely it is to reflect a causal relationship. Hill used the example of chimney sweeps, who died of scrotal cancer at rates 200 times those of workers without that exposure.  This is perhaps an extreme example, but as a rule of thumb, if the strength of association (for example, the ratio of disease rates in exposed and unexposed groups) is less than about 2 to 3, it probably deserves more scrutiny.
  2. Consistency (replicability).  To the points made above, a single study reporting a finding should be treated skeptically and it is critical that the finding be independently replicated by other investigators.  Ideally, almost every study should support the association for there to be causation.  Hill used the example of cigarettes as a cause of lung cancer where there were numerous studies, of different design, nearly all demonstrating a strong association.
  3. Specificity.  This is arguably the criterion which has been most debated, since it has subsequently been demonstrated that many chronic diseases can have multiple causes and some exposures cause multiple effects.  Nevertheless, as Hill put it, “if specificity exists we may be able to draw conclusions without hesitation; if it is not apparent, we are not thereby necessarily left sitting irresolutely on the fence.”
  4. Temporality.  The cause should precede the effect in time.  This is the only absolutely essential criterion that must be met, and it seems obvious; however, there are many published studies which employ a type of design whereby exposure and disease are measured at the same time and it is impossible to determine which came first.  For example, many studies which measure chemicals in the blood or in other biological specimens and then relate the levels found to current disease states or other physiological parameters suffer from this particular problem.
  5. Biological gradient (Dose-Response).  The frequency and/or intensity of the biological response should increase with the size of the dose of exposure.  The presence of a dose-response relationship certainly strengthens the possibility of cause and effect.  Conversely, the absence of one should be considered weaker evidence, although it cannot completely rule out an association as the doses tested may have been below the threshold necessary to cause an effect or there may have been gross errors made in measuring exposures.
  6. Plausibility. The effect must have biologic plausibility. There should be a theoretical basis for positing a causal association and it should not violate well-established laws of the universe.  On the other hand, the association being reported may be one that is new to science or medicine and shouldn’t be dismissed too lightly; however, the more novel it is the more we should see evidence of an independent replication of the finding before accepting it as real.
  7. Coherence.  Any inference of cause-and-effect should not seriously conflict with the generally known facts of the natural history and biology of the disease.  An example is the “hygiene hypothesis” as an explanation for some autoimmune disorders and allergies: it is coherent with trends in developed countries toward both fewer childhood infections and an increased prevalence of autoimmune disorders and allergies.
  8. Experiment.  Findings from well-designed experiments (such as controlled laboratory, animal model, or clinical trial studies) should be accorded more weight than findings from studies which do not employ such methods.  In these experiments, subjects are randomly and blindly allocated into exposure and contemporary control groups, and many significant variables are held stable to prevent them from interfering with the results.  Also, if the disease or biological effect can be shown by experiment to be prevented by withdrawal of the exposure, then this strengthens an interpretation of causality.
  9. Analogy.  When something is suspected of causing an effect, then other factors similar or analogous to the supposed cause should also be considered and identified as a possible cause or otherwise eliminated from the investigation.
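
As a rough illustration of the first criterion, strength of association is often summarized as a relative risk: the rate of disease among exposed people divided by the rate among unexposed people. The Python sketch below assumes that measure; the function name and the counts are hypothetical and do not come from any real study.

```python
def relative_risk(
    exposed_cases: int,
    exposed_total: int,
    unexposed_cases: int,
    unexposed_total: int,
) -> float:
    """Ratio of the disease rate in the exposed group to the rate in the
    unexposed group. A value of 1.0 means no difference between groups."""
    if exposed_total <= 0 or unexposed_total <= 0:
        raise ValueError("group sizes must be positive")
    if unexposed_cases <= 0:
        raise ValueError("relative risk is undefined with no unexposed cases")
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


if __name__ == "__main__":
    # Hypothetical counts, for illustration only:
    # 30 cases among 1,000 exposed people, 10 cases among 1,000 unexposed.
    rr = relative_risk(30, 1000, 10, 1000)
    print(f"Relative risk: {rr:.1f}")
```

With these made-up counts the exposed group has about three times the disease rate of the unexposed group, which sits at the upper edge of the range that the rule of thumb in criterion 1 flags for closer scrutiny.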

Hill offered these guidelines as aids to help us make up our minds on the fundamental question: is there any other way of explaining the set of facts before us, is there any other answer equally, or more, likely than cause and effect? Hill also cautioned against waiting for full and compelling evidence of causation before taking action if the circumstances warranted.  “All scientific work is incomplete, whether it be observational or experimental. All scientific work is liable to be upset or modified by advancing knowledge. That does not confer upon us a freedom to ignore the knowledge we already have, or to postpone the action that it appears to demand at a given time.”

The decision to take action must consider the full weight of evidence available and the severity of the consequences.

I hope these lessons are useful to you the next time you read a scientific paper that purports to have discovered some new effect or, more likely, when you read a press release or newspaper article or hear the finding reported on the evening news.  Sutherland et al. (2013) have recently published a much longer list of tips for policy makers who must integrate scientific claims into their decision-making. It makes an excellent complement to this blog, and you might find it an interesting read.

1 To be technically accurate, scientists cannot prove anything right; all we can do is show something is false.  So instead of trying to prove our hypothesis is correct, we try to prove the alternatives are wrong.  This is referred to as testing the null hypothesis.  Scientists typically choose a p-value threshold of 0.05 for judging a particular study finding as statistically significant.  The p-value is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true; it is not the probability that the null hypothesis is correct (Goodman 2008).  So if a finding has a p-value of less than 0.05 we typically reject the null hypothesis in favor of the alternative.
Goodman, S.N. 2008. A dirty dozen: Twelve P-value misconceptions. Seminars in Hematology 45:135-140. doi:10.1053/j.seminhematol.  04.003.

Hill, A.B. 1965. The Environment and Disease: Association or Causation? Proceedings of the Royal Society of Medicine 58 (5): 295–300.

Ioannidis, J.P.A. 2005. Why most published research findings are false. PLoS Medicine 2(August): 0101-0106.

Siegfried, T. 2010. Odds are, it's wrong: Science fails to face the shortcomings of statistics. Science News, March 27, 2010.

Sutherland WJ, Spiegelhalter D, Burgman M. 2013. Policy: Twenty tips for interpreting scientific claims.  Nature 503: 335-337, November 21, 2013.

[Photos 1 and 2 are courtesy of Wikimedia Commons]