Earlier this year, I reviewed a research study claiming that dogs orient themselves to fluctuations in the earth’s magnetic field when defecating. I was asked to re-post the review on Publons, a site which publishes reviews of journal articles. The authors have posted a response there which answers some concerns I expressed in my review, but which also illustrates some of the same misconceptions about statistics and hypothesis testing that I originally discussed. I have responded both here and on the Publons forum.
The summary of the reactions of the media on our paper is very fitting and we agree. The critic of our study is, however, biased and indicates that the author did not read the paper carefully, misinterpreted it in some cases, and, in any case is so “blinded” by statistics that he forgets biology. Statistics is just a helpful mean to prove or disprove observed phenomena. The problem is that statistics can “prove” phenomena and relations which actually do not exist, but it can also “disprove” phenomena which objectively exist. So, not only approaches which ignore proper statistics might be wrong but also uncritical sticking on statistical purity and ignoring real life.
To begin with, I believe the authors and I agree that statistics are easily and commonly misused in science. Unfortunately, this response seems to perpetuate some of the misconceptions I discussed in my original critique about the role of statistics in testing hypotheses.
Statistics never prove or disprove anything. Frameworks such as Hill’s Criteria of Causation and other mechanisms for evaluating the evidence for relationships observed in research studies illustrate that establishing the reality of hypothesized phenomena in nature is a complex business that must rest on a comprehensive evaluation of many different kinds of evidence. It is unfortunate that p-values have become the sine qua non of validating explanations of natural phenomena, at least in medicine (which is the domain I am most familiar with). The work of John Ioannidis and the growing interest in Bayesian statistical methods are examples of the move in medical research to address the problem of improper use of, and overreliance on, frequentist statistical methods.
That said, these methods do have an important role in data analysis, and they contribute significantly to our ability to control for chance and other sources of error in research. The proper role of statistical hypothesis testing is to help assess the likelihood that our findings might be due to chance or confounding variables, which humans are notoriously terrible at recognizing. If we employ these tools improperly, then they cease to fulfill this function and instead they generate a false impression of truth or reliability for results that may easily be artifacts of chance or bias.
The authors accuse me of being “so ‘blinded’ by statistics that he forgets biology.” This is ironic, since their paper uses statistics to “prove” something which a broader consideration of biology, evolution, and other information would suggest is improbable. Even if the statistical methods were perfectly and properly applied, they would not be “proof” of anything, any more than improper use of statistics would be definitive “disproof” of the authors’ hypothesis. While I discussed some concerns about how statistics were used in the paper, my objections were broader than that, which the authors do not appear to acknowledge.
The author of this critic blames us of “data mining”. Well, first we should realize that there is nothing wrong about data mining. This is an approach normally used in current biology and a source of many interesting and important findings. We would like to point out that we have not “played” with statistics in order to find out eventually some “positive” results. And we have definitively not sorted data out. We just tested several hypotheses and always when we rejected one, we returned all the cards (i.e. data) into the game and tested, independently, anew, another hypothesis.
Though I am not a statistician, I believe there is a consensus that while exploratory analysis of data is, of course, appropriate and necessary, the post-hoc application of statistical significance tests to data after patterns in the data have already been observed is incorrect and misleading. This is what the paper appeared to suggest was done, and this would fit the definition of inappropriate data-dredging.
Note also that we performed this search for the best explanation in a single data sample of one dog only, the borzoi Diadem, for which we had most data. When we had found a clue, we tested this final hypothesis in other dogs, now without Diadem.
This was not indicated in the description of the methods provided in the original paper. If the exploratory analysis was done with one data set while the authors remained blind to the data set actually analyzed in the paper, then that would be an appropriate method of data analysis. The subsequent statistically significant results would not, of course, necessarily prove the hypothesis to be true, but they would at least reliably indicate the likelihood that they were due solely to chance effects.
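The difference between testing a hypothesis on the same data used to find it and testing it on independent data can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a reconstruction of the authors' analysis: it assumes no real effect exists for any candidate hypothesis, so every p-value is an independent draw from a uniform distribution on [0, 1]. The function name `simulate` and all of its parameters are hypothetical.

```python
import random


def simulate(n_studies=5000, n_hypotheses=10, alpha=0.05, seed=0):
    """Estimate false-positive rates when no real effect exists.

    Assumption: under a true null hypothesis each p-value is an
    independent Uniform(0, 1) draw. Returns a pair of rates:
    (same-data testing, held-out testing).
    """
    rng = random.Random(seed)
    same_data_hits = 0
    held_out_hits = 0
    for _ in range(n_studies):
        # Exploratory phase: one p-value per candidate hypothesis,
        # all computed on the exploratory data set.
        explore_p = [rng.random() for _ in range(n_hypotheses)]
        best = min(explore_p)  # the most "promising" pattern found

        # Post-hoc approach: report the best pattern using the same data.
        if best < alpha:
            same_data_hits += 1

        # Confirmatory approach: test only the chosen hypothesis on an
        # independent data set, which yields a fresh p-value.
        confirm_p = rng.random()
        if confirm_p < alpha:
            held_out_hits += 1
    return same_data_hits / n_studies, held_out_hits / n_studies


if __name__ == "__main__":
    same, held_out = simulate()
    print(f"same-data false-positive rate: {same:.3f}")
    print(f"held-out false-positive rate:  {held_out:.3f}")
```

With ten candidate hypotheses and a 0.05 threshold, the same-data rate should come out near 0.40, while the held-out rate should stay near the nominal 0.05. That gap is why it matters whether the confirmatory data were truly kept separate from the exploratory search.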
This does not, however, entirely answer the concern that the study began without a defined hypothesis and examined a broad range of behaviors and magnetic variables in order to identify a pattern or relationship. As exploratory, descriptive work this is, of course, completely appropriate. But the authors then use statistical hypothesis testing to support very strong claims to have “proven” a hypothesis not even identified until after the data collection was completed. This seems a questionable way to employ frequentist statistical methods.
Let us illustrate our above arguments about statistics and “real life” on two examples. Most medical diagnoses are done through exclusion or verification of different hypotheses in subsequent steps. Does it mean that when the physician eventually finds that a patient suffers under certain illness, the diagnosis must be considered improbable because the physician has already before tested (and rejected) several other hypotheses?
This analogy is inapplicable. The process of inductive reasoning a clinician engages in to seek a diagnosis in an individual patient is not truly analogous to the process of collecting data and then evaluating it statistically to assess the likelihood that patterns seen in the data are due to chance. Making multiple statistical comparisons, particularly after one has already searched for patterns in the data, invalidates the application of statistical hypothesis testing unless the analysis accounts for those comparisons. The fact that in other contexts, and without the use of such statistical methods, people consider possible explanations and then accept or reject them based on their observations is irrelevant.
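The arithmetic behind the multiple-comparisons concern is simple enough to sketch. The snippet below assumes the tests are independent and that every null hypothesis is true; real analyses rarely satisfy both conditions exactly, so treat the numbers as illustrative. The function names are hypothetical, and the Bonferroni correction is shown only as one common, conservative adjustment, not as the method the paper should have used.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n independent tests,
    each run at significance level alpha, when every null is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    for k in (1, 5, 20):
        rate = family_wise_error_rate(0.05, k)
        threshold = bonferroni_threshold(0.05, k)
        print(f"{k:>2} tests: family-wise rate {rate:.3f}, "
              f"Bonferroni per-test threshold {threshold:.4f}")
```

Under these assumptions, twenty unadjusted comparisons at the 0.05 level carry roughly a 64% chance of producing at least one "significant" result by chance alone.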
Or imagine that we want to test the hypothesis that the healthy human can run one kilometer with an average speed of 3 m/s. We find volunteers all over the country who should organize races and measure the speed. We shall get a huge sample of data, we have an impression that our hypothesis is correct but the large scatter makes the result insignificant. So we try to find out what could be the factors influencing speed. We test the age – and find out that indeed older people are slower than younger ones, so we divide the sample into age categories, but the scatter is still too high, so we test the effect of sex, we find a slight influence, but it still cannot explain the scatter, we test the position of the sun and time of the day, but find no effect, we test the effect of wind, but the wind was weak or it was windless during races, so we find no effect. We are desperate and we visit the places where the races took place – and we find the clue: some races were done downhill (and people ran much faster), some uphill (and people ran much slower), those who ran in flat land ran on average with the speed we expected. So we can now conclude that our hypothesis was correct and moreover we found an effect of the slope on running speed. We publish a paper describing these findings and then you publish a critic arguing that our approach was just data mining and was wrong and hence our observation is worthless and that the slope has no effect on running speed at all. Absurd!
Again, this example simply describes a process for considering and evaluating multiple variables in order to explain an observed outcome, which is not the objection raised to the original paper. If the only hypothesis in a study such as described here was that at least one human being could run this fast, then a single data point would be sufficient proof and statistics would be unnecessary. However, if one is trying to explain differences in the average speed of different groups of people based on the sorts of variables mentioned, the reliability of the conclusions and the appropriateness of the statistical methods used would depend on how the data was collected and analyzed. In any case, nothing about this has any direct relevance to whether or not the data collection and analysis in the original paper was appropriate or justified the authors’ conclusions.
As I said in the original critique, this study raises an interesting possibility: that dogs may adjust their behavior to features of the magnetic field of the earth. The study was clearly a broadly targeted exploration of behavior and various features of the magnetic environment: “we monitored spontaneous alignment in dogs during diverse activities (resting, feeding and excreting) and eventually focused on excreting (defecation and urination incl. marking) as this activity appeared to be most promising with regard to obtaining large sets of data independent of time and space, and at the same time it seems to be least prone to be affected by the surroundings.” It apparently did not start with a specific, clearly defined hypothesis and prediction, so in this sense it seems an interesting exploratory project.
However, with such a broad focus, with mostly post-hoc hypothesis generation, and with a lack of clear controls for a number of possible alternative explanations, the study cannot be viewed as definitive “proof” of the validity of the explanation the authors provide for their observations, though this is what is claimed in the paper: “…for the first time that (a) magnetic sensitivity was proved in dogs, (b) a measurable, predictable behavioral reaction upon natural MF fluctuations could be unambiguously proven in a mammal, and (c) high sensitivity to small changes in polarity, rather than in intensity, of MF was identified as biologically meaningful.”
I agree with the authors that their results are interesting and should be a stimulus for further research, but I do not agree that the results provide the unambiguous proof they claim. As always, replication and research focused on testing specific predictions based on the hypothesis put forward in this report, with efforts to account for alternative explanations of these observations, will be needed to determine whether the authors’ confidence in their findings is justified.