Five Top Statistical Mistakes That Derail Research



We are pleased this month to feature an all-new article by statistics consultant and expert Karen Grace-Martin.  If you don’t need statistics for your research, you can skip this newsletter, unless you want to be very impressed by (or feel sorry for) your colleagues who do need to use statistics in their work.  If you’re waffling between a qualitative and a quantitative dissertation, this article may help you decide.

And if you are just getting started and are overwhelmed by some of the complexities in this article, don’t lose heart.  You will either learn some of it in your advanced stats class, or you can consult with Karen and get some more basic help.  She does NOT assume that you have a high level of expertise before coming to her for help!

Without further ado, let me present Karen’s useful article, which may just save your (academic) life!


– – – – – – – – – – – – – – – – – – – – – – – – – – – –

After nearly 20 years of learning, teaching, consulting, and doing statistics, I’ve come to this conclusion:  statistics classes teach you enough to get started, but real learning happens when you apply statistics to your own data.  Don’t be afraid to make mistakes.  You will make mistakes—that’s how you learn.

There is a special kind of mistake, however, that can derail the entire analysis.  These mistakes are in the four areas that form the basis of any analysis: the research question, the variable measurement, the data, and the design.

1. A Vague Research Question

Recently I worked with a client whose dissertation was on work-family conflict.  Nothing was significant, and her committee suggested she go ahead and write it up that way.  But the lack of results surprised her.  We discovered she had not accounted for interactions between her predictors and two crucial controls: gender and the presence of children at home.  Her personality measures did affect work-family conflict, but differently for men and women, and for people with and without kids.  Including the interactions led to interesting and theoretically meaningful results.

The problem was not really in the analysis.  The problem was in the lack of specificity in the research questions.  She knew these controls were important—she had been sure to measure them in data collection.  Yet they weren’t in the analysis because they weren’t in the research question.
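To make the pattern concrete, here is a small simulated sketch (the numbers and variable names are hypothetical, not from the client’s data).  When a predictor’s effect runs in opposite directions for two groups, a main-effects-only model shows nothing, while adding the interaction recovers the real effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Hypothetical data: a personality score raises conflict for one group
# and lowers it for the other, so the average (main) effect is near zero.
gender = rng.integers(0, 2, n)          # 0/1 group coding (arbitrary)
score = rng.normal(0, 1, n)
conflict = score * (2 * gender - 1) + rng.normal(0, 1, n)

# Model 1: main effects only -- the score coefficient washes out.
X_main = np.column_stack([np.ones(n), score, gender])
b_main, *_ = np.linalg.lstsq(X_main, conflict, rcond=None)

# Model 2: add the score-by-gender interaction -- the effect reappears.
X_int = np.column_stack([np.ones(n), score, gender, score * gender])
b_int, *_ = np.linalg.lstsq(X_int, conflict, rcond=None)

print(b_main[1])   # near 0: the apparent "no result"
print(b_int[3])    # clearly nonzero: the interaction carries the effect
```

The point is not the particular model; it is that the interaction has to be written into the research question, or it will never make it into the design matrix.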

The most useful research questions are written twice—once in theoretical terms and once in terms of the variables that will be used to answer them.  And they are very specific.  They don’t have to be hypotheses—you don’t have to predict the results—but you do have to know exactly what you’re trying to get at with the analysis.

The biggest danger from vague research questions is spending weeks or months analyzing the wrong thing.  At best, this wastes a lot of time.  At worst, you don’t find the results that are really there.

2. Not Incorporating the Level of Measurement of the Variables

I started, years ago, as a psychology researcher.  My statistical training at that time focused on Analysis of Variance, the predominant method in my field.  Regression, used for continuous predictors, was considered unnecessary at best.

But when a continuous predictor occasionally crept into my otherwise categorical designs, I didn’t know how to analyze it.  I had only one choice: I stuffed those data into an ANOVA using median splits.  In other words, I had a good hammer, so I remolded everything I saw into a nail.

Years of statistics training and experience later, I recognize there are good reasons for categorizing continuous variables.  But forcing data into an inappropriate method just because it’s the one you understand is not one of them.  Arbitrary methods of categorizing like median splits are never justified.  (For the record, I have since seen that Hammer Everything Syndrome is not limited to ANOVA or to psychology).  [Editorial note by Gina:  I call it the “If all you have is a hammer, everything looks like a nail” syndrome.]

The irony, and perhaps tragedy, is that arbitrary categorizing throws away information, glossing over real results.  You’re working too hard to find real answers to miss them with the wrong tool.  Make sure the statistical method you use captures all the information in the data.
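A quick simulation shows roughly how much a median split costs (the numbers here are purely illustrative, not from any real study).  A continuous predictor with a genuine linear effect loses a noticeable share of its correlation with the outcome once it is collapsed to “low” versus “high”:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical data: a continuous predictor with a real linear effect.
x = rng.normal(0, 1, n)
y = 0.5 * x + rng.normal(0, 1, n)

# Median split: collapse x to a two-level "low"/"high" variable.
x_split = (x > np.median(x)).astype(float)

r_continuous = np.corrcoef(x, y)[0, 1]
r_split = np.corrcoef(x_split, y)[0, 1]

print(r_continuous, r_split)  # the split version is noticeably weaker
```

The split version is weaker not because the effect changed, but because everyone just above and just below the median got treated as identically “high” or “low.”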

3. Ignoring Issues in the Data

A number of data issues can affect an analysis, but the most common is missing data.  Virtually every data set is missing some data.  And, as you have probably noticed, those missing data wreak havoc. Only when the missing data are few in number and a true random subset of the data will they not affect your analyses.

Sample statistics, like regression coefficients and means, become biased when part of the sample is missing for a reason (i.e., some values or conditions are missing more often than others).  This bias gets worse as more data are missing.

When the missingness is truly random (deer ate half your sapling sample, or survey respondents accidentally skipped a page), there is no bias problem, but power suffers.  The default solution is to drop all cases with any missing data.  Even a few observations missing here and there across many variables can result in half the cases being dropped.  The sample you worked so hard to collect gets demolished.
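Here is a small illustrative sketch of how quickly listwise deletion eats a sample (the missingness rate and dataset size are hypothetical): with ten variables each missing a modest 7% of values at random, only about half the cases survive with complete data.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cases, n_vars = 200, 10

# Hypothetical survey: each value independently missing 7% of the time.
missing = rng.random((n_cases, n_vars)) < 0.07

# Listwise deletion keeps only rows with no missing values at all.
complete = ~missing.any(axis=1)
print(complete.sum(), "of", n_cases, "cases survive")
# Expected survival rate is roughly 0.93 ** 10 -- about half the sample.
```

No single variable looks badly damaged, yet the complete-case sample is devastated, which is exactly why the default feels deceptively safe.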

Other common solutions to missing data, like mean substitution, can make the bias worse.  But two modern techniques, multiple imputation and maximum likelihood, result in essentially no bias and full power…even if half the sample is missing.  They’re new enough, though, that you probably didn’t learn about them in class.

Both of these solutions are now available in all common statistical packages, and they are worth taking time to learn.

4. Failing to Account for All Aspects of the Design

While ignoring the research question, variable measurement, or data issues can get you into trouble, ignoring the design does an especially good job of landing you in deep water.

This is especially true of multilevel designs.  Multilevel designs have some sort of clustering—patients within hospitals, students within classes, responses within subjects.  Repeated measures and panel designs are examples of multilevel designs.

If you have a multilevel design, you must account for it in the analysis.  If you don’t, you’ll have inaccurate p-values.  They could be too high or too low, depending on the data, but you won’t know it; there are no indicators.  This means you’ll either claim results that aren’t there or miss results that are.  Neither is good.
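One standard way to see why clustering matters is the design effect, DEFF = 1 + (m − 1) × ICC, where m is the cluster size and ICC is the intraclass correlation.  The numbers below are purely illustrative, but they show how a seemingly large sample shrinks once clustering is taken into account:

```python
# Hypothetical numbers: 50 classes of 20 students each, with a modest
# intraclass correlation (ICC) of 0.10 among students in the same class.
n_total = 50 * 20
cluster_size = 20
icc = 0.10

# Design effect: how much clustering inflates the variance of estimates
# relative to a simple random sample of the same size.
deff = 1 + (cluster_size - 1) * icc

# Effective sample size: what the clustered sample is really "worth".
effective_n = n_total / deff

print(deff, round(effective_n))  # 1,000 students act like about 345
```

An analysis that treats those 1,000 students as independent observations computes its p-values as if it had nearly three times the information it actually does, which is exactly how the too-small p-values arise.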

If you think you have a multilevel design, brush up on the design issues.  They’re not hard to recognize once you know what to look for.

5. Not Doing an Analysis Plan Before Collecting Data

While complexities in the data can’t always be anticipated, complexities in the research question, the design, and the variables usually can.  Failing to review and make changes to these three while you still can—before collecting data—leaves you with unnecessarily complicated analyses.  Collecting a covariate, rewording a survey question, or making simple design changes can greatly simplify an analysis.

Make sure you know the general statistical method you need before you collect the data.  If you’re not sure, stop.  Do not pass Go.  Do not collect data.  Go figure it out now.  I know it’s hard to put off data collection when you’re ready to go, but it will save you many, many months of frustration later on.

Cultivate a simple habit: plan the data analysis before collecting the data.  This last entry is not so much a statistical error as a matter of statistical practice, but it’s the one that’s easiest to remedy and has the most far-reaching consequences.

Karen Grace-Martin helps academic researchers make sense of statistics so they can analyze data accurately, efficiently, and confidently.  Karen offers online workshops, consulting, and practical resources on data analysis and software.  Get Karen’s free resources and e-newsletter at
