Daily Archives: May 6, 2016


Does Data Cleaning Matter? A Resounding Yes!

Have you ever had to rerun an analysis because you discovered something askew in your data?

Does everyone want your data before you think the data are ready for prime time?

Is the demand for your data greater and more urgent than the time you have to prepare it?

Does data cleaning play second fiddle to data analysis in your shop?

If you are a social or health scientist, chances are you said yes to at least one of these questions. With the pressure for real-time data and results, we got curious about how much data cleaning actually matters. Our parent company, Datacorp, conducted a study to test the impact data cleaning has on analytic results. Our findings have significant implications for anyone who relies on raw, uncleaned data to make decisions.

We conducted a secondary analysis of participant-level survey data from two human services programs to determine the impact data cleaning has on demographics, program outcomes, predictors of program success, and predictors of program retention. We found that data cleaning significantly affected not only the results but also the conclusions drawn from the analysis, and that this impact increased with the complexity of the analysis. The impact: a critical health care program decision could have been made incorrectly.
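To make the idea concrete, here is a minimal sketch of how the same summary statistic can shift once a dataset is cleaned. It is not the study's method or data: the records, the `clean` and `mean_age` helpers, and the 0 to 120 age range are all hypothetical, chosen only to illustrate duplicates, placeholder codes, and blanks.

```python
from statistics import mean

# Hypothetical toy records (participant_id, age); NOT data from the study.
RAW_RECORDS = [
    (1, 34),
    (2, 29),
    (2, 29),    # duplicate submission for participant 2
    (3, 999),   # placeholder code that was entered for a missing age
    (4, 41),
    (5, None),  # age left blank
]


def clean(records, min_age=0, max_age=120):
    """Drop repeat participants and records with missing or implausible ages.

    The age bounds are an assumption made for this illustration.
    """
    seen = set()
    cleaned = []
    for participant_id, age in records:
        if participant_id in seen:
            continue  # keep only the first record per participant
        if age is None or not (min_age <= age <= max_age):
            continue  # skip blanks and out-of-range values
        seen.add(participant_id)
        cleaned.append((participant_id, age))
    return cleaned


def mean_age(records):
    """Mean age over all records that have an age recorded."""
    return mean(age for _, age in records if age is not None)


raw_mean = mean_age(RAW_RECORDS)
clean_mean = mean_age(clean(RAW_RECORDS))
print(f"raw mean age:     {raw_mean:.1f}")
print(f"cleaned mean age: {clean_mean:.1f}")
```

Even in this tiny made-up example, one placeholder value and one duplicate are enough to change a basic demographic summary. Analyses built on top of such summaries inherit the same distortion.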

Now, if you’re a Big Data scientist, you may scoff at this finding. After all, it’s only your tax dollars being wasted. What difference does it make in your world if Amazon recommends one bad title out of seven or eight? But what if Amazon misses an entire demographic? Now, there’s cause for concern.

Data quality is critical for data-driven decisions, whether you are in academia or BI. Analysis of dirty or partially cleaned data can lead to ill-informed conclusions. Who wants to be responsible for that?

Interested readers are encouraged to contact the author.