Yearly Archives: 2016


Does Data Cleaning Matter? A Resounding Yes!

Dirty Data Wrong PictureHave you ever had to rerun an analysis because you discovered something askew in your data?

Does everyone want your data before you think the data are ready for prime time?

Is the demand for your data greater and more urgent than the time you have to prepare it?

Does data cleaning play second fiddle to data analysis in your shop?

Chances are if you are a social or health scientist, you may have said yes to at least one of these questions. With the pressure for real-time data and results we got curious about the impact data cleaning has on results. Our parent company, Datacorp, conducted a study to test the impact data cleaning has on analytic results. Our findings have significant implications for anyone who relies on raw, uncleaned data to make decisions.

We conducted secondary analysis of participant-level survey data for two human services programs to determine the impact data cleaning has on demographics, program outcomes, predictors of program success, and predictors of program retention. We found that data cleaning significantly affected not only the results but conclusions drawn from the analysis, and this impact increased with the complexity of the analysis. The impact: A critical health care program decision could have been made incorrectly.

Now, if you’re a Big Data scientist, you may scoff at this finding. After all, it’s only your tax dollars being wasted.  What difference does it make in your world if Amazon recommends one bad title out of seven or eight? But what if Amazon misses an entire demographic? Now, there’s cause for concern.

Data quality is critical for data-driven decisions whether you are in academia or BI. Analysis of dirty data or partially cleaned data can lead to ill-informed conclusions.  Who wants to be responsible for that?

Interested readers are encouraged to contact the author.


Can Time Really Go Backwards? Perhaps With Unclean Data!

Melting ClockLet’s have a look . . .

Fourteen-year-old widows? Women aged 15 to 19 with 12 children? These are a couple of strange statistics found in early U.S. Census data (Kruskal, 1981). Cleaning longitudinal data we have even found time can go backwards. Or more accurately, survey dates can be out of order and make it look like time has gone backwards. Any calculation that relies on two dates to determine the number of days in between will yield inaccurate reporting and data loss if the dates aren’t “clean”. Simple errors—like an error in a person’s age—can set additional errors into motion and cause cascade effects that seriously impact data quality when other calculations are based on a variable like age.

For example, what if you wanted to know a person’s age when they first . . .

• Smoked cigarettes?
• Drank alcohol?
• Engaged in sexual behavior?
• Had contact with the police?

In this example if the ages reported on these questions are higher than the reported current age, all of these data would be suspect. Here are a few common data errors you can check for using simple frequencies and cross-tabs.  They don’t require any heavy lifting from the programming department!

Out of Range Values: Values that fall outside the range of possible response options (e.g., a response of 6 on a scale from 1 to 5).

Implausible Values: Values that have a ceiling beyond which data are impossible or highly unlikely (e.g., age is reported as 125 years old).

Inconsistencies/Impossible Combinations of Values: The combination of two or more values is logically impossible (e.g., someone reports they never drank alcohol in their lifetime then reports past 30-day alcohol use).

Missing Data: No data are entered where data are required (e.g., empty cells for key administrative variables, resulting in data loss).

Formatting Errors: Data that do not adhere to format requirements (e.g., non-standard variable names and labels, data formats, unique identifiers).

Duplicate Data: Identical records submitted.

As data managers, we understand people need the “right” data, right away, and it is far better to prevent errors in the first place. Indeed, with good planning, data entry systems can be designed to prevent most errors—especially the common ones—from making it into a dataset (e.g., rejecting duplicate entries and out-of-range values). For more complicated errors (e.g., inconsistent responses between two survey questions, inconsistencies across time points), someone with a good understanding of the content and expertise in data management can play a critical role in ensuring your data are protected.

References

Kruskal, W. (1981). Statistics in society: Problems unsolved and unformulated. Journal of the American Statistical Association, 76, 505-515.


Airing Out Your Dirty Laundry: How Dirty Are Your Data?

It’s a dirty little secret that can cost you time, money, lost opportunity, and your reputation: dirty data. Dirty data include missing, invalid, inaccurate, or inconsistent data. According to Gartner, a leading IT research and advisory company, over 25% of critical data in top companies are dirty, and businesses often underestimate the size of the problem. But it’s not just businesses that contend with the consequences of dirty data. Just as poor data quality hurts the business sector, poor data quality hurts researchers, too. It can call into question the scientific rigor of a study result and misguide decision-makers who rely on these data.

Given the scope of this problem and its consequences, you may want to ask yourself: When was the last time I performed a data quality assessment? A data quality assessment involves a data audit to determine data strengths and weaknesses. The results and recommended actions to improve data quality are documented.

What should you look for when you’re auditing your data? A hard and fast list of “must check” items will vary depending on the nature of the project and the data. However, there are a few generic checks analysts and data managers can conduct that require a little front-end effort in exchange for hours of lost labor if these problems occur in the data file. The rest of this post will focus on a few pre-analytic file integrity checks.

It is extremely common for analysts and data managers to use data “extracts” in their work. Extracts typically come from data collection systems that can be either internal or external to the analyst’s organization. If you are importing a data extract, there are some very simple checks you can do to make sure what you imported is what you think it is. Here are three quick tips!

The Four Corners Check
This first check might make you laugh out loud. Look at your data. What? Yes, look at your data! Check the four corners. Are the values in the corners of your imported file the same as the values in the corner of the file before you imported it? Simple, huh? You’d be surprised at the number of programmers who don’t want to look at their data. They’re programmers. Looking at the data isn’t programming. But it works! And you’d be surprised how often data can be “corrupted” or “offset” in the extraction or import process—if you haven’t already experienced it.

File Size Check
Now, that you’ve checked the corners take a look at the rows and columns. Look at the data, again? Yes. Did your program import the same number of rows and columns as you expected? Of course, you can always run a listing of the contents to determine this, but we STRONGLY encourage you to become familiar with your data. It is amazing how profitable and important this intimate relationship with your data will be to you in the long run.

Data Type Check
Are you dealing primarily with numeric data? You may wish to make sure that your data file didn’t somehow get corrupted and come in as all characters or strings. It happens . . .

Stay tuned for more on this important topic!