Airing Out Your Dirty Laundry: How Dirty Are Your Data?


It’s a dirty little secret that can cost you time, money, lost opportunity, and your reputation: dirty data. Dirty data include missing, invalid, inaccurate, or inconsistent data. According to Gartner, a leading IT research and advisory company, over 25% of critical data in top companies are dirty, and businesses often underestimate the size of the problem. But it’s not just businesses that contend with the consequences of dirty data. Just as poor data quality hurts the business sector, poor data quality hurts researchers, too. It can call into question the scientific rigor of a study result and misguide decision-makers who rely on these data.

Given the scope of this problem and its consequences, you may want to ask yourself: When was the last time I performed a data quality assessment? A data quality assessment is an audit of your data that identifies its strengths and weaknesses, then documents the results along with recommended actions to improve data quality.

What should you look for when you’re auditing your data? There is no hard and fast list of “must check” items; it will vary depending on the nature of the project and the data. However, there are a few generic checks analysts and data managers can conduct that take a little effort up front and can save the hours of labor you would otherwise lose if these problems go unnoticed in the data file. The rest of this post will focus on a few pre-analytic file integrity checks.

It is extremely common for analysts and data managers to use data “extracts” in their work. Extracts typically come from data collection systems that can be either internal or external to the analyst’s organization. If you are importing a data extract, there are some very simple checks you can do to make sure what you imported is what you think it is. Here are three quick tips!

The Four Corners Check
This first check might make you laugh out loud. Look at your data. What? Yes, look at your data! Check the four corners. Are the values in the corners of your imported file the same as the values in the corners of the file before you imported it? Simple, huh? You’d be surprised at the number of programmers who don’t want to look at their data. They’re programmers. Looking at the data isn’t programming. But it works! And if you haven’t already experienced it, you’d be surprised how often data can be “corrupted” or “offset” in the extraction or import process.
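If you would rather let a script do the looking, here is a minimal Python sketch of the four corners check. It assumes the extract is a comma-separated file with one header row; the sample data, the four_corners helper, and the simulated "offset" import are all made up for illustration and are not taken from any particular tool.

```python
import csv
import io


def four_corners(rows):
    """Return the (top-left, top-right, bottom-left, bottom-right) values."""
    if not rows or not rows[0] or not rows[-1]:
        raise ValueError("no data rows to inspect")
    first, last = rows[0], rows[-1]
    return (first[0], first[-1], last[0], last[-1])


# Hypothetical extract, inlined so the sketch runs on its own.
SOURCE_TEXT = "id,age,score\n101,34,88.5\n102,41,92.0\n103,29,79.5\n"

# Data rows of the file before import (header skipped).
source_rows = list(csv.reader(io.StringIO(SOURCE_TEXT)))[1:]

# Simulate a bad import that shifted every value one column to the right.
offset_rows = [[""] + row[:-1] for row in source_rows]

corners_source = four_corners(source_rows)
corners_offset = four_corners(offset_rows)

print("source corners:  ", corners_source)
print("imported corners:", corners_offset)
print("corners match:   ", corners_source == corners_offset)
```

In practice you would build source_rows from the original file and the second set of rows from whatever your import step produced. A mismatch in any corner is your cue to open both files and look.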

File Size Check
Now that you’ve checked the corners, take a look at the rows and columns. Look at the data, again? Yes. Did your program import the number of rows and columns you expected? Of course, you can always run a listing of the contents to determine this, but we STRONGLY encourage you to become familiar with your data. It is amazing how profitable and important this intimate relationship with your data will be to you in the long run.
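The row and column count can be sketched the same way. The example below again assumes a comma-separated extract with one header row and no line breaks inside fields; table_shape, the sample text, and EXPECTED_SHAPE (the counts you were told to expect) are hypothetical names used only for this illustration.

```python
import csv
import io


def table_shape(text):
    """Return (n_data_rows, n_columns, ragged_records) for CSV text.

    ragged_records lists the 1-based record numbers (header = 1) whose
    width differs from the header's width.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return (0, 0, [])
    header, data = rows[0], rows[1:]
    ragged = [
        number
        for number, row in enumerate(data, start=2)
        if len(row) != len(header)
    ]
    return (len(data), len(header), ragged)


# Hypothetical extract in which the second data row lost a value.
EXTRACT_TEXT = "id,age,score\n101,34,88.5\n102,41\n103,29,79.5\n"

# Assumed counts, e.g. from the data provider's documentation.
EXPECTED_SHAPE = (3, 3)

n_rows, n_cols, ragged_records = table_shape(EXTRACT_TEXT)

print("rows, columns:", (n_rows, n_cols))
print("matches expected shape:", (n_rows, n_cols) == EXPECTED_SHAPE)
print("records with the wrong number of values:", ragged_records)
```

Notice that the overall counts can match while an individual record is still short, which is why the sketch reports ragged records separately.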

Data Type Check
Are you dealing primarily with numeric data? You may wish to make sure that your data file didn’t somehow get corrupted and come in as all characters or strings. It happens . . .
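One way to sketch this check in Python: the standard csv module reads every value as a string, so the question becomes whether the values in the columns you expect to be numeric can actually be parsed as numbers. The sample data, the column names, and the non_numeric_counts helper are assumptions made up for this example.

```python
import csv
import io


def non_numeric_counts(text, numeric_columns):
    """Count, per column, the values that cannot be parsed as a number.

    Blank and missing values are counted as non-numeric.
    """
    counts = {name: 0 for name in numeric_columns}
    for record in csv.DictReader(io.StringIO(text)):
        for name in numeric_columns:
            value = (record.get(name) or "").strip()
            try:
                float(value)
            except ValueError:
                counts[name] += 1
    return counts


# Hypothetical extract: one "N/A" and one decimal comma slipped in.
EXTRACT_TEXT = 'id,age,score\n101,34,88.5\n102,N/A,92.0\n103,29,"79,5"\n'

bad_values = non_numeric_counts(EXTRACT_TEXT, ["age", "score"])

for column, count in bad_values.items():
    print(column, "has", count, "non-numeric value(s)")
```

If a column you expected to be numeric shows a count equal to the number of rows, the whole column probably came in as text. A handful of bad values usually points to missing-value codes or formatting differences instead.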

Stay tuned for more on this important topic!

We'd love to hear what you think...