April 26, 2016

Can Time Really Go Backwards? Perhaps With Unclean Data!

Posted in Data Management / Data Quality / Data Science / Super Simple Solutions by P. Allison Minugh, Ph.D.

Let’s have a look . . .

Fourteen-year-old widows? Women aged 15 to 19 with 12 children? These are a couple of the strange statistics found in early U.S. Census data (Kruskal, 1981). While cleaning longitudinal data, we have even found that time can go backwards. Or, more accurately, survey dates can be out of order and make it look like time has gone backwards. Any calculation that relies on two dates to determine the number of days in between will yield inaccurate reporting and data loss if the dates aren’t “clean”. Simple errors, like an error in a person’s age, can set additional errors in motion and cause cascade effects that seriously impact data quality when other calculations are based on a variable like age.
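To make the date problem concrete, here is a minimal sketch of an out-of-order check. The participant IDs, dates, and the `days_between` helper are all made up for illustration; a real dataset would supply its own fields.

```python
from datetime import date

# Hypothetical records: (participant_id, baseline_date, follow_up_date)
surveys = [
    ("P01", date(2015, 1, 10), date(2015, 7, 12)),
    ("P02", date(2015, 3, 5), date(2015, 2, 1)),  # follow-up dated before baseline
    ("P03", date(2015, 4, 20), date(2015, 10, 18)),
]


def days_between(start, end):
    """Return the number of days from start to end (negative if out of order)."""
    return (end - start).days


# Flag any participant whose follow-up survey is dated before the baseline.
out_of_order = [
    (pid, days_between(baseline, follow_up))
    for pid, baseline, follow_up in surveys
    if days_between(baseline, follow_up) < 0
]

print(out_of_order)
```

A negative day count is the "time going backwards" symptom: the arithmetic still runs, so the error only shows up if you look for it.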

For example, what if you wanted to know a person’s age when they first . . .

• Smoked cigarettes?
• Drank alcohol?
• Engaged in sexual behavior?
• Had contact with the police?

In this example, if the ages reported on these questions are higher than the reported current age, all of these data would be suspect. Here are a few common data errors you can check for using simple frequencies and cross-tabs. They don’t require any heavy lifting from the programming department!

Out of Range Values: Values that fall outside the range of possible response options (e.g., a response of 6 on a scale from 1 to 5).

Implausible Values: Values beyond a ceiling past which the data are impossible or highly unlikely (e.g., age is reported as 125 years old).

Inconsistencies/Impossible Combinations of Values: The combination of two or more values is logically impossible (e.g., someone reports they never drank alcohol in their lifetime then reports past 30-day alcohol use).

Missing Data: No data are entered where data are required (e.g., empty cells for key administrative variables, resulting in data loss).

Formatting Errors: Data that do not adhere to format requirements (e.g., non-standard variable names and labels, data formats, unique identifiers).

Duplicate Data: Identical records submitted more than once.
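Several of the checks above can be sketched with nothing more than counting. The example below is one possible approach, not a prescribed method: the records and the field names (`mood`, `ever_drank`, `drank_30d`) are hypothetical, and `mood` is assumed to be a 1-to-5 scale.

```python
from collections import Counter

# Hypothetical survey records; field names are illustrative only.
records = [
    {"id": "P01", "mood": 3, "ever_drank": "yes", "drank_30d": "yes"},
    {"id": "P02", "mood": 6, "ever_drank": "no", "drank_30d": "yes"},
    {"id": "P03", "mood": 5, "ever_drank": "no", "drank_30d": "no"},
    {"id": "P04", "mood": None, "ever_drank": "yes", "drank_30d": "no"},
    {"id": "P01", "mood": 3, "ever_drank": "yes", "drank_30d": "yes"},
]

# Frequency table: each distinct value of one variable and how often it occurs.
mood_freq = Counter(r["mood"] for r in records)
out_of_range = sorted(
    v for v in mood_freq if v is not None and v not in range(1, 6)
)
missing = mood_freq[None]  # records with no mood value entered

# Cross-tab: counts for each combination of two variables.
crosstab = Counter((r["ever_drank"], r["drank_30d"]) for r in records)
impossible = crosstab[("no", "yes")]  # never drank, yet drank in the past 30 days

# Duplicates: identical records that appear more than once.
record_freq = Counter(tuple(sorted(r.items())) for r in records)
duplicate_ids = sorted(
    {dict(rec)["id"] for rec, n in record_freq.items() if n > 1}
)

print("Out-of-range mood values:", out_of_range)
print("Missing mood values:", missing)
print("Impossible alcohol combinations:", impossible)
print("Duplicated record IDs:", duplicate_ids)
```

The same counts are what a statistics package reports when you ask for frequencies and cross-tabs; the point is that each error type shows up as a cell that should be empty but isn't.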

As data managers, we understand people need the “right” data, right away, and it is far better to prevent errors in the first place. Indeed, with good planning, data entry systems can be designed to prevent most errors—especially the common ones—from making it into a dataset (e.g., rejecting duplicate entries and out-of-range values). For more complicated errors (e.g., inconsistent responses between two survey questions, inconsistencies across time points), someone with a good understanding of the content and expertise in data management can play a critical role in ensuring your data are protected.
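As a rough illustration of prevention at the point of entry, the sketch below rejects a record before it reaches the dataset. The `EntryValidator` class, its field names, and the age ceiling of 110 are assumptions made for this example, not features of any particular data entry system.

```python
class EntryValidator:
    """Hypothetical entry-time checks; the rules and limits are assumptions."""

    MAX_PLAUSIBLE_AGE = 110  # assumed ceiling; adjust to the study population

    def __init__(self):
        self.seen_ids = set()

    def check(self, record):
        """Return a list of problems; an empty list means the record is accepted."""
        problems = []
        if record["id"] in self.seen_ids:
            problems.append("duplicate id")
        if not 0 <= record["age"] <= self.MAX_PLAUSIBLE_AGE:
            problems.append("age out of range")
        first = record.get("age_first_drank")
        if first is not None and first > record["age"]:
            problems.append("age at first drink exceeds current age")
        if not problems:
            # Only accepted records count toward duplicate detection.
            self.seen_ids.add(record["id"])
        return problems


validator = EntryValidator()
r1 = validator.check({"id": "P01", "age": 16, "age_first_drank": 14})
r2 = validator.check({"id": "P01", "age": 16, "age_first_drank": 14})
r3 = validator.check({"id": "P02", "age": 125, "age_first_drank": None})
r4 = validator.check({"id": "P03", "age": 15, "age_first_drank": 17})

for result in (r1, r2, r3, r4):
    print(result or "accepted")
```

Simple rules like these catch the common errors automatically; the cross-question and cross-time inconsistencies still need someone who knows the content to decide what counts as a problem.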

References

Kruskal, W. (1981). Statistics in society: Problems unsolved and unformulated. Journal of the American Statistical Association, 76, 505-515.