Technology has made it easy to analyze data. However, we have paid inadequate attention to building automation into data analysis software that flags potential problems with the data itself. For example, I was recently exploring how interviewer-rated political knowledge varied by respondent’s level of education within each year, using the ANES cumulative file. It was only when I plotted the confidence bounds (not earlier) that I found that in 2004 the 7-category education variable (VCF0140a) had fewer than 7 levels, a highly unlikely scenario. To verify, I checked the number of unique levels of education in 2004, and indeed there were only 5.
unique(nes$vcf0140a[nes$vcf0004=="2004"])
[1] 6 5 2 3 1
The variable from which the 7-category variable is ostensibly constructed (V043254) has 8 levels in 2004. Since the plot looks reasonable for 2004, the problem was more likely (unwarranted) collapsing of adjacent categories than something more irresponsible, such as switching the order of categories. Tallying raw counts revealed that categories 6 and 7, 0 and 1, and 4 and 5 had been collapsed.
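One way to do this kind of tally is to cross-tabulate the source variable against the recoded one. The sketch below uses made-up toy vectors, not the ANES data: `raw`, `recode_map`, and `recoded` are hypothetical names, and the mapping of merged categories to recoded values is an assumption chosen only to mimic the pattern described above. With the real files, the two variables would first have to be matched by respondent.

```r
# Toy data (made up): a hypothetical 8-level source variable coded 0-7,
# and a recode that merges 0/1, 4/5 and 6/7. The target values of the
# merged categories are an assumption for illustration.
raw <- rep(0:7, times = 3)
recode_map <- c("0" = 1, "1" = 1, "2" = 2, "3" = 3,
                "4" = 5, "5" = 5, "6" = 6, "7" = 6)
recoded <- unname(recode_map[as.character(raw)])

# Cross-tabulate source (rows) against recode (columns).
xtab <- table(raw = raw, recoded = recoded)

# A recoded value that draws counts from more than one source value
# is a collapsed category.
collapsed <- colSums(xtab > 0) > 1

print(xtab)
print(names(collapsed)[collapsed])
```

On the toy data, the second print should single out the recoded values that absorb two source categories each.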
On to the point about developing software that automatically flags potential problems. It would be nice if the software flagged a differing number of levels of the same variable across years. However, this suggestion is piecemeal, and more careful thinking ought to be brought to bear on the design issues.
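As a rough illustration of the piecemeal check, here is a minimal sketch in R. It assumes a data frame with one year column and one categorical variable; `nes_toy` and `flag_level_counts` are hypothetical names, and the data are made up rather than taken from the ANES file. It treats the most common number of levels across years as "typical" (ties go to the smaller count) and flags any year that deviates.

```r
# Toy data (made up) standing in for a cumulative file:
# 2000 and 2002 have all 7 levels; 2004 has only 5.
nes_toy <- data.frame(
  year = rep(c("2000", "2002", "2004"), each = 7),
  educ = c(1:7, 1:7, c(1, 2, 3, 5, 6, 6, 5))
)

# Count distinct non-missing levels of x within each year and flag
# years whose count differs from the most common count.
flag_level_counts <- function(x, year) {
  counts <- tapply(x, year, function(v) length(unique(v[!is.na(v)])))
  n_levels <- as.integer(counts)
  typical <- as.integer(names(which.max(table(n_levels))))
  data.frame(
    year = names(counts),
    n_levels = n_levels,
    flagged = n_levels != typical,
    stringsAsFactors = FALSE
  )
}

flags <- flag_level_counts(nes_toy$educ, nes_toy$year)
print(flags[flags$flagged, ])
```

A check like this would miss a recode that preserves the number of levels while changing their meaning, which is part of why the suggestion is only piecemeal.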