Analysis of survey data is hard to automate because of the immense variability across survey instruments: different variables, coded differently, and named in ways that often defy even the most fecund imagination. What often replaces complete automation is ad-hoc automation: quickly coded functions, e.g., for recoding a variable to lie within a particular range, applied by intelligent people frustrated by the lack of complete automation and bored by the repetitiveness of the task. Ad-hoc automation attracts mistakes, as functions are often coded without rigor, and useful alerts and warnings are usually missing.
One way to reduce mistakes is to prevent them from happening. Carefully coded functions with robust error checking and handling, alerts, and passive verbose outputs that account for our biases and bounded attention can reduce mistakes. The functions we use most frequently typically deserve the most attention.
Let’s use the example of recoding a variable to lie between 0 and 1 in R to illustrate how to code a function. Some things to consider:
- Data type: Is the variable numeric, ordinal, or categorical? Let’s say we want to constrain our function to handle only numeric variables. Some numeric variables may be coded as ‘character.’ We may want to seamlessly deal with these issues, and possibly issue warnings (or passive outputs) when improper data types are used.
- Range: The range that the variable takes in the data may not span its entire domain. We want to account for that, perhaps by printing the range the variable takes in the data and by allowing the user to input the true range.
- Missing Values: A variety of functions we may rely on when recoding our variable fail (quietly) when confronted with missing values, for example, range(x), which returns NA unless na.rm=TRUE is set. We may want to alert the user to the issue but still handle missing values seamlessly.
- Showing the data: A user may not see the actual data, so we may want to show the user some of the data by default. Efficient summaries of the data (fivenum, mean, median, etc.) or displaying a few initial items may be useful.
A function that addresses some of these issues:
zero1 <- function(x, minx = NA, maxx = NA) {
  # Coerce character input to numeric, alerting the user to the conversion
  if (typeof(x) == 'character') {
    warning("x is of type character; coercing to numeric.")
    x <- as.numeric(x)
  }
  # Stop unless x is a double or can be transformed into one
  stopifnot(identical(typeof(as.numeric(x)), 'double'))
  print(head(x)) # display the first few items
  print(paste("Range:", paste(range(x, na.rm = TRUE), collapse = " "))) # range the variable takes in the data
  # Rescale using the user-supplied range when both bounds are given;
  # otherwise fall back to the range observed in the data
  if (!is.na(minx) && !is.na(maxx)) {
    res <- (x - minx)/(maxx - minx)
  } else {
    res <- (x - min(x, na.rm = TRUE))/(max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
  }
  res
}
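For instance, with a made-up input vector:

# Recode using the range observed in the data (1 to 7 here)
zero1(c(1, 4, NA, 7)) # returns 0.0, 0.5, NA, 1.0
# Recode against the variable's true range instead
zero1(c(1, 4, NA, 7), minx = 0, maxx = 10) # returns 0.1, 0.4, NA, 0.7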
These tips also apply to canned functions available in R (and to those writing them) and to functions in other statistical packages that do not normally display alerts or other secondary information that may reduce mistakes. One can always build on canned functions. For instance, the recode function (car package) can be wrapped so that it passively displays the correlation between the recoded variable and the original variable by default.
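A minimal sketch of such a wrapper (the name recode_verbose is made up; the correlation is printed only when both versions are numeric):

library(car)

recode_verbose <- function(x, recodes, ...) {
  res <- car::recode(x, recodes, ...)
  # Passively display how closely the recoded variable tracks the original
  if (is.numeric(x) && is.numeric(res)) {
    print(paste("Correlation with original:",
                round(cor(x, res, use = "pairwise.complete.obs"), 3)))
  }
  res
}

# Example: collapse a 1-5 scale to 0/1
# recode_verbose(x, "1:3=0; 4:5=1")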
In addition to writing better functions, one may also want to check the data post hoc. But a caveat about post hoc checks: they are only good at detecting aberrations among the variables you test, and they are costly.
Using prior knowledge:
- Identify beforehand how some variables relate to each other. For example, education is typically correlated with political knowledge, and race with partisan preferences. Test these hypotheses. In some cases, these checks can also be diagnostic of sampling biases.
- Over the course of an experiment, you may have hypotheses about how variables change across time. For example, constraint across attitude indices typically increases over a treatment designed to produce learning. Test these priors.
- Characteristics of the coded variable: If using multiple datasets, check whether the number of levels of a categorical variable is the same across datasets. If not, investigate. Cross-tabulations across merged data are a quick way to diagnose problems, which can range from varying codes for missing data to missing levels (see the sketch after this list).
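A minimal sketch of such checks (the data frames and variable names are hypothetical):

# 1. Test a known association, e.g., education and political knowledge
cor(df$education, df$pol_knowledge, use = "pairwise.complete.obs")

# 2. Compare the levels of a categorical variable across two datasets
setdiff(levels(factor(df1$party_id)), levels(factor(df2$party_id)))

# 3. Cross-tabulate a variable against its source in merged data to spot
#    varying missing-data codes or missing levels
table(merged$party_id, merged$source, useNA = "ifany")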