Faites Attention! Dealing with Inattentive and Insincere Respondents in Experiments

11 Jul

Respondents who don’t pay attention or respond insincerely are in vogue (see the second half of the note). But how do you deal with such respondents in an experiment?

To set the context, a toy example. Say that you are running an experiment. And say that 10% of the respondents, in a rush to complete the survey and get the payout, don’t read the survey question that measures the dependent variable and respond randomly to it. In such cases, the treatment effect among the 10% will be centered around 0. And including the 10% would attenuate the Average Treatment Effect (ATE).

More formally, in the subject pool, there is an ATE that is E[Y(1)] – E[Y(0)].  You randomly assign folks, and under usual conditions, they render a random sample of Y(1) or Y(0), which in expectation retrieves the ATE.  But when there is pure guessing, the guess by subject i is not centered around Y_i(1) in the treatment group or Y_i(0) in the control group.  Instead, it is centered on some other value that is altogether unresponsive to treatment. 
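The toy example can be checked with a quick simulation. This is only a sketch under stated assumptions: the function name `simulate_ate`, the unit-variance normal outcomes, a constant treatment effect of 1.0 among attentive respondents, and random responses centered on 0 are all illustrative choices, not anything taken from a real study. With a share p of random responders, the difference in means should land near (1 - p) times the effect among the attentive.

```python
import random


def simulate_ate(n=200_000, true_effect=1.0, share_inattentive=0.10, seed=0):
    """Difference in means when a share of respondents answers at random.

    All parameter values are hypothetical and chosen for illustration.
    """
    rng = random.Random(seed)
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for _ in range(n):
        treated = rng.randint(0, 1)  # random assignment
        if rng.random() < share_inattentive:
            # Inattentive respondent: the answer ignores treatment entirely.
            y = rng.gauss(0.0, 1.0)
        else:
            # Attentive respondent: the answer shifts by the treatment effect.
            y = true_effect * treated + rng.gauss(0.0, 1.0)
        sums[treated] += y
        counts[treated] += 1
    return sums[1] / counts[1] - sums[0] / counts[0]


estimate = simulate_ate()
expected = (1 - 0.10) * 1.0  # attentive share times the effect among the attentive
print(round(estimate, 2), expected)
```

Raising `share_inattentive` pulls the estimate proportionally closer to zero, which is the attenuation described above.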

Now that we understand the consequences of inattention, how do we deal with it?

We could deal with inattentive responding under compliance, but it is useful to separate compliance with the treatment protocol, which can be just picking up the phone, from the attention or sincerity with which the respondent responds to the dependent variables. On a survey experiment, compliance plausibly covers both adequately, but in cases where treatment and measurement are de-coupled, e.g., happen at different times, it is vital to separate the two.

On survey experiments, I think it is reasonable to assume that:

  1. the proportion of people paying attention is the same across the control and treatment groups, and
  2. there is no correlation between who pays attention and assignment to the control or treatment group, e.g., it is not the case that men are inattentive in the treatment group and women in the control group.

If the assumptions hold, then the worst we get is an estimate on the attentive subset (principal stratification). To get at ATE with the same research design (and if you measure attention pre-treatment), we can post-stratify after estimating the treatment effect on the attentive subset and then re-weight to account for the inattentive group.
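One possible reading of the re-weighting step, sketched below, is to estimate the effect among attentive respondents within pre-treatment strata and then weight each stratum by its share of the full sample, inattentive respondents included. That reading rests on an assumption not stated above: within a stratum, inattentive respondents would show the same effect as attentive ones had they paid attention. The function name `poststratified_effect`, the row keys, and the toy data are all hypothetical.

```python
from collections import defaultdict


def poststratified_effect(rows):
    """Stratum effects among the attentive, weighted by full-sample stratum shares.

    rows: dicts with keys 'stratum', 'attentive' (measured pre-treatment),
    'treated' (0 or 1), and 'y'. Assumes every stratum has attentive
    respondents in both arms; otherwise the division below fails.
    """
    full_counts = defaultdict(int)
    cells = defaultdict(lambda: [0.0, 0])  # (stratum, treated) -> [sum of y, n]
    for r in rows:
        full_counts[r["stratum"]] += 1
        if r["attentive"]:
            cell = cells[(r["stratum"], r["treated"])]
            cell[0] += r["y"]
            cell[1] += 1
    total = sum(full_counts.values())
    estimate = 0.0
    for stratum, n_stratum in full_counts.items():
        t_sum, t_n = cells[(stratum, 1)]
        c_sum, c_n = cells[(stratum, 0)]
        stratum_effect = t_sum / t_n - c_sum / c_n
        estimate += (n_stratum / total) * stratum_effect
    return estimate


# Toy data: stratum "a" is 6 of 8 respondents, but only 2 of them are attentive.
rows = [
    {"stratum": "a", "attentive": True, "treated": 1, "y": 3.0},
    {"stratum": "a", "attentive": True, "treated": 0, "y": 1.0},
    {"stratum": "a", "attentive": False, "treated": 1, "y": 5.0},
    {"stratum": "a", "attentive": False, "treated": 0, "y": 5.0},
    {"stratum": "a", "attentive": False, "treated": 1, "y": 5.0},
    {"stratum": "a", "attentive": False, "treated": 0, "y": 5.0},
    {"stratum": "b", "attentive": True, "treated": 1, "y": 2.0},
    {"stratum": "b", "attentive": True, "treated": 0, "y": 2.0},
]
reweighted = poststratified_effect(rows)
print(reweighted)
```

In the toy data, the unweighted difference in means among the attentive is 1.0, because stratum "a" is underrepresented among the attentive; weighting by full-sample shares gives stratum "a" its due.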

The experimental way to get at attenuation would be to manipulate attention, e.g., via incentives, after the respondents have seen the treatment but before the DV measurement has begun. For instance, see this paper.

Attenuation is one thing, proper standard errors another. People responding randomly will also lead to fatter standard errors, not just because we have fewer respondents but because, as Ed Haertel points out (in personal communication):

  1. “The variance of the random responses could be [in fact, very likely is: GS] different [from] the variances in the compliant groups.”
  2. Even “if the variance of the random responses was zero, we’d get noise because although the proportions of random responders in the T and C groups are equal in expectation, they will generally not be exactly the same in any given experiment.”

Wanted: Effects That Support My Hypothesis

8 May

Do survey respondents account for the hypothesis that they think people fielding the survey have when they respond? The answer, according to Mummolo and Peterson, is not much.

Their paper also very likely provides the reason why—people don’t pay much attention. Figure 3 provides data on manipulation checks—the proportion guessing the hypothesis being tested correctly. The change in proportion between control and treatment ranges from -.05 to .25, with a bulk of changes in Qualtrics between 0 and .1. (In one condition, authors even offer an additional 25 cents to give a result consistent with the hypothesis. And presumably, people need to know the hypothesis before they can answer in line with it.) The faint increase is especially noteworthy given that on average, the proportion of people in the control group who guess the hypothesis correctly—without the guessing correction—is between .25–.35 (see Appendix B; pdf).

So, the big thing we may have learned from the data is how little attention survey respondents pay. The numbers obtained here are similar to those in Appendix D of Jonathan Woon’s paper (pdf). The point is humbling and suggests that we need to: a) invest more in measurement, and b) have yet larger samples, which is an expensive way to overcome measurement error—a point Gelman has made before.

There is also the point about the worthiness of including ‘manipulation checks.’ Experiments tell us ATE of what we manipulate. The role of manipulation checks is to shed light on ‘compliance.’ If conveying experimenter demand clearly and loudly is a goal, then the experiments included probably failed. If the purpose was to know whether clear but not very loud cues about ‘demand’ matter—and for what it’s worth, I think it is a very reasonable goal; pushing further, in my mind, would have reduced the experiment to a tautology—the paper provides the answer.

Error Free Multi-dimensional Thinking

1 May

Some recent research suggests that Americans’ policy preferences are highly constrained, with a single dimension able to correctly predict over 80% of the responses (see Jessee 2009, Tausanovitch and Warshaw 2013). Not only that, adding a new (orthogonal) dimension doesn’t improve prediction success by more than a couple of percentage points.

All this flies in the face of conventional wisdom in American Politics, which is roughly antipodal to the new view: most people’s policy preferences are unstructured. In fact, many people don’t have any real preferences on many of the issues (‘non-preferences’). Evidence that is most often cited in support of this view comes from Converse – weak correlation between preferences across measurement waves spanning two years (r ~ .4 to .5), and even lower within-wave cross-issue correlations (r ~ .2).

What explains this double disagreement — over the authenticity of preferences, and over the structuration of preferences?

First, the authenticity of preferences. When reports of preferences change across waves, is it a consequence of attitude change or non-preferences or measurement error? In response to concerns about long periods between test-retest – which allowed for opinions to genuinely change – researchers tried shorter time periods. Correlations were notably stronger (r ~ .6 to .9)(see Brown 1970). But the sheen of these healthy correlations was worn off by concerns that stability was merely an artifact of people remembering and reproducing what they put down last time.

Redemption of correlations over longer time periods came from Achen (1975). While few of the assumptions behind the redemption are correct – notably uncorrelated errors (across individuals, waves, etc.) – for inferences to be seriously wrong, much has to go wrong. More recently, and, subject to validation, perhaps more convincingly, work by Dean Lacy suggests that once you take out the small number of implausible transitions between waves – those from one end of the scale to another – cross-wave correlations are fairly healthy. (This is exactly opposite to the conclusion Converse came to based on a Markov model; he argued that aside from a few consistent responses, the rest of the responses were mostly noise.) Much simpler but informative tests are still missing. For instance, it seems implausible that lots of people who hold well-defined preferences on an issue would struggle to pick even the right side of the scale when surveyed. Tallying the stability of dichotomized preferences would be useful.
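The tally of dichotomized stability could be as simple as the sketch below. The 7-point scale, the midpoint of 4, the decision to drop midpoint answers, and the function names are assumptions made for illustration; real panel items would need their own coding rules.

```python
def side(response, midpoint=4):
    """Return -1 or 1 for the side of the scale, 0 for the midpoint."""
    if response < midpoint:
        return -1
    if response > midpoint:
        return 1
    return 0


def dichotomized_stability(wave1, wave2, midpoint=4):
    """Share of respondents on the same side of the scale in both waves.

    Respondents at the midpoint in either wave are dropped (an assumption).
    """
    pairs = [(side(a, midpoint), side(b, midpoint)) for a, b in zip(wave1, wave2)]
    kept = [(a, b) for a, b in pairs if a != 0 and b != 0]
    if not kept:
        raise ValueError("no respondents off the midpoint in both waves")
    return sum(a == b for a, b in kept) / len(kept)


# Hypothetical responses from seven panelists on a 7-point scale.
wave1 = [1, 2, 6, 7, 4, 3, 5]
wave2 = [3, 1, 7, 2, 5, 2, 6]
stability = dichotomized_stability(wave1, wave2)
print(stability)
```

Here five of the six panelists who took a side in both waves stay on the same side, even though several moved a point or two along the scale.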

Some other purported evidence for the authenticity of preferences has come from measurement error models that rest upon more sizable assumptions. These models assume an underlying trait (or traits) and pool preferences over disparate policy positions (see, for instance, Ansolabehere, Rodden, and Snyder 2006 but also Tausanovitch and Warshaw 2013). How do we know there is an underlying trait? That isn’t clear. Generally, it is perfectly okay to ask whether preferences are correlated, less so to simply assume that preferences are structured by an unobserved underlying mental construct.

With the caveat that dimensions may not reflect mental constructs, we next move to assessing claims about the dimensionality of preferences. Differences between recent results and conventional wisdom about “constraint” may be simply due to an increase in the structuration of preferences over time. However, research suggests that constraint hasn’t increased over time (Baldassarri and Gelman 2008). Perhaps more plausibly, dichotomization, which presumably reduces measurement error, is behind some of the differences. There are of course less ham-handed ways of reducing measurement error. For instance, using multiple items to measure preferences on a single policy, as psychologists often do. Since it cannot be emphasized enough, the lesson of the past two paragraphs is: keep adjustments for measurement error and measurement of constraint separate.

Analysis suggesting higher constraint may also be an artifact of analysts’ choices. Dimension reduction techniques are naturally sensitive to the pool of items. If a large majority of the items solicit preferences on economic issues (as in Tausanovitch and Warshaw 2013), the first principal component will naturally pick preferences on that dimension. Since the majority of the gains would come from correctly predicting a large majority of the items, gains in percentage correctly predicted would be poor at judging whether there is another dimension, say preferences on cultural issues. Cross-validation across selected large item groups (large enough to overcome idiosyncratic error) would be a useful strategy. And then again, gains in percentage correctly predicted over the entire population may miss subgroups with very different preference structures. For instance, Blacks and Catholics, who tend to be more socially conservative but economically liberal. Lastly, it is possible that preferences on some current issues (such as those used by Jessee 2009) may be more structured (by political conflict) than some old standing issues.

Why Were the Polls so Accurate?

16 Nov

The Quant. Interwebs have overflowed with joy since the election. Poll aggregation works. And so indeed does polling, though you won’t hear as much about it on the news, which is likely biased more towards celebrity intellects than towards the hardworking many. But why were the polls so accurate?

One potential explanation: because they do some things badly. For instance, most fail at collecting “random samples” these days, because of a fair bit of nonresponse bias. This nonresponse bias, if correlated with the propensity to vote, may actually push up the accuracy of the vote choice means. There are a few ways to check this theory.

One way to check this hypothesis: were the results from polls using Likely Voter screens different from those not using them? If not, why not? From the Political Science literature, we know that people who vote (not just those who say they vote) do vary a bit from those who do not vote, even on things like vote choice. For instance, there is just a larger proportion of ‘independents’ among them.

Other kinds of evidence will be in the form of failure to match population or other benchmarks. For instance, election polls would likely fare poorly when predicting how many people voted in each state. Or when tallying up Spanish-language households or the number of registered voters. Another way of saying this is that the bias will vary by what parameter we aggregate from these polling data.

So let me reframe the question: how do polls get election numbers right even when they undercount Spanish speakers? One explanation is that there is a positive correlation between selection into polling, and propensity to vote, which makes vote choice means much more reflective of what we will see come election day.

The other possible explanation for all this – post-stratification or other post hoc adjustment to numbers, or innovations in how sampling is done: matching, stratification, etc. Doing so uses additional knowledge about the population and can shrink s.e.s and improve accuracy. One way to test for such non-randomness: overly tight confidence bounds. Many polls tend to do wonderfully on multiple uncorrelated variables, for instance, census region proportions, gender, etc., something random samples cannot regularly produce.
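One hedged way to operationalize the check for bounds that are too tight: under simple random sampling, a sample proportion for a group with population share p has a standard error of sqrt(p(1 - p)/n), so the squared standardized gaps between a poll's margins and the benchmarks should average about one per roughly independent variable. A sum far below the number of variables, poll after poll, hints at adjustment. The function names and all the numbers below are made up for illustration.

```python
import math


def z_scores(sample_props, benchmark_props, n):
    """Standardized gaps between poll margins and benchmarks under simple random sampling."""
    scores = []
    for s, p in zip(sample_props, benchmark_props):
        se = math.sqrt(p * (1 - p) / n)  # standard error of a sample proportion
        scores.append((s - p) / se)
    return scores


def tightness_statistic(sample_props, benchmark_props, n):
    """Sum of squared z-scores; about len(benchmark_props) on average if the
    sample were random and the variables roughly independent."""
    return sum(z * z for z in z_scores(sample_props, benchmark_props, n))


# Hypothetical benchmarks and poll margins for three unrelated variables.
benchmarks = [0.51, 0.38, 0.17]
poll = [0.512, 0.379, 0.171]
stat = tightness_statistic(poll, benchmarks, n=1000)
print(stat, "vs. roughly", len(benchmarks), "expected under random sampling")
```

A single poll landing close to the benchmarks proves little; the pattern across many polls and many variables is what would be telling.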

Poor Browsers and Internet Surveys

14 Jul


  1. older browsers are likelier to display the survey incorrectly.
  2. type of browser can be a proxy for respondent’s proficiency in using computers, and speed of the Internet connection.

People using older browsers may abandon surveys at higher rates than those using more modern browsers.

Using data from a large Internet survey, we test whether people who use older browsers abandon surveys at higher rates, and whether their surveys have larger amounts of missing data.

GSS and ANES: Alike Yet Different

1 Jan

The General Social Survey (GSS), run out of the National Opinion Research Center at the University of Chicago, and the American National Election Studies (ANES), which until recently ran out of the University of Michigan’s Institute for Social Research, are two preeminent surveys tracking over-time trends in the social and political attitudes, beliefs, and behavior of the US adult population.

Outside of their shared Midwestern roots, GSS and ANES also share a sampling design: both use a stratified random sample, with the selection of PSUs affected by the necessities of in-person interviewing, and during the 1980s and 1990s, they shared a sampling frame. However, in spite of this relatively close coordination in sampling and a common mode of interview, responses to the few questions asked identically in the two surveys diverge systematically.

In 1996, 2000, 2004, and 2008, GSS and ANES included the exact same questions on racial trait ratings. Limiting the sample to just White respondents, the mean difference in trait ratings of Whites and Blacks was always greater in ANES – for ratings of hard work and intelligence, almost always statistically significantly so.

Separately, the difference in the proportion of self-identified Republicans estimated by ANES and GSS is declining over time.

This unexplained directional variance poses a considerable threat to inference. The problem takes on additional gravity given that the surveys are the bedrock of important empirical research in social science.

Another Coding Issue in the ANES Cumulative File

29 Dec

Technology has made it easy to analyze data. However, we have paid inadequate attention to developing automation in data analysis software that pays more attention to potential problems with the data itself. For example, I was recently exploring how interviewer-rated political knowledge varied by respondent’s level of education within each year over time using the ANES cumulative file. It was only when I plotted the confidence bounds (not earlier) that I found that in 2004 the 7-category education variable (VCF0140a) had fewer than 7 levels, a highly unlikely scenario. To verify, I checked the number of unique levels of education in 2004 and indeed there were only 5.

[1] 6 5 2 3 1

The variable from which the 7-category variable is ostensibly constructed (V043254) has 8 levels in 2004. Since the plot looks reasonable for 2004, the problem was more likely due to (unwarranted) collapsing of adjacent categories than to something more irresponsible, such as switching their order. Tallying raw counts revealed that categories 6 and 7, 0 and 1, and 4 and 5 had been collapsed.

On to the point about developing software that automatically flags potential problems. It would be nice if the software flagged a differing number of levels of the same variable by year. However, this suggestion is piecemeal and more careful thinking ought to be brought to bear on design issues.
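A minimal sketch of such a flag, assuming the data arrive as a list of records with a year and a categorical variable: count the distinct non-missing levels per year and report any year whose count differs from the most common count. The function name `flag_level_counts` and the key names `year` and `education` are hypothetical; the toy 2004 levels mirror the five values printed above.

```python
from collections import Counter, defaultdict


def flag_level_counts(records, year_key="year", var_key="education"):
    """Return {year: n_levels} for years whose number of distinct levels
    differs from the most common number across years."""
    levels = defaultdict(set)
    for record in records:
        value = record.get(var_key)
        if value is not None:  # skip missing values
            levels[record[year_key]].add(value)
    counts = {year: len(values) for year, values in levels.items()}
    if not counts:
        return {}
    modal_count = Counter(counts.values()).most_common(1)[0][0]
    return {year: n for year, n in counts.items() if n != modal_count}


# Toy data: seven levels in 2000 and 2002, only five in 2004.
records = (
    [{"year": 2000, "education": v} for v in range(1, 8)]
    + [{"year": 2002, "education": v} for v in range(1, 8)]
    + [{"year": 2004, "education": v} for v in [6, 5, 2, 3, 1]]
)
flagged = flag_level_counts(records)
print(flagged)
```

A check like this would not catch categories that were swapped rather than collapsed, so it is a first screen at best.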

The Perils of Balancing Scales

15 Nov

Randomization of scale order (balancing) across respondents is common practice. It is done to ‘cancel’ errors generated by ‘satisficers’ who presumably pick the first option on a scale without regard to content. The practice is assumed to have no impact on the propensity of satisficers to pick the first option, or on other respondents, both somewhat unlikely assumptions.

A far more reasonable hypothesis is that reversing scale order does have an impact on respondents, on both non-satisficers and satisficers. Empirically, people take longer to fill out reverse ordered scales, and it is conceivable that they pay more attention to filling out the responses — reducing satisficing and perhaps boosting the quality of responses, either way not simply ‘canceling’ errors among a subset, as hypothesized.

Within satisficers, without randomization, correlated bias may produce artificial correlations across variables where none existed. For example, we may ‘find’ that satisficers (say, the uneducated) love candy on a scale that runs from love candy to hate candy. Such a calamity ought to be avoided. However, in a minority of cases where satisficers’ true preferences are those expressed in the first choice, randomization will artificially produce null results. Randomization may be more sub-optimal still if there indeed are effects on the rest of the respondents.

Within survey experiments, where balancing randomization is “orthogonal” (typically just separate) to the main randomization, it has to be further assumed that manipulation has equal impact on “satisficers” in either reverse or regularly ordered scale, again a somewhat tenuous assumption.

The entire exercise of randomization is devoted not to finding out the true preferences of the satisficers, a more honorable purpose, but to eliminating them from the sample. There are better ways to catch ‘satisficers’ than randomizing across the entire sample. One possibility is to randomize within a smaller set of likely satisficers. On knowledge questions, ability estimated over multiple questions can be used to inform the propensity that the first option (if correct and if chosen) was not a guess. Response latency can be used as well to inform judgments. For attitude questions, follow-up questions measuring the strength of attitude, etc. can be used to weight responses on attitude questions.

If we are interested in getting true attitudes from ‘satisficers,’ we may want to motivate respondents either by interspersed exhortations that their responses matter, or by providing financial incentives.

Lastly, it is important to note that combining two kinds of systematic error doesn’t make it a ‘random’ error. And no variance in data can be a conservative attribute of data (with hardworking social scientists around).

Expanding the Database Yet Further

25 Feb

“College sophomores may not be people,” wrote Carl Hovland, quoting Tolman. Yet research done on them continues to be a significant part of all research in Social Science. The status quo is a function of cost and effort.

In 2007, Cindy Kam proposed using the university staff as a way to move beyond the ‘narrow database.’

There are two other convenient ways of expanding the database yet further: alumni, and local community colleges. Using social networks, universities can tap into interested students, and staff acquaintances.

A common (university-wide) platform to recruit and manage panels of such respondents would not only yield greater quality control and convenience but also cost savings.

Coding Issues in the ANES Cumulative File

29 Nov

I have on occasion used American National Election Studies (ANES) cumulative file to do over time comparisons. Roughly half of those times, I have found patterns that don’t make much sense. Only a small fraction of the times when the patterns didn’t make sense have I chosen to investigate the data more closely, as a likely explanation for aberrant patterns. The following ‘finding’ is a result of such effort.

ANES cumulative file (1948–2004) carries a variety of indices. In creating some of the indices, it appears pre-election measures have been combined with post-election measures in some of the years. If that wasn’t enough, at least one of the times, the same index in some years has pre-election measure combined with the post-election measure, while using only post measures in other years. Here’s an example –

‘External Efficacy Index’ (VCF0648) is built out of two items –

Item 1: Public officials don’t care much what people like me think.
Item 2: People like me don’t have any say about what the government does.

Item 2 is asked both pre and post-election in some cycles. In 1996, efficacy is built out of –

960568 (pre), 961244 (post)


[you can ID post-election wave questions through the following coding category – Inap, no Post IW]. Post version of 960568 is 961245

While in 2000 it is built out of – 001527 (post), 001528 (post)


I have alerted the ANES staff, and it is likely that the new iteration of the cumulative file will fix this particular issue.