The (Mis)Information Age: Measuring and Improving ‘Digital Literacy’

31 Aug

The information age has brought both bounty and pestilence. Today, we are deluged with both correct and incorrect information. If we knew how to tell correct claims from incorrect ones, we would have inched that much closer to utopia. But the lack of nous in telling apart generally ‘obvious’ incorrect claims from correct claims has brought us close to the precipice of disarray. Thus, improving people’s ability to identify untrustworthy claims as such takes on urgency.

Before we find fixes, it is good to measure how bad things are and what things are bad. This is the task the following paper sets itself by creating a ‘digital literacy’ scale. (Digital literacy is an overloaded term. It means many different things, from the ability to find useful information, e.g., information about schools or government programs, to the ability to protect yourself against harm online (see here and here for how frequently people’s accounts are breached and how often they put themselves at risk of malware or phishing), to the ability to identify incorrect claims as such, which is how the paper uses it.)

Rather than build a skill assessment kind of a scale, the paper measures (really predicts) skills indirectly using some other digital literacy scales, whose primary purpose is likely broader. The paper validates the importance of various constituent items using variable importance and model fit kinds of measures. There are a few dangers of doing that:

  1. Inference using surrogates is dangerous as the weakness of surrogates cannot be fully explored with one dataset. And they are liable not to generalize as underlying conditions change. We ideally want measures that directly measure the construct.
  2. Variable importance is not the same as important variables. For instance, it isn’t clear why “recognition of the term RSS,” the “highest-performing item by far,” has much to do with skill in identifying untrustworthy claims.

Some other work builds uncalibrated measures of digital literacy (conceived as in the previous paper). As part of an effort to judge the efficacy of a particular way of educating people about how to judge untrustworthy claims, the paper provides measures of trust in claims. The topline is that educating people is not hard (see the appendix for the description of the treatment). A minor treatment (see below) is able to improve “discernment between mainstream and false news headlines.”

Understandably, the effects of this short treatment are ‘small.’ The ITT short-term effect in the US is: “a decrease of nearly 0.2 points on a 4-point scale.” Later in the manuscript, the authors provide the substantive magnitude of the .2 pt net swing using a binary indicator of perceived headline accuracy: “The proportion of respondents rating a false headline as “very accurate” or “somewhat accurate” decreased from 32% in the control condition to 24% among respondents who were assigned to the media literacy intervention in wave 1, a decrease of 7 percentage points.” The .2 pt. net swing on a 4-point scale leading to a 7 percentage point difference is quite remarkable and generally suggests that there is a lot of ‘reverse’ intra-category movement that the crude dichotomization elides. But even if we take the crude categories as the quantity of interest, a month later in the US, the 7 percentage point swing is down to 4 percentage points:

“…the intervention reduced the proportion of people endorsing false headlines as accurate from 33 to 29%, a 4-percentage-point effect. By contrast, the proportion of respondents who classified mainstream news as not very accurate or not at all accurate rather than somewhat or very accurate decreased only from 57 to 55% in wave 1 and 59 to 57% in wave 2.”

Guess et al. 2020

The opportunity to mount more ambitious treatments remains sizable. So does the opportunity to more precisely understand what aspects of the quality of evidence people find hard to discern. And how we could release products that make their job easier.

Reliable Respondents

23 Jul

Setting aside concerns about sampling, the quality of survey responses on popular survey platforms is abysmal (see here and here). Both insincere and inattentive respondents are at issue. A common strategy for identifying inattentive respondents is to use attention checks. However, many of these attention checks stick out like sore thumbs. The upshot is that an experienced respondent can easily spot them. A parallel worry about attention checks is that inexperienced respondents may be confused by them. To address the concerns, we need a new way to identify inattentive respondents. One way to identify such respondents is to measure twice. More precisely, measure immutable or slowly changing traits, e.g., sex, education, etc., twice across closely spaced survey waves. Then, code cases where people switch answers across the waves on such traits as problematic. And then, use survey items (e.g., self-reports) and metadata (e.g., survey response time, IP address metadata, etc.) from the first survey to predict problematic switches using modern ML techniques that allow variable selection, like LASSO (space is at a premium). Assuming the fitted equation generalizes, future survey creators can use the variables identified by LASSO to identify likely inattentive respondents.
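A minimal sketch of the coding step described above, assuming two waves stored as dictionaries keyed by a respondent id. The trait names, the toy records, and the one-switch threshold are all hypothetical. The resulting labels would then serve as the outcome in a LASSO fit, which is not shown here.

```python
# Hypothetical two-wave records keyed by respondent id; field names are illustrative.
wave1 = {
    "r1": {"sex": "F", "education": "BA", "birth_year": 1980},
    "r2": {"sex": "M", "education": "HS", "birth_year": 1975},
    "r3": {"sex": "F", "education": "MA", "birth_year": 1990},
}
wave2 = {
    "r1": {"sex": "F", "education": "BA", "birth_year": 1980},
    "r2": {"sex": "F", "education": "HS", "birth_year": 1957},
    "r3": {"sex": "F", "education": "MA", "birth_year": 1990},
}
STABLE_TRAITS = ("sex", "education", "birth_year")


def count_switches(first, second, traits=STABLE_TRAITS):
    """Number of stable traits on which the two waves disagree."""
    return sum(first[t] != second[t] for t in traits)


def flag_problematic(w1, w2, traits=STABLE_TRAITS, threshold=1):
    """Label respondents present in both waves: True if switches >= threshold."""
    return {
        rid: count_switches(w1[rid], w2[rid], traits) >= threshold
        for rid in w1.keys() & w2.keys()
    }


labels = flag_problematic(wave1, wave2)
```

Raising `threshold` trades false positives (a genuine change in education, a typo) against missed inattentive respondents; where to set it is a judgment call the note does not settle.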

Faites Attention! Dealing with Inattentive and Insincere Respondents in Experiments

11 Jul

Respondents who don’t pay attention or respond insincerely are in vogue (see the second half of the note). But how do you deal with such respondents in an experiment?

To set the context, a toy example. Say that you are running an experiment. And say that 10% of the respondents in a rush to complete the survey and get the payout don’t read the survey question that measures the dependent variable and respond randomly to it. In such cases, the treatment effect among the 10% will be centered around 0. And including the 10% would attenuate the Average Treatment Effect (ATE).

More formally, in the subject pool, there is an ATE that is E[Y(1)] – E[Y(0)].  You randomly assign folks, and under usual conditions, they render a random sample of Y(1) or Y(0), which in expectation retrieves the ATE.  But when there is pure guessing, the guess by subject i is not centered around Y_i(1) in the treatment group or Y_i(0) in the control group.  Instead, it is centered on some other value that is altogether unresponsive to treatment. 
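To make the attenuation concrete, here is a small sketch under stated assumptions: attentive respondents have a constant treatment effect, and inattentive respondents answer uniformly at random on a 0 to 10 scale regardless of assignment. The function names, the scale, and the baseline of 5 are illustrative choices, not part of the argument above.

```python
import random


def expected_observed_ate(true_ate, share_inattentive):
    """Expected ATE when inattentive respondents contribute a zero effect."""
    return (1 - share_inattentive) * true_ate


def simulate_observed_ate(true_ate=1.0, share_inattentive=0.10, n=20000, seed=0):
    """Difference in means from one simulated experiment with random responders."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        assigned = rng.random() < 0.5
        if rng.random() < share_inattentive:
            # Random answer on a 0-10 scale, unresponsive to treatment.
            y = rng.uniform(0, 10)
        else:
            y = 5 + (true_ate if assigned else 0) + rng.gauss(0, 1)
        (treated if assigned else control).append(y)
    return sum(treated) / len(treated) - sum(control) / len(control)


analytic = expected_observed_ate(1.0, 0.10)
simulated = simulate_observed_ate()
```

With 10% random responders and a true effect of 1, the expected estimate is 0.9; the simulated difference in means should land near that value, up to sampling noise.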

Now that we understand the consequences of inattention, how do we deal with it?

We could deal with inattentive responding under compliance, but it is useful to separate compliance with the treatment protocol, which can be just picking up the phone, from attention or sincerity with which the respondent responds to the dependent variables. On a survey experiment, compliance plausibly adequately covers both, but in cases where treatment and measurement are de-coupled, e.g., happen at different times, it is vital to separate the two.

On survey experiments, I think it is reasonable to assume that:

  1. the proportion of people paying attention is the same across the control and treatment groups, and
  2. there is no correlation between who pays attention and assignment to the control group/treatment group, e.g., it is not the case that men are inattentive in the treatment group and women in the control group.

If the assumptions hold, then the worst we get is an estimate on the attentive subset (principal stratification). To get at ATE with the same research design (and if you measure attention pre-treatment), we can post-stratify after estimating the treatment effect on the attentive subset and then re-weight to account for the inattentive group. (One potential issue with the scheme is that variables used to stratify may have a fair bit of measurement error among inattentive respondents.)
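One way to read the post-stratification scheme above, as a rough sketch: estimate the effect among attentive respondents within each stratum, then weight the stratum effects by each stratum's share of the full sample, inattentive respondents included. This leans on the extra assumption that, within a stratum, inattentive respondents would have had the same effect as attentive ones. The row layout and stratum names are hypothetical.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def poststratified_ate(rows):
    """rows: dicts with keys stratum, attentive (measured pre-treatment),
    treated, and y. Every stratum needs attentive treated and control cases."""
    counts = defaultdict(int)
    outcomes = defaultdict(lambda: {True: [], False: []})
    for row in rows:
        counts[row["stratum"]] += 1  # full-sample stratum size
        if row["attentive"]:
            outcomes[row["stratum"]][row["treated"]].append(row["y"])
    total = sum(counts.values())
    ate = 0.0
    for stratum, n in counts.items():
        effect = mean(outcomes[stratum][True]) - mean(outcomes[stratum][False])
        ate += (n / total) * effect
    return ate


def make(stratum, attentive, treated, y):
    return {"stratum": stratum, "attentive": attentive, "treated": treated, "y": y}


rows = (
    [make("young", True, True, 4), make("young", True, True, 4)]
    + [make("young", True, False, 2), make("young", True, False, 2)]
    + [make("young", False, True, 9), make("young", False, False, 1)]
    + [make("old", True, True, 5), make("old", True, True, 5)]
    + [make("old", True, False, 4), make("old", True, False, 4)]
)
estimate = poststratified_ate(rows)
```

In this toy data, the pooled attentive-only estimate is 1.5, while re-weighting to the full sample (where the ‘young’ stratum, with more inattentive respondents, is 60% of cases) gives 1.6.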

The experimental way to get at attenuation would be to manipulate attention, e.g., via incentives, after the respondents have seen the treatment but before the DV measurement has begun. For instance, see this paper.

Attenuation is one thing, proper standard errors another. People responding randomly will also lead to fatter standard errors, not just because we have fewer respondents but because as Ed Haertel points out (in personal communication):

  1. “The variance of the random responses could be [in fact, very likely is: GS] different [from] the variances in the compliant groups.”
  2. Even “if the variance of the random responses was zero, we’d get noise because although the proportions of random responders in the T and C groups are equal in expectation, they will generally not be exactly the same in any given experiment.”

Wanted: Effects That Support My Hypothesis

8 May

Do survey respondents account for the hypothesis that they think people fielding the survey have when they respond? The answer, according to Mummolo and Peterson, is not much.

Their paper also very likely provides the reason why: people don’t pay much attention. Figure 3 provides data on manipulation checks, i.e., the proportion guessing the hypothesis being tested correctly. The change in proportion between control and treatment ranges from -.05 to .25, with the bulk of changes in Qualtrics between 0 and .1. (In one condition, the authors even offer an additional 25 cents to give a result consistent with the hypothesis. And presumably, people need to know the hypothesis before they can answer in line with it.) The faint increase is especially noteworthy given that, on average, the proportion of people in the control group who guess the hypothesis correctly (without the guessing correction) is between .25 and .35 (see Appendix B; pdf).

So, the big thing we may have learned from the data is how little attention survey respondents pay. The numbers obtained here are similar to those in Appendix D of Jonathan Woon’s paper (pdf). The point is humbling and suggests that we need to: a) invest more in measurement, and b) have yet larger samples, which is an expensive way to overcome measurement error—a point Gelman has made before.

There is also the point about the worthiness of including ‘manipulation checks.’ Experiments tell us ATE of what we manipulate. The role of manipulation checks is to shed light on ‘compliance.’ If conveying experimenter demand clearly and loudly is a goal, then the experiments included probably failed. If the purpose was to know whether clear but not very loud cues about ‘demand’ matter—and for what it’s worth, I think it is a very reasonable goal; pushing further, in my mind, would have reduced the experiment to a tautology—the paper provides the answer.

Error Free Multi-dimensional Thinking

1 May

Some recent research suggests that Americans’ policy preferences are highly constrained, with a single dimension able to correctly predict over 80% of the responses (see Jessee 2009, Tausanovitch and Warshaw 2013). Not only that, adding a new (orthogonal) dimension doesn’t improve prediction success by more than a couple of percentage points.

All this flies in the face of conventional wisdom in American Politics, which is roughly antipodal to the new view: most people’s policy preferences are unstructured. In fact, many people don’t have any real preferences on many of the issues (‘non-preferences’). Evidence that is most often cited in support of this view comes from Converse – weak correlation between preferences across measurement waves spanning two years (r ~ .4 to .5), and even lower within wave cross-issue correlations (r ~ .2).

What explains this double disagreement — over the authenticity of preferences, and over the structuration of preferences?

First, the authenticity of preferences. When reports of preferences change across waves, is it a consequence of attitude change or non-preferences or measurement error? In response to concerns about long periods between test-retest – which allowed for opinions to genuinely change – researchers tried shorter time periods. Correlations were notably stronger (r ~ .6 to .9)(see Brown 1970). But the sheen of these healthy correlations was worn off by concerns that stability was merely an artifact of people remembering and reproducing what they put down last time.

Redemption of correlations over longer time periods came from Achen (1975). While few of the assumptions behind the redemption are correct – notably uncorrelated errors (across individuals, waves, etc.) – for inferences to be seriously wrong, much has to go wrong. More recently, and, subject to validation, perhaps more convincingly, work by Dean Lacy suggests that once you take out the small number of implausible transitions between waves – those from one end of the scale to another – cross-wave correlations are fairly healthy. (This is exactly opposite to the conclusion Converse came to based on a Markov model; he argued that aside from a few consistent responses, the rest of the responses were mostly noise.) Much simpler but informative tests are still missing. For instance, it seems implausible that lots of people who hold well-defined preferences on an issue would struggle to pick even the right side of the scale when surveyed. Tallying stability of dichotomized preferences would be useful.
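The tally of dichotomized stability suggested above could look something like the following sketch. It assumes a 7-point scale with midpoint 4 and drops anyone at the midpoint in either wave; both choices, and the toy responses, are mine rather than anything in the cited work.

```python
def side(response, midpoint=4):
    """Map a response to -1 (below midpoint), 0 (midpoint), or 1 (above)."""
    return (response > midpoint) - (response < midpoint)


def dichotomized_stability(wave1, wave2, midpoint=4):
    """Share of respondents off the midpoint in both waves who stay on one side."""
    pairs = [(side(a, midpoint), side(b, midpoint)) for a, b in zip(wave1, wave2)]
    off_midpoint = [(a, b) for a, b in pairs if a != 0 and b != 0]
    return sum(a == b for a, b in off_midpoint) / len(off_midpoint)


# Toy responses from the same eight respondents in two waves.
w1 = [1, 2, 6, 7, 3, 5, 4, 2]
w2 = [2, 1, 7, 5, 5, 6, 3, 3]
stability = dichotomized_stability(w1, w2)
```

Here seven respondents are off the midpoint in both waves and six of them stay on the same side, even though several move a point or two within their side of the scale.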

Some other purported evidence for the authenticity of preferences has come from measurement error models that rest upon more sizable assumptions. These models assume an underlying trait (or traits) and pool preferences over disparate policy positions (see, for instance, Ansolabehere, Rodden, and Snyder 2006 but also Tausanovitch and Warshaw 2013). How do we know there is an underlying trait? That isn’t clear. Generally, it is perfectly okay to ask whether preferences are correlated, less so to simply assume that preferences are structured by an unobserved underlying mental construct.

With the caveat that dimensions may not reflect mental constructs, we next move to assessing claims about the dimensionality of preferences. Differences between recent results and conventional wisdom about “constraint” may be simply due to an increase in structuration of preferences over time. However, research suggests that constraint hasn’t increased over time (Baldassarri and Gelman 2008). Perhaps more plausibly, dichotomization, which presumably reduces measurement error, is behind some of the differences. There are of course less ham-handed ways of reducing measurement error. For instance, using multiple items to measure preferences on a single policy, as psychologists often do. Since it cannot be emphasized enough, the lesson of the past two paragraphs is: keep adjustments for measurement error, and measurement of constraint separate.

Analysis suggesting higher constraint may also be an artifact of analysts’ choices. Dimension reduction techniques are naturally sensitive to the pool of items. If a large majority of the items solicit preferences on economic issues (as in Tausanovitch and Warshaw 2013), the first principal component will naturally pick preferences on that dimension. Since the majority of the gains would come from correctly predicting a large majority of the items, gains in percentage correctly predicted would be poor at judging whether there is another dimension, say preferences on cultural issues. Cross-validation across selected large item groups (large enough to overcome idiosyncratic error) would be a useful strategy. And then again, gains in percentage correctly predicted over the entire population may miss subgroups with very different preference structures. For instance, Blacks and Catholics, who tend to be more socially conservative but economically liberal. Lastly, it is possible that preferences on some current issues (such as those used by Jessee 2009) may be more structured (by political conflict) than some old standing issues.
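A quick worked example of why gains in percentage correctly predicted can hide a second dimension when the item pool is lopsided. The item counts and accuracy figures below are made up for illustration.

```python
def pct_correct(item_counts, accuracy):
    """Overall share correctly predicted, weighting each issue area by item count."""
    total = sum(item_counts.values())
    return sum(item_counts[k] * accuracy[k] for k in item_counts) / total


# Hypothetical pool: 40 economic items, 10 cultural items.
item_counts = {"economic": 40, "cultural": 10}

# Hypothetical accuracies: a second dimension helps only the cultural items.
one_dimension = {"economic": 0.85, "cultural": 0.60}
two_dimensions = {"economic": 0.85, "cultural": 0.75}

overall_one = pct_correct(item_counts, one_dimension)
overall_two = pct_correct(item_counts, two_dimensions)
overall_gain = overall_two - overall_one
cultural_gain = two_dimensions["cultural"] - one_dimension["cultural"]
```

A 15-point gain on cultural items shows up as only a 3-point gain overall, because cultural items are a fifth of the pool.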

Why Were the Polls so Accurate?

16 Nov

The Quant. Interwebs have overflowed with joy since the election. Poll aggregation works. And so indeed does polling, though you won’t hear as much about it on the news, which is likely biased towards celebrity intellects rather than the hardworking many. But why were the polls so accurate?

One potential explanation: because they do some things badly. For instance, most fail at collecting “random samples” these days, because of a fair bit of nonresponse bias. This nonresponse bias, if correlated with the propensity to vote, may actually push up the accuracy of the vote choice means. There are a few ways to check this theory.

One way to check this hypothesis: were the results from polls using Likely Voter screens different from those not using them? If not, why not? From the Political Science literature, we know that people who vote (not just those who say they vote) do vary a bit from those who do not vote, even on things like vote choice. For instance, there is just a larger proportion of ‘independents’ among them.

Other kinds of evidence will be in the form of failure to match population or other benchmarks. For instance, election polls would likely fare poorly when predicting how many people voted in each state. Or when tallying up Spanish language households or the number of registered voters. Another way of saying this is that the bias will vary by what parameter we aggregate from these polling data.

So let me reframe the question: how do polls get election numbers right even when they undercount Spanish speakers? One explanation is that there is a positive correlation between selection into polling, and propensity to vote, which makes vote choice means much more reflective of what we will see come election day.
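A back-of-the-envelope sketch of this explanation, with made-up numbers: voters and nonvoters differ in candidate support, and voters respond to polls at a much higher rate. The shares, response rates, and support levels below are all hypothetical.

```python
def poll_estimate(groups):
    """groups: list of (population_share, response_rate, support_for_candidate)."""
    responders = sum(share * rate for share, rate, _ in groups)
    supporters = sum(share * rate * support for share, rate, support in groups)
    return supporters / responders


# Hypothetical population: 60% voters (52% support), 40% nonvoters (40% support).
voters = (0.6, 0.10, 0.52)      # voters respond to polls at 10%
nonvoters = (0.4, 0.02, 0.40)   # nonvoters respond at 2%

election_result = 0.52          # only voters count on election day
poll = poll_estimate([voters, nonvoters])

# A perfectly random sample of all adults (equal response rates) for comparison.
all_adults = poll_estimate([(0.6, 1.0, 0.52), (0.4, 1.0, 0.40)])
```

Under these numbers, the nonresponse-laden poll (about 50.6%) lands closer to the election result than a perfect random sample of adults (47.2%) would, while it would do worse at estimating any quantity defined over all adults.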

The other possible explanation for all this: post-stratification or other post hoc adjustments to the numbers, or innovations in how sampling is done (matching, stratification, etc.). Doing so uses additional knowledge about the population and can shrink s.e.s and improve accuracy. One way to test such non-randomness: overly tight confidence bounds. Many polls tend to do wonderfully on multiple uncorrelated variables, for instance, census region proportions, gender, … etc., something random samples cannot regularly produce.
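One rough way to operationalize the ‘overly tight’ check, assuming we know a population benchmark (say, the share of women) and each poll's sample size: compare how far the polls' estimates stray from the benchmark with what simple random sampling would produce. The statistic and the toy numbers are illustrative, not a standard named test.

```python
import math


def dispersion_ratio(estimates, sample_sizes, benchmark):
    """Mean squared z-score of poll estimates around a known benchmark.

    Under simple random sampling this should be near 1. Values far below 1
    suggest the estimates were adjusted (or sampled) to match the benchmark.
    """
    squared_z = []
    for estimate, n in zip(estimates, sample_sizes):
        se = math.sqrt(benchmark * (1 - benchmark) / n)
        squared_z.append(((estimate - benchmark) / se) ** 2)
    return sum(squared_z) / len(squared_z)


benchmark = 0.51  # hypothetical population share
sizes = [1000, 1000, 1000, 1000]

# Hypothetical polls that hit the benchmark almost exactly.
tight = dispersion_ratio([0.510, 0.511, 0.509, 0.510], sizes, benchmark)

# Hypothetical polls that scatter roughly as random samples of 1,000 would.
loose = dispersion_ratio([0.49, 0.53, 0.51, 0.525], sizes, benchmark)
```

With samples of 1,000, the standard error on a proportion near .5 is about 1.6 points, so polls that all land within a tenth of a point of the benchmark are doing far better than chance alone allows.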

Poor Browsers and Internet Surveys

14 Jul

Given,

  1. older browsers are likelier to display the survey incorrectly.
  2. type of browser can be a proxy for respondent’s proficiency in using computers, and speed of the Internet connection.

People using older browsers may abandon surveys at higher rates than those using more modern browsers.

Using data from a large Internet survey, we test whether people who use older browsers abandon surveys at higher rates, and whether their surveys have larger amounts of missing data.

GSS and ANES: Alike Yet Different

1 Jan

The General Social Survey (GSS), run out of the National Opinion Research Center at the University of Chicago, and the American National Election Studies (ANES), which until recently ran out of the University of Michigan’s Institute for Social Research, are two preeminent surveys tracking over-time trends in the social and political attitudes, beliefs, and behavior of the US adult population.

Outside of their shared Midwestern roots, GSS and ANES also share a sampling design: both use a stratified random sample, with the selection of PSUs affected by the necessities of in-person interviewing, and, during the 1980s and 1990s, they also shared a sampling frame. However, in spite of this relatively close coordination in sampling and a common mode of interview, responses to a few questions asked identically in the two surveys diverge systematically.

In 1996, 2000, 2004, and 2008, GSS and ANES included the exact same questions on racial trait ratings. Limiting the sample to just White respondents, the mean difference in trait ratings of Whites and Blacks was always greater in ANES; for ratings of hard work and intelligence, almost always statistically significantly so.

Separately, the difference in the proportion of self-identified Republicans estimated by ANES and GSS is declining over time.

This unexplained directional variance poses a considerable threat to inference. The problem takes on additional gravity given that the surveys are the bedrock of important empirical research in social science.

Another Coding Issue in the ANES Cumulative File

29 Dec

Technology has made it easy to analyze data. However, we have paid inadequate attention to developing automation in data analysis software that pays more attention to potential problems with the data itself. For example, I was recently exploring how interviewer-rated political knowledge varied by respondent’s level of education within each year over time using the ANES cumulative file. It was only when I plotted the confidence bounds (not earlier) that I found that in 2004 the 7-category education variable (VCF0140a) had fewer than 7 levels, a highly unlikely scenario. To verify, I checked the number of unique levels of education in 2004 and indeed there were only 5.


unique(nes$vcf0140a[nes$vcf0004=="2004"])
[1] 6 5 2 3 1

The variable from which the 7-category variable is ostensibly constructed (V043254) has 8 levels in 2004. Since the plot looks reasonable for 2004, the problem was more likely a case of (unwarranted) collapsing of adjacent categories than of something more irresponsible, such as switching their order. Tallying raw counts revealed that categories 6 and 7, 0 and 1, and 4 and 5 had been collapsed.

On to the point about developing software that automatically flags potential problems. It would be nice if the software flagged differing numbers of levels of the same variable by year. However, this suggestion is piecemeal and more careful thinking ought to be brought to bear on design issues.
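As a sketch of what such a flag might look like (in Python rather than the R used above, and with toy records standing in for the cumulative file), the check below counts distinct levels of a variable within each year and reports years that deviate from the most common count. The function names and the ‘modal count’ rule are my own illustrative choices.

```python
from collections import Counter, defaultdict


def levels_by_year(records, variable, year_key="year"):
    """Collect the set of observed (non-missing) levels of a variable per year."""
    levels = defaultdict(set)
    for record in records:
        value = record.get(variable)
        if value is not None:
            levels[record[year_key]].add(value)
    return levels


def flag_years(records, variable, year_key="year"):
    """Return {year: n_levels} for years whose level count differs from the mode."""
    counts = {
        year: len(values)
        for year, values in levels_by_year(records, variable, year_key).items()
    }
    modal_count = Counter(counts.values()).most_common(1)[0][0]
    return {year: n for year, n in counts.items() if n != modal_count}


# Toy data: 7 education levels in 2000 and 2002, only 5 in 2004.
records = (
    [{"year": 2000, "education": level} for level in range(1, 8)]
    + [{"year": 2002, "education": level} for level in range(1, 8)]
    + [{"year": 2004, "education": level} for level in (6, 5, 2, 3, 1)]
)
flagged = flag_years(records, "education")
```

A check like this would have surfaced the 2004 anomaly before any plotting, though it would also raise false alarms wherever a category is legitimately empty in a small year.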

The Perils of Balancing Scales

15 Nov

Randomization of scale order (balancing) across respondents is common practice. It is done to ‘cancel’ errors generated by ‘satisficers’ who presumably pick the first option on a scale without regard to content. The practice is assumed to have no impact on the propensity of satisficers to pick the first option, or on other respondents, both somewhat unlikely assumptions.

A far more reasonable hypothesis is that reversing scale order does have an impact on respondents, on both non-satisficers and satisficers. Empirically, people take longer to fill out reverse ordered scales, and it is conceivable that they pay more attention to filling out the responses — reducing satisficing and perhaps boosting the quality of responses, either way not simply ‘canceling’ errors among a subset, as hypothesized.

Within satisficers, without randomization, correlated bias may produce artificial correlations across variables where none existed. For example, if satisficers (say, the uneducated) always pick the first option on a ‘love candy’ to ‘hate candy’ scale, it would appear that the uneducated love candy. Such a calamity ought to be avoided. However, in a minority of cases where satisficers’ true preferences are those expressed in the first choice, randomization will artificially produce null results. Randomization may be more sub-optimal still if there indeed are effects on the rest of the respondents.

Within survey experiments, where balancing randomization is “orthogonal” (typically just separate) to the main randomization, it has to be further assumed that manipulation has equal impact on “satisficers” in either reverse or regularly ordered scale, again a somewhat tenuous assumption.

The entire exercise of randomization is devoted not to finding out the true preferences of the satisficers, a more honorable purpose, but to eliminating them from the sample. There are better ways to catch ‘satisficers’ than randomizing across the entire sample. One possibility is to randomize within a smaller set of likely satisficers. On knowledge questions, ability estimated over multiple questions can be used to inform the probability that the first option (if correct and if chosen) was not a guess. Response latency can be used as well to inform judgments. For attitude questions, follow-up questions measuring the strength of attitude, etc. can be used to weight responses on attitude questions.

If we are interested in getting true attitudes from ‘satisficers,’ we may want to motivate respondents either by interspersed exhortations that their responses matter, or by providing financial incentives.

Lastly, it is important to note that combining two kinds of systematic error doesn’t make it a ‘random’ error. And no variance in data can be a conservative attribute of data (with hardworking social scientists around).