Time Will Tell

23 Nov

Part of empirical social science is about finding fundamental truths about people. It is a difficult enterprise partly because scientists only observe data in a particular context. Neither cross-sectional variation nor a few decades of data is usually enough to yield generalizable truths. Longer observation windows help clarify what is an essential truth and what is, at best, a contextual truth.

Support For Racially Justified and Targeted Affirmative Action

Sniderman and Carmines (1999) find that a large majority of Democrats and Republicans oppose racially justified and targeted affirmative action policies. They find that opposition to racially targeted affirmative action is not rooted in prejudice. Instead, they conjecture that it is rooted in adherence to the principle of equality. The authors don’t say it outright but the reader can surmise that in their view, opposition to racially justified and targeted affirmative action is likely to be continued and broad-based. It is a fair hypothesis. Except 20 years later, a majority of Democrats support racially targeted and racially justified affirmative action in education and hiring (see here).

What’s the Matter with “What’s the Matter with What’s the Matter with Kansas”?

It isn’t clear Bartels was right about Kansas even in 2004 (see here) (and that isn’t to say Thomas Frank was right) but the thesis around education has taken a nosedive. See below.

Split Ticket Voting For Moderation

On the back of record split ticket voting, Fiorina (and others) theorized “divided government is the result of a conscious attempt by the voters to achieve moderate policy.” Except very quickly split ticket voting declined (with of course no commensurate radicalization of the population) (see here).

Effect of Daughters on Legislator Ideology

Having daughters was thought to lead politicians to vote more liberally (see here) but more data suggested that this stopped in the polarized era (see here). Yet more data suggested that there was no trend for legislators with daughters to vote liberally before the era covered by the first study (see here).

Striking Changes Among Democrats on Race and Gender

10 Nov

The election of Donald Trump led many to think that Republicans have changed, especially on race related issues. But the data suggest that the big changes in public opinion on racial issues over the last decade or so have been among Democrats. Since 2012, Democrats have become strikingly more liberal on race and on issues related to women and LGBT people.

Conditions Make It Hard for Blacks to Succeed

The percentage of Democrats strongly agreeing with the statement more than doubled between 2012 (~ 20%) and 2020 (~ 45%).

Source: ANES

Affirmative Action in Hiring/Promotion

The percentage of Democrats for affirmative action for Blacks in hiring/promotion nearly doubled between 2012 (~ 26%) and 2020 (~ 51%).

Source: ANES

Fun fact: Support for caste based and gender based reservations in India is ~4x+ higher than support for race based Affirmative Action in the US. See here.

Blacks Should Not Get Special Favors to Get Ahead

The percentage of Democrats strongly disagreeing with the statement roughly tripled between 2012 (~ 13%) and 2020 (~ 41%).

Source: ANES

See also Sniderman and Carmines who show that support for the statement is not rooted in racial prejudice.

Feelings Towards Racial Groups

Democrats in 2020 felt more warmly toward Blacks, Hispanics, and Asians than Whites.

Source: ANES

White Democrats’ Feelings Towards Various Racial Groups

White Democrats in 2020 felt more warmly toward Asians, Blacks, and Hispanics than Whites.

Democrats’ Feelings Towards Gender Groups

Democrats felt 15 points more warmly toward feminists and LGBT in 2020 than in 2012.

Source: ANES

How Numerous Are the Numerate?

14 Feb

I recently conducted a survey on Lucid and posed a short quiz to test basic numeracy:

  • A man writes a check for $100 when he has only $70.50 in the bank. By how much is he overdrawn? — $29.50, $170.50, $100, $30.50
  • Imagine that we roll a fair, six-sided die 1000 times. Out of 1000 rolls, how many times do you think the die would come up as an even number? — 500, 600, 167, 750
  • If the chance of getting a disease is 10 percent, how many people out of 1,000 would be expected to get the disease? — 100, 10, 1000, 500
  • In a sale, a shop is selling all items at half price. Before the sale, the sofa costs $300. How much will it cost on sale? — $150, $100, $200, $250
  • A second-hand car dealer is selling a car for $6,000. This is two-thirds of what it cost new. How much did the car cost new? — $9,000, $4,000, $12,000, $8,000
  • In the BIG BUCKS LOTTERY, the chances of winning a $10 prize are 1%. What is your best guess about how many people would win a $10 prize if 1000 people each buy a single ticket from BIG BUCKS? — 10, 1, 100, 50

I surveyed 800 adult Americans. Of the 800, only 674 respondents (about 84%) cleared the attention check—a question designed to test if the respondents were paying attention or not. I limit the analysis to these 674 respondents.

A caveat before the results. I do not adjust the scores for guessing.
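For reference, here is a minimal sketch of what a conventional correction for guessing (formula scoring) would look like. The post does not apply it. The sketch assumes that every item has four response options, as in the quiz above, and that wrong answers come from random guessing. The function name is hypothetical.

```python
def guess_corrected_score(n_correct: float, n_wrong: float, n_options: int = 4) -> float:
    """Formula score: correct answers minus wrong answers / (options - 1).

    Assumes wrong answers reflect blind guessing among n_options choices.
    """
    return n_correct - n_wrong / (n_options - 1)


# A respondent who answers all six items, getting 4 right and 2 wrong.
print(guess_corrected_score(4, 2))

# Blind guessing on six 4-option items yields 1.5 right and 4.5 wrong
# in expectation, which the correction maps to zero.
print(guess_corrected_score(1.5, 4.5))
```

Under these assumptions, the correction pulls down the scores of respondents with more wrong answers, so the raw proportions reported here are, if anything, an upper bound on knowledge.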

Of these respondents, just about a third got all the answers correct. Another quarter got 5 out of 6 correct. Another 19% got 4 out of 6 right. The remaining 20% got 3 or fewer questions right. The table below enumerates the item-wise results.

Item | Proportion Correct
Sofa Sale | .97

The same numbers are plotted below.

p.s. You may be interested in reading this previous blog based on MTurk data.

Data Police

13 Mar

In a new paper, Chohlas-Wood et al. present three interesting points:

  1. Some of the major policing strategies have scant empirical support:
    • The impact of “pulling over drivers for minor traffic violations” (for the alleged purpose of “[preventing] criminal activity by intercepting individuals driving to and from the scene of a crime”) in Nashville was ~ 0 on serious crimes. (See Figures 1 and 2). To get a sense of the scale of the intervention: “In 2012, the MNPD conducted traffic stops up to ten times more frequently per capita than police departments in similar U.S. cities.”
    • The impact of stop and frisk in NYC on serious crime was also ~ 0. Again, to get a sense of the scale of the policy: “NYPD officers reported conducting nearly 700,000 Terry stops in 2011 alone, nearly 90% of which involved Black or Hispanic pedestrians.”
    • GS: None of this is terribly surprising. All over the world, very few policies are chosen as a result of careful data analysis. Why would policing be any different? My other prior based on looking at a fair bit of US crime data is that to a first approximation, all trends are national. When policing is local and trends are national, it suggests that the way policing is done is perhaps not the most important factor in preventing crime.
  2. Racial bias in who is stopped:
    • “[A]t any given level of risk Black and Hispanic individuals were frisked considerably more often than white individuals.” (NYC, 2011-2012)
    • “[T]he rates at which frisks recover weapons are significantly lower for frisked Black individuals (3.8%) and Hispanic individuals (3.4%) compared to white individuals (5.7%).” (From the Chicago Police Department (CPD) in 2017)
    • Contraband recovery rate for Blacks = 17%, Hispanics = 20%, Whites = 27% (Chicago 2014–2019, traffic stops.)
    • Contraband recovery rate for Blacks = 24%, Hispanics = 23%, Whites = 34% (Philadelphia 2014–2019; traffic stops.)
    • GS: I am impressed by the contraband recovery rates. Either the base rate of ‘contraband’ is super high or the police are very good. My hunch is the former but would love to see data. (See below.)
    • GS: If police select whom to stop based on observable characteristics (conditional on location; what else can they rely on?), criminals may be incentivized to game those characteristics, reducing the value of observables over time.
  3. Whack-a-mole nature of policing policies
    • “The settlement agreement with the ACLU took effect on January 1, 2016. For 2016, the CPD reported a total of approximately 100,000 pedestrian stops, a sharp drop from the roughly 600,000 stops reported for 2015 (Figure 9). At the same time, the number of traffic stops made by the CPD began to rise. The CPD reported around 100,000 traffic stops in 2014 and a similar amount in 2015, but by 2019, the CPD reported nearly 600,000 traffic stops, with large increases occurring each year from 2016 to 2019. These traffic stops came to closely resemble the pedestrian stops that the CPD was contemporaneously under pressure to curtail. …”
    • “Following a consent decree and settlement in 2011, pedestrian stops fell from more than 200,000 reported stops in 2014 (the earliest year for which we have data released publicly by the city) to fewer than 100,000 reported stops in each of 2018 and 2019, while traffic stops almost doubled in the same period”

p.s. Graham sends this:

“Back in the 1990s, it looked like the Supreme Court was going to run drug checkpoints, so Indianapolis started doing one. Drivers were stopped completely at random until the Supreme Court put an end to it.

The city conducted six such roadblocks between August and November that year, stopping 1,161 vehicles and arresting 104 motorists. Fifty-five arrests were for drug-related crimes, while 49 were for offenses unrelated to drugs. The overall “hit rate” of the program was thus approximately nine percent.

If you take this as a baseline, police are twice as good at finding contraband as random selection. If “contraband” just means drugs, then probably four times as good. So the baseline rate of contraband is high (a surprising number of people have warrants, drugs, and weapons) but police are also beating the odds.”

Chicago is not Indianapolis and 2015 is not 2000 but still valuable.
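The arithmetic behind Graham's comparison can be sketched as follows. The Indianapolis counts and the Chicago traffic-stop recovery rates are the ones quoted in this post; treating an arrest rate at a random checkpoint as comparable to a contraband recovery rate is an assumption of the comparison, not something the data establish.

```python
# Indianapolis checkpoint figures quoted above.
stops = 1161
arrests_total = 104
arrests_drug = 55

hit_rate_all = arrests_total / stops   # roughly 9%
hit_rate_drug = arrests_drug / stops   # roughly 5%

# Chicago 2014-2019 traffic-stop contraband recovery rates quoted earlier.
chicago = {"Black": 0.17, "Hispanic": 0.20, "White": 0.27}

print(f"baseline (all arrests): {hit_rate_all:.3f}")
print(f"baseline (drug arrests only): {hit_rate_drug:.3f}")
for group, rate in chicago.items():
    # Ratio of the recovery rate to each random-stop baseline.
    print(group, round(rate / hit_rate_all, 1), round(rate / hit_rate_drug, 1))
```

The ratios vary a fair bit by group, so "twice" and "four times" are best read as rough orders of magnitude.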

p.p.s. Graham also highlights an issue with Figure 2. Chohlas-Wood et al. plot the murder rate per 1k on the same graphs as vehicle stops per 1k. This naturally squishes the variation in the murder rate. The general rule is that you should avoid plotting variables that vary by orders of magnitude on the same graph. At any rate, doing so gives the appearance that the authors are putting a thumb on the scale.

Interpreting Data

26 Sep

It is a myth that data speaks for itself. The analyst speaks for the data. The analyst chooses what questions to ask, what analyses to run, and how they are interpreted and summarized.

I use excerpts from a paper by Gilliam et al. on the media portrayal of crime to walk through one set of choices made by a group of analysts. (The excerpts also highlight the need for reading a paper fully rather than relying on the abstract alone.)


From Gilliam et al.; Abstract.

White Violent Criminals Are Overrepresented

From Gilliam et al.; Bottom of page 10.

White Nonviolent Criminals Are Overrepresented

From Gilliam et al.; first paragraph on page 12

Relative Underrepresentation Between Violent and Nonviolent Crime is a Problem

From Gilliam et al.; Last paragraph on page 12
From Gilliam et al.; First paragraph on page 13

Compare the above with the following figure and interpretation from Reaching Beyond Race by Sniderman and Carmines. Rather than focus on the middle two peaks: 28 vs. 43, Sniderman and Carmines write: “we were struck by the relative absence of racial polarization.” (Added on 10/4/2023)

Partisan Morality

11 Jun

Sinn Féin and Fianna Fáil have said that activists posed as members of a polling company and went door-to-door to canvass the opinions of voters.


The rationale is simple. If you pose as an SF worker, you are likely to be met with shut doors or opinions in favor of SF obtained under slight duress. Is it a bridge too far or is it a harmless lie? More generally, do we use the same moral reasoning paradigm for violations by co-partisans and opposing partisans? My hunch is that for such kinds of violations we use a deontological framework for opposing partisans and a consequentialist one for co-partisans. The framework we use may switch depending on the circumstance. One way to test it would be to do a survey experiment with the above news article, switching parties. To get a better baseline, it may be useful to do three conditions: party_a, party_b, consumer_brand, e.g., Coke, etc.
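A minimal sketch of how respondents could be assigned to the three conditions. The condition labels come from the paragraph above; the function name, the seed, and the sample size of 900 are hypothetical. Shuffling and then dealing respondents out in turn keeps the arms the same size, which is one reasonable choice among several.

```python
import random
from collections import Counter

CONDITIONS = ["party_a", "party_b", "consumer_brand"]


def assign_conditions(respondent_ids, seed=0):
    """Shuffle respondents, then deal them into conditions in turn.

    This yields arms of (near-)equal size; the seed makes it reproducible.
    """
    rng = random.Random(seed)
    ids = list(respondent_ids)
    rng.shuffle(ids)
    return {rid: CONDITIONS[i % len(CONDITIONS)] for i, rid in enumerate(ids)}


# Hypothetical sample of 900 respondents.
assignment = assign_conditions(range(900))
print(Counter(assignment.values()))
```

Each respondent would then read the same news item with only the named actor swapped, and rate how acceptable the behavior is.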

The Hateful ATE: The Effect of Affective Polarization

7 Jun

In a new paper, Broockman et al. use a clever manipulation to induce “three decades of change in affective polarization”:

In typical trust games, there are two players. Player 1 receives a cash allocation and is instructed to give “some, all, or none” of the money to Player 2. The player is also told that the researchers will triple any amount Player [1] gives to Player 2 and that Player 2 can return some, all, or none of the money back to Player 1. Therefore, the more Player 1 expects reciprocity from Player 2, the more money they should allocate to Player 2 in anticipation they will receive a larger sum in return, and the better off Player 2 will be. For example, if Player 1 gives all her money to Player 2, this sum would be tripled, and Player 2 could return half of the tripled amount to Player 1—leaving both players with 50% more than Player 1’s initial allocation. But if Player 1 gives no money to Player 2, Player 1 leaves with only her initial allocation and Player 2 leaves with nothing.

First, we always make participants take the role of Player 2. This means they always first observe an allocation another player makes to them. Second, across three consecutive rounds of game play, participants are told they are interacting with three other respondents of the opposite political party who have each been allocated $10. However, they are in fact are interacting with computerized opponents who offer allocations based on a pre-determined script. Participants randomized to the Positive Experience condition receive allocations from Player 1 of $8, $7 and $8 (tripled to $24, $21 and $24) respectively across the three rounds of the game. However, those in the Negative Experience condition receive $0 allocations in all three rounds.

Broockman et al. 2021
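The payoff arithmetic in the quoted description can be sketched as below. The tripling rule and the $10 endowment come from the excerpt; the function and argument names are hypothetical.

```python
def trust_game_payoffs(endowment, sent, returned_share, multiplier=3.0):
    """Return (player1_payoff, player2_payoff) for one round.

    Player 1 sends `sent` out of `endowment`; the researchers multiply it;
    Player 2 returns `returned_share` of the multiplied amount.
    """
    if not 0 <= sent <= endowment:
        raise ValueError("sent must be between 0 and the endowment")
    if not 0 <= returned_share <= 1:
        raise ValueError("returned_share must be between 0 and 1")
    received = sent * multiplier
    returned = received * returned_share
    return endowment - sent + returned, received - returned


# Player 1 sends everything and Player 2 returns half: both end 50% above $10.
print(trust_game_payoffs(10, 10, 0.5))  # (15.0, 15.0)

# Player 1 sends nothing: Player 1 keeps $10 and Player 2 gets nothing.
print(trust_game_payoffs(10, 0, 0.5))   # (10.0, 0.0)
```

In the experiment, participants are always Player 2, so the Positive Experience allocations of $8, $7, and $8 correspond to `received` values of $24, $21, and $24 before any return.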

Next comes the punchline. “Player 1’s reason for their allocation to you: your partisanship (all rounds), your income (Round 2)”. See Page 65.

Being told that a co- or opposing- partisan gave $0 versus being told that they gave $8, $7, and $8 because of your partisanship across three rounds has a dramatic effect on partisans’ feelings: partisans’ feelings toward opposing partisans become ‘cooler,’ it doesn’t affect their feelings towards co-partisans (impressive), and (strangely) polarizes their feelings toward elites (see the figure below).

Three comments are in order.

First, the manipulation is unrealistic given previous effect sizes (see here). “The average amount allocated to copartisans in the trust game was $4.58 (95% confidence interval [4.33, 4.83]), representing a “bonus” of some 10% over the average allocation of $4.17.”

Second, the manipulation principally ought to change perceptions of how trusting people are and not how trustworthy they are. We don’t manipulate how deceitful the other person is but how fearful they are of not having their actions reciprocated. Disliking less trusting people is slightly weird and plausibly points to how the underlying antipathy can be exacerbated by treatments that do not present a clear reason for judging another person more harshly. Or it could be that not being seen as being trustworthy and losing out on money as a result of it is insulting and aggravating.

Whatever the reason, generalizing from a bad personal interaction to all other members of a group is disturbing. (The fact that treatment cools people’s feelings toward opposing partisans suggests people expect better from them, which is interesting.) Ascribing feelings from a bad personal experience to elites seems odder (and more disturbing) still.

The absence of commensurate co- and opposing- partisan feeling panels for elites feels odd.

The paper finds that having a “bad” personal experience (vis-a-vis a better one) with an opposing partisan increases interpersonal animus (plus polarization of feelings toward partisan elites) but doesn’t cause partisans to like opposing partisan MCs less or co-partisan MCs more (though see above. Note that the pooled estimate for the opposing party is 1.5% or so—which is about what I would expect; it likely deserves another run at the bank). (I didn’t understand the change from co-partisan and opposing-partisan MCs to “own MCs” in the next analysis, so I am omitting that.) The paper discusses other DVs: 

  1. Interest in expressing party-consistent issue preferences (no effect)
  2. Support for bi-partisan legislation (~ more in favor)
  3. Opposition to democratic norms (pooled index seems to move by d = .09 and is nearly sig. at conventional levels). (I make a special reference to the index because presumably it has the least measurement error and is least likely to show an idiosyncratic pattern given sample size. There is also a small point about how multiple comparison adjustments are made—plausibly they should account for measurement error.)
  4. Endorsement of partisan-congenial claims (Ds yes; Rs no)

The theorized path from bad personal experience with a co- (or opposing) partisan to opposition to democratic norms, etc., seems convoluted to me. So let’s unpack the theoretical underpinnings of the expectations. Interpersonal animus among partisans is an indicator of affective polarization. And the experiment successfully manipulates interpersonal animus. So what’s the issue? One escape hatch is that the concept is not uni-dimensional. Another is that any increase in interpersonal affect manifests in political consequences only over long periods as it causes people to watch different media, trust different things, etc.

This Time It’s Different: Polarization of the American Polity

10 Jan

In a new paper, Pierson and Schickler contend that this era of polarization is different. They fear that polarization this time will continue to intensify because the three “meso-institutions” (interest groups, state parties, and the media) that were the bulwark against polarization in earlier eras are themselves polarized or have changed in ways that leave them offering much less resistance:

  1. State Parties
    • State Parties Have Polarized “state party platforms are more similar across states and more distinctive across parties than in earlier eras (Paddock 2005, 2014; Hopkins & Schickler 2016).”
    • Federal Government is Much Bigger. This means state concerns, which brought cross-cutting cleavages into play, matter less. “Although it has received less discussion in the analysis of polarization, a second development in the 1960s and early 1970s—what Skocpol (2003, p. 135) has termed the “long 1960s”—was also critical: a dramatic expansion and centralization of public policy (Melnick 1994, Pierson 2007, Jones et al. 2019). Civil rights legislation was only the entering wedge. During the long 1960s, liberal Congresses enacted, often on a bipartisan basis, major new domestic spending programs (especially Medicaid and Medicare, which now account for roughly a quarter of federal spending as well as, in the case of Medicaid, a big share of state spending). They greatly enlarged the regulatory state, creating powerful new federal agencies (such as the Environmental Protection Agency) and enacting extensive rules covering environmental and consumer protection as well as workplace safety.”
  2. Interest Groups Have Polarized
    • “The powerful US Chamber of Commerce provides a striking illustration of the broader trend. Traditionally conservative but studiously nonaligned, it now carefully coordinates its extensive electoral activities with the Republican Party, and its political director (a former GOP operative) can refer unselfconsciously to Republican Senate candidates as “our ticket” (Hacker & Pierson 2016).”
  3. Media: the usual story

Why This Time is Different

  • “The Civil War era represents an obvious extreme point in the intensity of divisions, yet the period of partisan polarization was remarkably brief: The major American parties featured deep internal divisions on slavery up until the mid-to-late 1850s, and the new Republican majority became deeply divided over Reconstruction and key economic questions soon after the war ended.”

Questions and Notes

  • Why are business interest groups not more bipartisan? For instance, if the US Chamber of Commerce is going hard R, is it a sign that it represents businesses of a particular sector/region? Is the consolidation of the economy (GDP) in cities causing this? If so, then how does the oncoming WFH change affect these things?
  • Given that wide swings in policy regimes are expensive for business (for one, businesses cannot plan), what kinds of plays will big businesses eventually come up with? In some ways, for instance, Twitter banning Trump is predictable. Businesses will opt for stability where they can.
  • The more frightening turn in American politics is toward populism and identity politics—so much for the end of politics.
  • The party coalitions keep evolving. For instance, in 2020, poor White people were firmly in the Republican column, while as late as 2004, as Bartels pointed out, they were not.

No Props for Prop 13

14 Dec

Proposition 13 enacted two key changes: 1. it limited property tax to 1% of the cash value, and 2. it limited the annual increase in assessed value to 2%. The only way the assessed value can change by more than 2% is if the property changes hands (a loophole allows you to change hands without officially changing hands).
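To see how these two rules produce large gaps between neighbors, here is a minimal sketch. It computes the ceiling on the tax implied by the two rules as described above and ignores any other levies. The function name and the two example houses are hypothetical, not taken from the data shown in this post.

```python
def prop13_tax_ceiling(purchase_price, years_held, tax_rate=0.01, annual_cap=0.02):
    """Upper bound on annual property tax under the two rules above.

    Assessed value starts at the purchase price and grows by at most
    `annual_cap` per year; the tax is at most `tax_rate` of assessed value.
    """
    assessed_ceiling = purchase_price * (1 + annual_cap) ** years_held
    return tax_rate * assessed_ceiling


# Hypothetical neighbors with similar houses:
# one bought 30 years ago for $300k, the other bought this year for $2M.
long_time_owner = prop13_tax_ceiling(300_000, years_held=30)
new_owner = prop13_tax_ceiling(2_000_000, years_held=0)

print(round(long_time_owner), round(new_owner))
```

Even after 30 years of maximum increases, the long-time owner's ceiling is a fraction of the new buyer's bill, because the assessed value resets to the market price only when the property changes hands.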

One impressive result of the tax is the inequality in taxes. Sample this neighborhood in San Mateo where taxes range from $67 to nearly $300k.

Take out the extremes, and the variation is still hefty. Property taxes of neighboring lots often vary by well over $20k. (My back-of-the-envelope estimate of standard deviation based on ten properties chosen at random is $23k.)

Sample another from Stanford where the range is from ~$2k to nearly $59k.

Prop. 13 has a variety of more material perverse consequences. Property taxes are one reason why people move from their suburban houses near the city to other more remote, cheaper places. But Prop. 13 reduces the need to move out. This likely increases property prices, which in turn likely lowers economic growth as employers choose other places. And as Chaste, a long-time contributor to the blog, points out, it also means that the currently employed often have to commute longer distances, which harms the environment in addition to harming the families of those who commute.

p.s. Looking at the property tax data, you see some very small amounts. For instance, $19 property tax. When Chaste dug in, he found that the property was last sold in 1990 for $220K but was assessed at $0 in 2009 when it passed on to the government. The property tax on government-owned properties and affordable housing in California is zero. And Chaste draws out the implication: “poor cities like Richmond, which are packed with affordable housing, not only are disproportionately burdened because these populations require more services, they also receive 0 in property taxes from which to provide those services.”

p.p.s. My hunch is that a political campaign that uses property taxes in CA as a targeting variable will be very successful.

p.p.p.s. Chaste adds: “Prop 13 also applies to commercial properties. Thus, big corps also get their property tax increases capped at 2%. As a result, the sales are often structured in ways that nominally preserve existing ownership.

There was a ballot proposition on the November 2020 ballot, which would have removed Prop 13 protections for commercial properties worth more than $3M. Residential properties over $3M would continue to enjoy the protection. Even this prop failed 52%-48%. People were perhaps scared that this would be the first step in removing Prop 13 protections for their own homes.”

Dismissed Without Prejudice: Evaluating Prejudice Reduction Research

25 Sep

Prejudice is a blight on humanity. How to reduce prejudice, thus, is among the most important social scientific questions. In the latest assessment of research in the area, a follow-up to the 2009 Annual Review article, Betsy Paluck et al., however, paint a dim picture. In particular, they note three dismaying things:

Publication Bias

Table 1 (see below) makes for grim reading. While one could argue that the pattern is explained by the fact that lab research tends to have smaller samples and has especially powerful treatments, the numbers (see the average s.e. of the first two rows; it may have been useful to produce a $\sqrt{1/n}$ adjusted s.e.) suggest that publication bias very likely plays a large role. It is also shocking to know that just a fifth of the studies have treatment groups with 78 or more people.

Light Touch Interventions

The article is remarkably measured when talking about the rise of ‘light touch’ interventions, i.e., short exposure treatments. I would have described them as ‘magical thinking’ for they seem to be founded in the belief that we can make profound changes in people’s thinking on the cheap. This isn’t to say light-touch interventions can’t be worked into a regime that effects profound change; repeated light touches may work. However, as far as I could tell, no study tried multiple touches to see how the effect cumulates.

Near Contemporaneous Measurement of Dependent Variables

Very few papers judged the efficacy of the intervention a day or more after the intervention. Given the primary estimate of interest is longer-term effects, it is hard to judge the efficacy of the treatments in moving the needle on the actual quantity of interest.   

Beyond what the paper notes, here are a couple more things to consider:

  1. Perspective getting works better than perspective-taking. It would be good to explore this further in inter-group settings.
  2. One way to categorize ‘basic research interventions’ is by decomposing the treatment into its primary aspects and then slowly building back up bundles based on data:
    1. channel: f2f, audio (radio, etc.), visual (photos, etc.), audio-visual (tv, web, etc.), VR, etc.
    2. respondent action: talk, listen, see, imagine, reflect, play with a computer program, work together with someone, play together with someone, receive a public scolding, etc.
    3. source: peers, strangers, family, people who look like you, attractive people, researchers, authorities, etc.
    4. message type: parable, allegory, story, graph, table, drama, etc.
    5. message content: facts, personal stories, examples, Jonathan Haidt style studies that show some of the roots of our morality are based on poor logic, etc.

The (Mis)Information Age: Provenance is Not Enough

31 Aug

The information age has brought both bounty and pestilence. Today, we are deluged with both correct and incorrect information. If we knew how to tell apart correct claims from incorrect, we would have inched that much closer to utopia. But the lack of nous in telling apart generally ‘obvious’ incorrect claims from correct claims has brought us close to the precipice of disarray. Thus, improving people’s ability to identify untrustworthy claims as such takes on urgency.


Inferring the Quality of Evidence Behind the Claims: Fact Check and Beyond

One way around misinformation is to rely on an expert army that assesses the truth value of claims. However, assessing the truth value of a claim is hard. It needs expert knowledge and careful research. When validating, we have to identify which parts are wrong, which parts are right but misleading, and which parts are debatable. All in all, it is a noisy and time-consuming process to vet a few claims. Fact check operations, hence, cull a small number of claims and try to validate those claims. As the rate of production of information increases, thwarting misinformation by checking all the claims seems implausibly expensive.

Rather than assess the claims directly, we can assess the process. Or, in particular, the residue of one part of the process for making the claim: sources. Except for claims based on private experience, e.g., religious experience, claims are based on sources. We can use the features of these sources to infer credibility. The first feature is the number of sources cited to make a claim. All else equal, the more sources saying the same thing, the greater the chances that the claim is true. None of this is to undercut a common observation: lots of people can be wrong about something. A harder test for veracity is whether a diverse set of people say the same thing. The third test is checking the credibility of the sources.

Relying on the residue is not a panacea. People can simply lie about the source. We want the source to verify what they have been quoted as saying. And in the era of cheap data, this can be easily enabled. Quotes can be linked to video interviews or automatic transcriptions electronically signed by the interviewee. The same system can be scaled to institutions. The downside is that the system may prove onerous. On the other hand, commonly, the same source is cited by many people so a public repository of verified claims and evidence can mitigate much of the burden.

But will this solve the problem? Likely not. For one, people can still commit sins of omission. For two, they can still draft things in misleading ways. For three, trust in sources may not be tied to correctness. All we have done is built a system for establishing provenance. And establishing the provenance is not enough. Instead, we need a system that incentivizes both correctness and presentation that makes correct interpretation highly likely. It is a high bar. But it is the right bar—correct and liable to be correctly interpreted.

To create incentives for publishing correct claims, we need to either 1. educate the population, which brings me to the previous post, or 2. find ways to build products and recommendations that incentivize correct claims. We likely need both.

The (Mis)Information Age: Measuring and Improving ‘Digital Literacy’

31 Aug

The information age has brought both bounty and pestilence. Today, we are deluged with both correct and incorrect information. If we knew how to tell apart correct claims from incorrect, we would have inched that much closer to utopia. But the lack of nous in telling apart generally ‘obvious’ incorrect claims from correct claims has brought us close to the precipice of disarray. Thus, improving people’s ability to identify untrustworthy claims as such takes on urgency.

Before we find fixes, it is good to measure how bad things are and what things are bad. This is the task the following paper sets itself by creating a ‘digital literacy’ scale. (Digital literacy is an overloaded term. It means many different things, from the ability to find useful information, e.g., information about schools or government programs, to the ability to protect yourself against harm online (see here and here for how frequently people’s accounts are breached and how often they put themselves at risk of malware or phishing), to the ability to identify incorrect claims as such, which is how the paper uses it.)

Rather than build a skill assessment kind of a scale, the paper measures (really predicts) skills indirectly using some other digital literacy scales, whose primary purpose is likely broader. The paper validates the importance of various constituent items using variable importance and model fit kinds of measures. There are a few dangers of doing that:

  1. Inference using surrogates is dangerous as the weakness of surrogates cannot be fully explored with one dataset. And they are liable not to generalize as underlying conditions change. We ideally want measures that directly measure the construct.
  2. Variable importance is not the same as important variables. For instance, it isn’t clear why “recognition of the term RSS,” the “highest-performing item by far,” has much to do with skill in identifying untrustworthy claims.

Some other work builds uncalibrated measures of digital literacy (conceived as in the previous paper). As part of an effort to judge the efficacy of a particular way of educating people about how to judge untrustworthy claims, the paper provides measures of trust in claims. The topline is that educating people is not hard (see the appendix for the description of the treatment). A minor treatment (see below) is able to improve “discernment between mainstream and false news headlines.”

Understandably, the effects of this short treatment are ‘small.’ The ITT short-term effect in the US is: “a decrease of nearly 0.2 points on a 4-point scale.” Later in the manuscript, the authors provide the substantive magnitude of the .2 pt net swing using a binary indicator of perceived headline accuracy: “The proportion of respondents rating a false headline as “very accurate” or “somewhat accurate” decreased from 32% in the control condition to 24% among respondents who were assigned to the media literacy intervention in wave 1, a decrease of 7 percentage points.” A .2 pt net swing on a 4-point scale leading to a 7 percentage point difference is quite remarkable and generally suggests that there is a lot of ‘reverse’ intra-category movement that the crude dichotomization elides. But even if we take the crude categories as the quantity of interest, a month later in the US, the 7 percentage point swing is down to 4 percentage points:

“…the intervention reduced the proportion of people endorsing false headlines as accurate from 33 to 29%, a 4-percentage-point effect. By contrast, the proportion of respondents who classified mainstream news as not very accurate or not at all accurate rather than somewhat or very accurate decreased only from 57 to 55% in wave 1 and 59 to 57% in wave 2.”

Guess et al. 2020

The opportunity to mount more ambitious treatments remains sizable. So does the opportunity to more precisely understand what aspects of the quality of evidence people find hard to discern. And how we could release products that make their job easier.

Another ANES Goof-em-up: VCF0731

30 Aug

By Rob Lytle

At this point, it’s well established that the ANES CDF’s codebook is not to be trusted (I’m repeating “not to be trusted” to include a second link!). Recently, I stumbled across another example of incorrect coding in the cumulative data file, this time in VCF0731 – Do you ever discuss politics with your family or friends?

The codebook reports 5 levels:

Do you ever discuss politics with your family or friends?

1. Yes
5. No

8. DK
9. NA

INAP. question not used

However, when we load the variable and examine the unique values:

# pulling anes-cdf from a GitHub repository
cdf <- rio::import("https://github.com/RobLytle/intra-party-affect/raw/master/data/raw/cdf-raw-trim.rds")

# examine the unique values (column name assumed to match the codebook)
unique(cdf$VCF0731)
## [1] NA  5  1  6  7

We see a completely different coding scheme. We are left adrift, wondering “What is 6? What is 7?” Do 1 and 5 really mean “yes” and “no”?

We may never know.

For a survey that costs several million dollars to conduct, you’d think we could expect a double-checked codebook (or at least some kind of version control to easily fix these things as they’re identified).

Survey Experiments With Truth: Learning From Survey Experiments

27 Aug

Tools define science. Not only do they determine how science is practiced but also what questions are asked. Take survey experiments, for example. Since the advent of online survey platforms, which made conducting survey experiments trivial, the lure of convenience and internal validity has persuaded legions of researchers to use survey experiments to understand the world.

Conventional survey experiments are modest tools. Paul Sniderman writes,

“These three limitations of survey experiments—modesty of treatment, modesty of scale, and modesty of measurement—need constantly to be borne in mind when brandishing term experiment as a prestige enhancer.”

Paul Sniderman

Note: We can collapse these three concerns into two— treatment (which includes ‘scale’ as Paul defines it— the amount of time) and measurement.

But skillful artisans have used this modest tool to great effect. Famously, Kahneman and Tversky used survey experiments, e.g., the Asian Disease Problem, to shed light on how people decide. More recently, Paul Sniderman and Tom Piazza have used survey experiments to shed light on an unsavory aspect of human decision making: discrimination. Aside from shedding light on human decision making, researchers have also used survey experiments to understand what survey measures mean, e.g., Ahler and Sood.

The good, however, has come with the bad; insight has often come with irreflection. In particular, Paul Sniderman implicitly points to two common mistakes that people make:

  1. Not Learning From the Control Group. The focus on differences in means means that we sometimes fail to reflect on what the data in the Control Group tells us about the world. Take the paper on partisan expressive responding, for instance. The topline from the paper is that expressive responding explains half of the partisan gap. But it misses the bigger story—the partisan differences in the Control Group are much smaller than what people expect, just about 6.5% (see here). (Here’s what I wrote in 2016.)
  2. Not Putting the Effect Size in Context. A focus on significance testing means that we sometimes fail to reflect on the modesty of effect sizes. For instance, providing people $1 for a correct answer within the context of an online survey interview is a large premium. And if providing a dollar each on 12 (included) questions nudges people from an average of 4.5 correct responses to 5, it suggests that people are resistant to learning or impressively confident that what they know is right. Leaving $7 on the table tells us more than the .5, around which the paper is written. 

    More broadly, researchers often miss that what the results sometimes show is how impressively modest the movement is when you ratchet up the dosage. For instance, if an overwhelming number of African Americans favor a White student who has scored just a few points more than a Black student, it is a telling testament to their endorsement of meritocracy.
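The arithmetic behind the second point is worth spelling out. The sketch below only restates the numbers given above (12 questions, $1 per correct answer, an average of 4.5 correct answers moving to 5); the variable names are mine and nothing here comes from the paper's replication files.

```python
# Numbers taken from the text above; variable names are illustrative.
n_questions = 12
payment_per_correct = 1.00   # dollars offered per correct answer
avg_correct_unpaid = 4.5     # average correct answers without the incentive
avg_correct_paid = 5.0       # average correct answers with the incentive

# Improvement in correct answers that accompanies the incentive
gain = avg_correct_paid - avg_correct_unpaid

# Money the average paid respondent still fails to collect
left_on_table = (n_questions - avg_correct_paid) * payment_per_correct

# Share of the total premium on offer that goes unclaimed
share_unclaimed = left_on_table / (n_questions * payment_per_correct)

print(f"gain: {gain} answers")
print(f"left on the table: ${left_on_table:.2f} ({share_unclaimed:.0%} of the total on offer)")
```

Put this way, the incentive buys half an answer while more than half of the money on offer goes unclaimed.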

Nothing to See Here: Statistical Power and “Oversight”

13 Aug

“Thus, when we calculate the net degree of expressive responding by subtracting the acceptance effect from the rejection effect—essentially differencing off the baseline effect of the incentive from the reduction in rumor acceptance with payment—we find that the net expressive effect is negative 0.5%—the opposite sign of what we would expect if there was expressive responding. However, the substantive size of the estimate of the expressive effect is trivial. Moreover, the standard error on this estimate is 10.6, meaning the estimate of expressive responding is essentially zero.”


(Note: This is not a full review of all the claims in the paper. There is more data in the paper than in the quote above. I am merely using the quote to clarify a couple of statistical points.)

There are two main points:

  1. The fact that the estimate is close to zero and the fact that the s.e. is super fat are technically unrelated. The last line of the quote, however, seems to draw a relationship between the two.
  2. The estimated effect sizes of expressive responding in the literature are much smaller than the s.e. Bullock et al. (Table 2) estimate the effect of expressive responding at about 4% and Prior et al. (Figure 1) at about 5.5% (“Figure 1(a) shows, the model recovers the raw means from Table 1, indicating a drop in bias from 11.8 to 6.3.”). Thus, one reasonable inference is that the study is underpowered to reasonably detect expected effect sizes.
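To put a rough number on the second point, here is a back-of-the-envelope power calculation. It is a sketch under assumptions that are mine, not the paper's: a two-sided z-test at the 5% level, a normal approximation, and the s.e. of 10.6 being on the same percentage-point scale as the estimates from the literature.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution


def power_two_sided(effect, se, alpha=0.05):
    """Approximate power of a two-sided z-test when the true effect is `effect`."""
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect / se
    # Probability that the test statistic lands in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def minimum_detectable_effect(se, alpha=0.05, power=0.80):
    """Smallest true effect detected with the requested power (normal approximation)."""
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * se


se = 10.6  # standard error reported in the quote
for label, effect in [("Bullock et al. (~4)", 4.0), ("Prior et al. (~5.5)", 5.5)]:
    print(label, "power:", round(power_two_sided(effect, se), 3))
print("MDE at 80% power:", round(minimum_detectable_effect(se), 1))
```

Under these assumptions, power against effects of the size reported in the literature stays below 10%, and the minimum detectable effect at 80% power is several times larger than either estimate.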

Trump Trumps All: Coverage of Presidents on Network Television News

4 May

With Daniel Weitzel.

The US government is a federal system, with substantial domains reserved for local and state governments. For instance, education, most parts of the criminal justice system, and a large chunk of regulation are under the purview of the states. Further, the national government has three co-equal branches: legislative, executive, and judicial. Given these facts, you would expect news coverage to span the branches and the levels of government. But there is a sharp skew in news coverage of politicians, with members of the executive branch, especially national politicians (and especially the president), covered far more often than other politicians (see here). Exploiting data from the Vanderbilt Television News Archive (VTNA), the largest publicly available database of TV news, with over 1M broadcast abstracts spanning 1968 to 2019, we add body to the observation. We searched for references to the president during their presidency and coded each hit as 1. As the figure below shows, references to the president are common. Excluding Trump, on average, a sixth of all abstracts contain a reference to the sitting president. But Trump is different: 60%(!) of abstracts refer to Trump.
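The coding rule is simple enough to sketch. The snippet below is an illustration, not our actual script: the abstracts, field names, and the surname-only match are all made up for the example, and only two presidents are listed.

```python
import re
from datetime import date

# Hypothetical abstracts; the real VTNA records and field names may differ.
abstracts = [
    {"date": date(2018, 3, 1), "text": "President Trump comments on tariffs."},
    {"date": date(2018, 3, 2), "text": "Flooding in the Midwest continues."},
    {"date": date(2010, 6, 5), "text": "Obama visits the Gulf Coast after the oil spill."},
    {"date": date(2010, 6, 6), "text": "Stocks close higher on jobs report."},
]

# (surname, start, end) with the end date exclusive. The last end date is
# simply the end of the data window, not the end of the presidency.
terms = [
    ("Obama", date(2009, 1, 20), date(2017, 1, 20)),
    ("Trump", date(2017, 1, 20), date(2020, 1, 1)),
]


def sitting_president(d):
    """Return the surname of the president in office on date d, if listed."""
    for name, start, end in terms:
        if start <= d < end:
            return name
    return None


def mentions_president(abstract):
    """Code an abstract as 1 if it names the sitting president, else 0."""
    name = sitting_president(abstract["date"])
    if name is None:
        return 0
    pattern = rf"\b{re.escape(name)}\b"
    return int(re.search(pattern, abstract["text"], re.IGNORECASE) is not None)


hits = [mentions_president(a) for a in abstracts]
share = sum(hits) / len(hits)
print(hits, share)
```

A surname match is crude: it will miss references like “the president” and can pick up other people who share the name, so a real pipeline would need a more careful pattern.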

Data and scripts can be found here.

Making an Impression: Learning from Google Ads

31 Oct

Broadly, Google Ads works as follows: 1. advertisers create an ad, choose keywords, and make a bid on cost-per-click or CPC (you can also bid on cost-per-view and cost-per-impression, but we limit our discussion to CPC), 2. the Google Ads account team vets whether the keywords are related to the product being advertised, and 3. people see the ad from the winning bid when they search for a term that includes the keyword or when they browse content that is related to the keyword (some Google Ads are shown on sites that use Google AdSense).

There is a further nuance to the last step. Generally, on popular keywords, Google has thousands of candidate ads to choose from. And Google doesn’t simply choose the ad from the winning bid. Instead, it uses data to choose an ad (or a few ads) that yields the most profit (Click Through Rate (CTR)*bid). (Google probably has a more complex user utility function and doesn’t show ads below a low predicted CTR*bid.) In all, who Google shows ads to depends on the predicted CTR and the money it will make per click.
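A toy version of this selection rule may help. It is only a sketch of the logic as described above, ranking candidates by predicted CTR*bid and dropping those under a floor; the class, function names, and numbers are invented, and Google's actual auction is surely more involved.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    bid: float            # cost-per-click bid, in dollars
    predicted_ctr: float  # predicted click-through rate, between 0 and 1


def expected_revenue(ad):
    """Expected revenue per impression: predicted CTR times the CPC bid."""
    return ad.predicted_ctr * ad.bid


def choose_ads(candidates, k=1, floor=0.0):
    """Return the top-k ads by predicted CTR*bid, dropping ads below the floor."""
    eligible = [ad for ad in candidates if expected_revenue(ad) >= floor]
    return sorted(eligible, key=expected_revenue, reverse=True)[:k]


# Invented candidates for a single keyword
candidates = [
    Ad("high bid, low CTR", bid=5.00, predicted_ctr=0.01),
    Ad("low bid, high CTR", bid=1.00, predicted_ctr=0.08),
    Ad("below floor", bid=0.50, predicted_ctr=0.01),
]

winner = choose_ads(candidates, k=1, floor=0.01)[0]
print(winner.name)
```

In this made-up example the highest bid loses, which is the point: the bid alone does not determine who sees the ad.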

Given this setup, we can reason about the audience for an ad. First, the higher the bid, the broader the audience. Second, it is not clear how well Google can predict CTR per ad conditional on keyword bid, especially when the ad run is small. And if that is so, we expect Google to show the ad with the highest bid to a random subset of people searching for the keyword or browsing content related to the keyword. Under such conditions, you can use the total number of impressions per demographic group as an indicator of interest in the keyword. For instance, if you make the highest bid on the keyword ‘election’ and you find that the total number of impressions that your ad makes among people 65+ is 10x the number among people between ages 18-24, under some assumptions, e.g., similar use of ad blockers, similar rates of clicking ads conditional on relevance (which would become the same as predicted relevance), similar utility functions (that is, younger people are not more sensitive to irritation from irrelevant ads than older people), etc., you can infer relative interest of 18-24 versus 65+ in elections.
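The inference itself is just a ratio of impression totals. The sketch below uses invented counts that mirror the 10x example and is valid only under the assumptions listed above (random display, similar ad blocker use, and so on).

```python
# Invented impression counts by age group for one keyword.
impressions = {"18-24": 1_200, "25-34": 3_000, "65+": 12_000}


def relative_interest(counts, group, reference):
    """Ratio of total impressions in `group` to total impressions in `reference`."""
    return counts[group] / counts[reference]


ratio = relative_interest(impressions, "65+", "18-24")
print(f"65+ vs 18-24: {ratio:.0f}x")
```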

The other case where you can infer relative interest in a keyword (topic) from impressions is when ad markets are thin. For common keywords like ‘elections,’ Google generally has thousands of candidate ads for national campaigns. But if you only want to show your ad in a small geographic area or for an infrequently searched term, the candidate set can be pretty small. If your ad is the only one, then your ad will be shown wherever it exceeds some minimum threshold of predicted CTR*bid. Assuming a high enough bid, you can take the total number of impressions of an ad as a proxy for total searches for the term and how often people browsed related content.

With all of this in mind, I discuss results from a Google Ads campaign. More here.

The Other Side

23 Oct

Samantha Laine Perfas of the Christian Science Monitor interviewed me about the gap between perceptions and reality for her podcast ‘perception gaps’ over a month ago. You can listen to the episode here (Episode 2).

The Monitor has also made the transcript of the podcast available here. Some excerpts:

“Differences need not be, and we don’t expect them to be, reasons why people dislike each other. We are all different from each other, right. …. Each person is unique, but we somehow seem to make a big fuss about certain differences and make less of a fuss about certain other differences.”

One way to fix it:

“If you know so little and assume so much, … the answer is [to] simply stop doing that. Learn a little bit, assume a little less, and see where the conversation goes.”

The interview is based on the following research:

  1. Partisan Composition (pdf) and Measuring Shares of Partisan Composition (pdf)
  2. Affect Not Ideology (pdf)
  3. Coming to Dislike (pdf)
  4. All in the Eye of the Beholder (pdf)

Related blog posts and think pieces:

  1. Party Time
  2. Pride and Prejudice
  3. Loss of Confidence
  4. How to read Ahler and Sood

Don’t Expose Yourself! Discretionary Exposure to Political Information

10 Oct

As the options have grown, so have the fears. Are the politically disinterested taking advantage of the nearly limitless options to opt out of news entirely? Are the politically interested siloing themselves into “echo chambers”? In an eponymous Oxford Research Encyclopedia article, I discuss what we think we know, and some concerns about how we can know. Some key points:

  • Is the gap between how much the politically interested and politically disinterested know about politics increasing, as Post-broadcast Democracy posits? Figure 1 suggests not.

  • Quantity rather than ratio: “If the dependent variable is partisan affect, how ‘selective’ one is may not matter as much as the net imbalance in consumption—the difference between the number of congenial and uncongenial bits consumed…”

  • To measure how much political information a person is consuming, you must be able to distinguish political information from its complement. But what isn’t political information? “In this chapter, our focus is on consumption of varieties of political information. The genus is political information. And the species of this genus differ in congeniality, among other things. But what is political information? All information that influences people’s political attitudes or behaviors? If so, then limiting ourselves to news is likely too constraining. Popular television shows like The Handmaid’s Tale, Narcos, and Law and Order have clear political themes. … Shows like Will and Grace and The Cosby Show may be less clearly political, but they also have a political subtext.” (see Figure 4) … “Even if we limit ourselves to news, the domain is still not clear. Is news about a bank robbery relevant political information? What about Hillary Clinton’s haircut? To the extent that each of these affect people’s attitudes, they are arguably pertinent.”

  • One of the challenges with inferring consumption based on domain level data is that domain level data are crude. Going to http://nytimes.com is not the same as reading political news. And measurement error may vary by the kind of person. For instance, say we label http://nytimes.com as political news. For the political junkie, the measurement error may be close to zero. For teetotalers, it may be close to 100% (see more).

  • Show people a few news headlines along with the news source (you can randomize the source). What can you learn from a few such ‘trials’? You cannot learn what proportion of news they get from a particular source. You can learn their preferences, but not reliably. More from the paper: “Given the problems with self-reports, survey instruments that rely on behavioral measures are plausibly better. … We coded congeniality trichotomously: congenial, neutral, or uncongenial. The correlations between trials are alarmingly low. The polychoric correlation between any two trials range between .06 to .20. And the correlation between choosing political news in any two trials is between -.01 and .05.”

  • Following up on the previous point: preference for a source which has a mean slant != preference for slanted news. “Current measures of [selective exposure] are beset with five broad problems. First is conceptual errors. For instance, people frequently equate preference for information from partisan sources with a preference for congenial information.”
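The “quantity rather than ratio” point above is easy to see with two made-up consumers. The sketch below contrasts a ratio-based measure of selectivity with the net imbalance; the function names and counts are mine, not the article's.

```python
def selectivity_ratio(congenial, uncongenial):
    """Share of consumed political bits that are congenial."""
    return congenial / (congenial + uncongenial)


def net_imbalance(congenial, uncongenial):
    """Difference between the number of congenial and uncongenial bits consumed."""
    return congenial - uncongenial


# Two invented consumers: (congenial bits, uncongenial bits)
light_consumer = (3, 1)
heavy_consumer = (300, 100)

for label, (c, u) in [("light", light_consumer), ("heavy", heavy_consumer)]:
    print(label, selectivity_ratio(c, u), net_imbalance(c, u))
```

Both consumers are equally ‘selective’ by the ratio, but the heavy consumer takes in a net surplus of congenial information that is 100 times as large.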

Code 44: How to Read Ahler and Sood

27 Jun

This is a follow-up to the hilarious Twitter thread about the sequence of 44s. Numbers in Perry’s 538 piece come from this paper.

First, yes, the 44s are indeed correct. (Better yet, look for yourself.) But what do the 44s refer to? 44 is the average of all the responses. When Perry writes “Republicans estimated the share at 46 percent” (we have similar language in the paper, which is regrettable as it can be easily misunderstood), it doesn’t mean that every Republican thinks so. It may not even mean that the median Republican thinks so. See OA 1.7 for medians, OA 1.8 for distributions, but see also OA 2.8.1, Table OA 2.18, OA 2.8.2, OA 2.11 and Table OA 2.23.
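To see why an average of 44 says little about what the typical respondent thinks, consider a toy example. The responses below are invented so that the mean comes out to 44; they are not from the paper's data.

```python
from statistics import mean, median

# Invented estimates (in percent) from ten hypothetical respondents.
estimates = [10, 10, 20, 20, 30, 30, 50, 80, 90, 100]

avg = mean(estimates)
med = median(estimates)

# Share of respondents whose own estimate is at or above the average
share_at_or_above_avg = sum(e >= avg for e in estimates) / len(estimates)

print(avg, med, share_at_or_above_avg)
```

Here the mean is 44, the median respondent says 30, and only 4 in 10 respondents give an estimate at or above the average. A few very high guesses are enough to pull the mean well above the median.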

Key points:

1. Large majorities overestimate the share of party-stereotypical groups in the party, except for Evangelicals and Southerners.

2. Compared to what people think is the share of a group in the population, people still think the share of the group in the stereotyped party is greater. (But how much more varies a fair bit.)

3. People also generally underestimate the share of counter-stereotypical groups in the party.