American presidential political campaigns, big construction projects, and big-budget moviemaking have a lot in common. They are all complex enterprises with lots of moving parts, they all bring together lots of people for a short period, and they all need people to hit the ground running and execute in lockstep to succeed. Success in these activities relies a lot on great software and the ability to hire competent people quickly. It remains an open opportunity to build great software for these industries, software that allows people to plan and execute together.
Prejudice is a blight on humanity. How to reduce prejudice, thus, is among the most important social scientific questions. In the latest assessment of research in the area, a follow-up to the 2009 Annual Review article, Betsy Paluck et al., however, paint a dim picture. In particular, they note three dismaying things:
Table 1 (see below) makes for grim reading. While one could argue that the pattern is explained by the fact that lab research tends to have smaller samples and has especially powerful treatments, the numbers suggest that publication bias very likely plays a large role: see the average s.e. of the first two rows (it may have been useful to produce a $\sqrt{1/n}$-adjusted s.e.). It is also shocking to know that just a fifth of the studies have treatment groups with 78 or more people.
Light Touch Interventions
The article is remarkably measured when talking about the rise of ‘light touch’ interventions, i.e., short exposure treatments. I would have described them as ‘magical thinking,’ for they seem to be founded in the belief that we can make profound changes in people’s thinking on the cheap. This isn’t to say light-touch interventions can’t be worked into a regime that effects profound change; repeated light touches may work. However, as far as I could tell, no study tried multiple touches to see how the effect accumulates.
Near Contemporaneous Measurement of Dependent Variables
Very few papers judged the efficacy of the intervention a day or more after the intervention. Given the primary estimate of interest is longer-term effects, it is hard to judge the efficacy of the treatments in moving the needle on the actual quantity of interest.
Beyond what the paper notes, here are a couple more things to consider:
- Perspective-getting works better than perspective-taking. It would be good to explore this further in inter-group settings.
- One way to categorize ‘basic research interventions’ is by decomposing the treatment into its primary aspects and then slowly building back up bundles based on data:
  - channel: f2f, audio (radio, etc.), visual (photos, etc.), audio-visual (tv, web, etc.), VR, etc.
  - respondent action: talk, listen, see, imagine, reflect, play with a computer program, work together with someone, play together with someone, receive a public scolding, etc.
  - source: peers, strangers, family, people who look like you, attractive people, researchers, authorities, etc.
  - message type: parable, allegory, story, graph, table, drama, etc.
  - message content: facts, personal stories, examples, Jonathan Haidt style studies that show some of the roots of our morality are based on poor logic, etc.
Content delivery is not optimized for the technical stack used by an overwhelming majority of people. The technical stack of people who aren’t particularly tech-savvy, especially those who are old (over ~60 years), is often a messaging application like FB Messenger or WhatsApp. They currently do not have a way to ‘subscribe’ to Substack newsletters, podcasts, or YouTube videos in the messaging application that they use (see below for an illustration of how this may look on the iPhone messaging app). They miss content. And content producers have an audience hole.
A lot of the content is distributed only via email or distributed within a specific application. There are good strategic reasons for that: you get to monitor consumption, recommend accordingly, control monetization, etc. But the reason why platforms like Substack, which enable independent content producers, limit distribution to email is not as immediately clear. It is unlikely to be a deliberate decision. It is likely a decision based on a lack of infrastructure that connects publishing to various messaging platforms. The future of messaging platforms is Slack, a platform that integrates as many applications as possible. As WhatsApp rolls out its business API, there is a potential to build an integration that allows producers to deliver premium content, leverage other kinds of monetization, like ads, and even build a recommendation stack. Eventually, it would be great to build that kind of integration for each of the messaging platforms, including iMessage, FB Messenger, etc.
Let me end by noting that there is something special about WhatsApp. No one has replicated the mobile phone-based messaging platform. And the idea of enabling a larger stack based on phone numbers remains unplumbed. Duo and FaceTime are great examples but there is potential for so much more. For instance, a calendar app that runs on the mobile phone ID architecture.
The information age has brought both bounty and pestilence. Today, we are deluged with both correct and incorrect information. If we knew how to tell apart correct claims from incorrect, we would have inched that much closer to utopia. But the lack of nous in telling apart generally ‘obvious’ incorrect claims from correct claims has brought us close to the precipice of disarray. Thus, improving people’s ability to identify untrustworthy claims as such takes on urgency. (http://gojiberries.io/2020/08/31/the-misinformation-age-measuring-and-improving-digital-literacy/)
Inferring the Quality of Evidence Behind the Claims: Fact Check and Beyond
One way around misinformation is to rely on an expert army that assesses the truth value of claims. However, assessing the truth value of a claim is hard. It needs expert knowledge and careful research. When validating, we have to identify which parts are wrong, which parts are right but misleading, and which parts are debatable. All in all, it is a noisy and time-consuming process to vet a few claims. Fact check operations, hence, cull a small number of claims and try to validate those claims. As the rate of production of information increases, thwarting misinformation by checking all the claims seems implausibly expensive.
Rather than assess the claims directly, we can assess the process. Or, in particular, the residue of one part of the process for making the claim: sources. Except for claims based on private experience, e.g., religious experience, claims are based on sources. We can use the features of these sources to infer credibility. The first feature is the number of sources cited to make a claim. All else equal, the more sources saying the same thing, the greater the chances that the claim is true. None of this is to undercut a common observation: lots of people can be wrong about something. A second, harder test of veracity is whether a diverse set of people say the same thing. The third test is checking the credibility of the sources.
Relying on the residue is not a panacea. People can simply lie about the source. We want the source to verify what they have been quoted as saying. And in the era of cheap data, this can be easily enabled. Quotes can be linked to video interviews or automatic transcriptions electronically signed by the interviewee. The same system can be scaled to institutions. The downside is that the system may prove onerous. On the other hand, commonly, the same source is cited by many people so a public repository of verified claims and evidence can mitigate much of the burden.
But will this solve the problem? Likely not. For one, people can still commit sins of omission. For two, they can still draft things in misleading ways. For three, trust in sources may not be tied to correctness. All we have done is build a system for establishing provenance. And establishing the provenance is not enough. Instead, we need a system that incentivizes both correctness and presentation that makes correct interpretation highly likely. It is a high bar. But it is the bar: correct and liable to be correctly interpreted.
To create incentives for publishing correct claims, we need to either 1. educate the population, which brings me to the previous post, or 2. find ways to build products and recommendations that incentivize correct claims. We likely need both.
The information age has brought both bounty and pestilence. Today, we are deluged with both correct and incorrect information. If we knew how to tell apart correct claims from incorrect, we would have inched that much closer to utopia. But the lack of nous in telling apart generally ‘obvious’ incorrect claims from correct claims has brought us close to the precipice of disarray. Thus, improving people’s ability to identify untrustworthy claims as such takes on urgency.
Before we find fixes, it is good to measure how bad things are and what things are bad. This is the task the following paper sets itself by creating a ‘digital literacy’ scale. (Digital literacy is an overloaded term. It means many different things, from the ability to find useful information, e.g., information about schools or government programs, to the ability to protect yourself against harm online (see here and here for how frequently people’s accounts are breached and how often they put themselves at risk of malware or phishing), to the ability to identify incorrect claims as such, which is how the paper uses it.)
Rather than build a skill-assessment kind of scale, the paper measures (really predicts) skills indirectly using some other digital literacy scales, whose primary purpose is likely broader. The paper validates the importance of various constituent items using variable importance and model fit kinds of measures. There are a few dangers in doing that:
- Inference using surrogates is dangerous as the weakness of surrogates cannot be fully explored with one dataset. And they are liable not to generalize as underlying conditions change. We ideally want measures that directly measure the construct.
- Variable importance is not the same as important variables. For instance, it isn’t clear why “recognition of the term RSS,” the “highest-performing item by far,” has much to do with skill in identifying untrustworthy claims.
Some other work builds uncalibrated measures of digital literacy (conceived as in the previous paper). As part of an effort to judge the efficacy of a particular way of educating people about how to judge untrustworthy claims, the paper provides measures of trust in claims. The topline is that educating people is not hard (see the appendix for the description of the treatment). A minor treatment (see below) is able to improve “discernment between mainstream and false news headlines.”
Understandably, the effects of this short treatment are ‘small.’ The ITT short-term effect in the US is: “a decrease of nearly 0.2 points on a 4-point scale.” Later in the manuscript, the authors provide the substantive magnitude of the .2 pt net swing using a binary indicator of perceived headline accuracy: “The proportion of respondents rating a false headline as “very accurate” or “somewhat accurate” decreased from 32% in the control condition to 24% among respondents who were assigned to the media literacy intervention in wave 1, a decrease of 7 percentage points.” The .2 pt. net swing on a 4-point scale leading to a 7 percentage point difference is quite remarkable and generally suggests that there is a lot of ‘reverse’ intra-category movement that the crude dichotomization elides. But even if we take the crude categories as the quantity of interest, a month later in the US, the 7 percentage point swing is down to 4 percentage points:
“…the intervention reduced the proportion of people endorsing false headlines as accurate from 33 to 29%, a 4-percentage-point effect. By contrast, the proportion of respondents who classified mainstream news as not very accurate or not at all accurate rather than somewhat or very accurate decreased only from 57 to 55% in wave 1 and 59 to 57% in wave 2.” (Guess et al. 2020)
The opportunity to mount more ambitious treatments remains sizable. So does the opportunity to more precisely understand what aspects of the quality of evidence people find hard to discern. And how we could release products that make their job easier.
By Rob Lytle
At this point, it’s well established that the ANES CDF’s codebook is not to be trusted (I’m repeating “not to be trusted” to include a second link!). Recently, I stumbled across another example of incorrect coding in the cumulative data file, this time in
VCF0731 – Do you ever discuss politics with your family or friends?
The codebook reports 5 levels:
Do you ever discuss politics with your family or friends?
1. Yes
5. No
8. DK
9. NA
INAP. question not used
However, when we load the variable and examine the unique values:
# pulling anes-cdf from a GitHub repository
cdf <- rio::import("https://github.com/RobLytle/intra-party-affect/raw/master/data/raw/cdf-raw-trim.rds")
unique(cdf$VCF0731)
##  NA 5 1 6 7
We see a completely different coding scheme. We are left adrift, wondering: “What is 6? What is 7? Do 1 and 5 really mean ‘yes’ and ‘no’?”
We may never know.
For a survey that costs several million dollars to conduct, you’d think we could expect a double-checked codebook (or at least some kind of version control to easily fix these things as they’re identified).
A new paper purportedly shows that the 2018 release of the Apple Watch, which supported the ECG app, did not cause an increase in AFib diagnoses (mean = −0.008).
They make the claim based on 60M visits across 1,270 practices over 2 years.
Here are some things to think about:
- Expected effect size. Say the base AF rate is .41%. Let’s say 10% have the ECG app + Apple Watch. (You have to make some assumptions about how quickly people downloaded the app. I am making a generous assumption that 10% do it the day of release.) For the 10%, say the rate is .51%. Additional diagnoses expected = .1 × .001 × 30M ≈ 3k.
- Time trend. 2018-19 line is significantly higher (given the baseline) than 2016-2017. It is unlikely to be explained by the aging of the population. Is there a time trend? What explains it? More acutely, diff. in diff. doesn’t account for that.
- Choice of the time period. When you have observations over multiple time periods pre-treatment and post-treatment, the inference depends on which time period you use. For instance, if I do an “ocular distortion test”, the diff. in diff. with observations from Aug./Sep. would suggest a large positive impact. For a more transparent account of assumptions, see diff.healthpolicydatascience.org (h/t Kyle Foreman).
- Clustering of s.e. Some correlation in diagnosis because of facility (doctor) which is unaccounted for.
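The back-of-the-envelope arithmetic in the first bullet can be sketched in a few lines. This is only an illustration of that calculation: the 30M figure is taken from the bullet and is assumed here to be the number of post-release visits, and the function names are hypothetical.

```python
# Assumptions taken from the "Expected effect size" bullet above.
POST_RELEASE_VISITS = 30_000_000  # assumed: the 30M used in the text
BASE_AF_RATE = 0.0041             # base AF rate of .41%
ADOPTER_SHARE = 0.10              # generous guess: 10% have the app + watch
ADOPTER_AF_RATE = 0.0051          # assumed AF rate among adopters (.51%)


def expected_rate_change(share, base_rate, treated_rate):
    """Change in the overall rate if `share` of people move to `treated_rate`."""
    return share * (treated_rate - base_rate)


def expected_additional_diagnoses(visits, share, base_rate, treated_rate):
    """Extra diagnoses implied by the overall rate change."""
    return visits * expected_rate_change(share, base_rate, treated_rate)


rate_change = expected_rate_change(ADOPTER_SHARE, BASE_AF_RATE, ADOPTER_AF_RATE)
extra = expected_additional_diagnoses(
    POST_RELEASE_VISITS, ADOPTER_SHARE, BASE_AF_RATE, ADOPTER_AF_RATE
)

print(f"Overall rate change: {rate_change * 100:.3f} percentage points")
print(f"Expected additional diagnoses: {extra:,.0f}")
```

Under these assumptions, the overall rate moves by about a hundredth of a percentage point, which is the scale of effect the design would need to be able to detect.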
Tools define science. Not only do they determine how science is practiced but also what questions are asked. Take survey experiments, for example. Since the advent of online survey platforms, which made conducting survey experiments trivial, the lure of convenience and internal validity has persuaded legions of researchers to use survey experiments to understand the world.
Conventional survey experiments are modest tools. Paul Sniderman writes,
“These three limitations of survey experiments—modesty of treatment, modesty of scale, and modesty of measurement—need constantly to be borne in mind when brandishing term experiment as a prestige enhancer.” (Paul Sniderman)
Note: We can collapse these three concerns into two: treatment (which includes ‘scale,’ which Paul defines as the amount of time) and measurement.
But skillful artisans have used this modest tool to great effect. Famously, Kahneman and Tversky used survey experiments, e.g., Asian Disease Problem, to shed light on how people decide. More recently, Paul Sniderman and Tom Piazza have used survey experiments to shed light on an unsavory aspect of human decision making: discrimination. Aside from shedding light on human decision making, researchers have also used survey experiments to understand what survey measures mean, e.g., Ahler and Sood.
The good, however, has come with the bad; insight has often come with irreflection. In particular, Paul Sniderman implicitly points to two common mistakes that people make:
- Not Learning From the Control Group. The focus on differences in means means that we sometimes fail to reflect on what the data in the Control Group tells us about the world. Take the paper on partisan expressive responding, for instance. The topline from the paper is that expressive responding explains half of the partisan gap. But it misses the bigger story—the partisan differences in the Control Group are much smaller than what people expect, just about 6.5% (see here). (Here’s what I wrote in 2016.)
- Not Putting the Effect Size in Context. A focus on significance testing means that we sometimes fail to reflect on the modesty of effect sizes. For instance, providing people $1 for a correct answer within the context of an online survey interview is a large premium. And if providing a dollar each on 12 (included) questions nudges people from an average of 4.5 correct responses to 5, it suggests that people are resistant to learning or impressively confident that what they know is right. Leaving $7 on the table tells us more than the .5, around which the paper is written.
More broadly, researchers are obtuse to the point that sometimes what the results show is how impressively modest the movement is when you ratchet up the dosage. For instance, if an overwhelming number of African Americans favor Whites who have scored just a few points more than a Black student, it is a telling testament to their endorsement of meritocracy.
“She took a position—which has actually become very popular in India now, not coming from the left these days, but from the right—that what you have to concentrate on is simply maximizing economic growth. Once you have grown and become rich, then you can do health care, education, and all this other stuff. Which I think is one of the more profound errors that you can make in development planning. Somehow Joan had a lot of sympathy for that position. In fact, she strongly criticized Sri Lanka for offering highly subsidized food to everyone on nutritional grounds. I remember the phrase she used: “Sri Lanka is trying to taste the fruit of the tree without growing it.” (Amartya Sen)
“On the unemployment issue I may well be, but if I compare an economist like Keynes, who never took a serious interest in inequality, in poverty, in the environment, with Pigou, who took an interest in all of them, I don’t think I would be able to say exactly what you are asking me to say.” (Amartya Sen)
On the 1943 Bengal Famine, the last big famine in India in which ~ 3M people perished:
“Basically I had figured out on the basis of the little information I had (that indeed everyone had) that the problem was not that the British had the wrong data, but that their theory of famine was completely wrong. The government was claiming that there was so much food in Bengal that there couldn’t be a famine. Bengal, as a whole, did indeed have a lot of food—that’s true. But that’s supply; there’s also demand, which was going up and up rapidly, pushing prices sky-high. Those left behind in a boom economy—a boom generated by the war—lost out in the competition for buying food.”
“I learned also—which I knew as a child—that you could have a famine with a lot of food around. And how the country is governed made a difference. The British did not want rebellion in Calcutta. I believe no one of Calcutta died in the famine. People died in Calcutta, but they were not of Calcutta. They came from elsewhere, because what little charity there was came from Indian businessmen based in Calcutta. The starving people kept coming into Calcutta in search of free food, but there was really not much of that. The Calcutta people were entirely protected by the Raj to prevent discontent of established people during the war. Three million people in Calcutta had ration cards, which entailed that at least six million people were being fed at a very subsidized price of food. What the government did was to buy rice at whatever price necessary to purchase it in the rural areas, making the rural prices shoot up. The price of rationed food in Calcutta for established residents was very low and highly subsidized, though the market price in Calcutta—outside the rationing network—rose with the rural price increase.” (Amartya Sen)
On Adam Smith
“He discussed why you have to think pragmatically about the different institutions to be combined together, paying close attention to how they respectively work. There’s a passage where he’s asking himself the question, Why do we strongly want a good political economy? Why is it important? One answer—not the only one—is that it will lead to high economic growth (this is my language, not Smith’s). I’m not quoting his words, but he talks about the importance of high growth, high rate of progress. But why is that important? He says it’s important for two distinct reasons. First, it gives the individual more income, which in turn helps people to do what they would value doing. Smith is talking here about people having more capability. He doesn’t use the word capability, but that’s what he is talking about here. More income helps you to choose the kind of life that you’d like to lead. Second, it gives the state (which he greatly valued as an institution when properly used) more revenue, allowing it to do those things which only the state can do well. As an example, he talks about the state being able to provide free school education.” (Amartya Sen)
“Thus, when we calculate the net degree of expressive responding by subtracting the acceptance effect from the rejection effect—essentially differencing off the baseline effect of the incentive from the reduction in rumor acceptance with payment—we find that the net expressive effect is negative 0.5%—the opposite sign of what we would expect if there was expressive responding. However, the substantive size of the estimate of the expressive effect is trivial. Moreover, the standard error on this estimate is 10.6, meaning the estimate of expressive responding is essentially zero.” (https://journals.uchicago.edu/doi/abs/10.1086/694258)
(Note: This is not a full review of all the claims in the paper. There is more data in the paper than in the quote above. I am merely using the quote to clarify a couple of statistical points.)
There are two main points:
- The facts that the estimate is close to zero and that the s.e. is super fat are technically unrelated. The last line of the quote, however, seems to draw a relationship between the two.
- The estimated effect sizes of expressive responding in the literature are much smaller than the s.e. Bullock et al. (Table 2) estimate the effect of expressive responding at about 4% and Prior et al. (Figure 1) at about 5.5% (“Figure 1(a) shows, the model recovers the raw means from Table 1, indicating a drop in bias from 11.8 to 6.3.”). Thus, one reasonable inference is that the study is underpowered to reasonably detect expected effect sizes.
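To make the second point concrete, here is a minimal sketch of the power calculation it implies. It assumes a two-sided z-test at the 5% level, a normal sampling distribution, and that the quoted s.e. of 10.6 is on the same percentage-point scale as the 4 and 5.5 point estimates; the function names are illustrative and none of this comes from the paper.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def minimum_detectable_effect(se, alpha=0.05, power=0.80):
    """Smallest true effect a two-sided z-test detects with the given power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return (z_alpha + z_power) * se


def power_to_detect(effect, se, alpha=0.05):
    """Probability of rejecting zero when the true effect is `effect`."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect / se
    upper_tail = 1 - STD_NORMAL.cdf(z_alpha - shift)
    lower_tail = STD_NORMAL.cdf(-z_alpha - shift)
    return upper_tail + lower_tail


SE = 10.6  # standard error reported in the quote (assumed percentage points)

mde = minimum_detectable_effect(SE)
print(f"Minimum detectable effect at 80% power: {mde:.1f} points")

# Effect sizes from Bullock et al. (~4) and Prior et al. (~5.5), per the text.
for effect in (4.0, 5.5):
    print(f"Power to detect a {effect}-point effect: {power_to_detect(effect, SE):.2f}")
```

With an s.e. this large, the minimum detectable effect is several times the effect sizes reported in the literature, and the power to detect effects of 4 to 5.5 points is only slightly above the 5% false-positive rate.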