Dismissed Without Prejudice: Evaluating Prejudice Reduction Research

25 Sep

Prejudice is a blight on humanity. How to reduce prejudice is thus among the most important social scientific questions. In the latest assessment of research in the area, a follow-up to the 2009 Annual Review article, however, Betsy Paluck et al. paint a dim picture. In particular, they note three dismaying things:

Publication Bias

Table 1 (see below) makes for grim reading. While one could argue that the pattern is explained by the fact that lab research tends to have smaller samples and especially powerful treatments, the numbers suggest—see the average s.e. of the first two rows (it may have been useful to produce a √(1/n)-adjusted s.e.)—that publication bias very likely plays a large role. It is also shocking that just a fifth of the studies have treatment groups with 78 or more people.
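To make the adjustment concrete: since the s.e. of a mean scales roughly as 1/√n, multiplying each study's s.e. by √n removes the mechanical effect of sample size before comparing rows. A minimal sketch with made-up numbers (not the table's):

```python
import math

# Hypothetical (n, s.e.) pairs; s.e. scales roughly as 1/sqrt(n),
# so multiplying each s.e. by sqrt(n) puts studies with different
# sample sizes on a comparable footing.
studies = [(50, 0.30), (200, 0.15), (800, 0.08)]

adjusted_se = [se * math.sqrt(n) for n, se in studies]
```

After adjustment, the first two (hypothetical) studies look identical; any remaining gap across rows is not attributable to sample size alone.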

Light Touch Interventions

The article is remarkably measured when talking about the rise of ‘light touch’ interventions—short exposure treatments. I would have described them as ‘magical thinking,’ for they seem to be founded in the belief that we can make profound changes in people’s thinking on the cheap. This isn’t to say light-touch interventions can’t be worked into a regime that effects profound change—repeated light touches may work. However, as far as I could tell, no study tried multiple touches to see how the effects cumulate.

Near Contemporaneous Measurement of Dependent Variables

Very few papers judged the efficacy of the intervention a day or more after the intervention. Given that the primary estimate of interest is the longer-term effect, it is hard to judge the efficacy of the treatments in moving the needle on the actual quantity of interest.

Beyond what the paper notes, here are a couple more things to consider:

  1. Perspective-getting works better than perspective-taking. It would be good to explore this further in inter-group settings.
  2. One way to categorize ‘basic research interventions’ is by decomposing the treatment into its primary aspects and then slowly building back up bundles based on data:
    1. channel: f2f, audio (radio, etc.), visual (photos, etc.), audio-visual (tv, web, etc.), VR, etc.
    2. respondent action: talk, listen, see, imagine, reflect, play with a computer program, work together with someone, play together with someone, receive a public scolding, etc.
    3. source: peers, strangers, family, people who look like you, attractive people, researchers, authorities, etc.
    4. message type: parable, allegory, story, graph, table, drama, etc.
    5. message content: facts, personal stories, examples, Jonathan Haidt style studies that show some of the roots of our morality are based on poor logic, etc.

everywhere: meeting consumers where they are

1 Sep

Content delivery is not optimized for the technical stack used by an overwhelming majority of people. The technical stack of people who aren’t particularly tech-savvy, especially those who are old (over ~60 years), is often a messaging application like FB Messenger or WhatsApp. They currently do not have a way to ‘subscribe’ to Substack newsletters or podcasts or YouTube videos in the messaging application that they use (see below for an illustration of how this may look in the iPhone messaging app). They miss content. And content producers have an audience hole.

Credit: Gaurav Gandhi

A lot of the content is distributed only via email or distributed within a specific application. There are good strategic reasons for that—you get to monitor consumption, recommend accordingly, control monetization, etc. But the reason why platforms like Substack, which enable independent content producers, limit distribution to email is not as immediately clear. It is unlikely to be a deliberate decision. More likely, it stems from a lack of infrastructure connecting publishing to the various messaging platforms. The future of messaging platforms is Slack—a platform that integrates as many applications as possible. As WhatsApp rolls out its business API, there is the potential to build an integration that allows producers to deliver premium content, leverage other kinds of monetization, like ads, and even build a recommendation stack. Eventually, it would be great to build that kind of integration for each of the messaging platforms, including iMessage, FB Messenger, etc.

Let me end by noting that there is something special about WhatsApp. No one has replicated the mobile phone-based messaging platform. And the idea of enabling a larger stack based on phone numbers remains unplumbed. Duo and FaceTime are great examples, but there is potential for so much more. For instance, a calendar app that runs on the mobile phone ID architecture.

The (Mis)Information Age: Provenance is Not Enough

31 Aug

The information age has brought both bounty and pestilence. Today, we are deluged with both correct and incorrect information. If we knew how to tell apart correct claims from incorrect ones, we would have inched that much closer to utopia. But the lack of nous in telling apart generally ‘obvious’ incorrect claims from correct claims has brought us close to the precipice of disarray. Thus, improving people’s ability to identify untrustworthy claims as such takes on urgency.

http://gojiberries.io/2020/08/31/the-misinformation-age-measuring-and-improving-digital-literacy/

Inferring the Quality of Evidence Behind the Claims: Fact Check and Beyond

One way around misinformation is to rely on an expert army that assesses the truth value of claims. However, assessing the truth value of a claim is hard. It needs expert knowledge and careful research. When validating, we have to identify which parts are wrong, which parts are right but misleading, and which parts are debatable. All in all, vetting even a few claims is a noisy and time-consuming process. Fact-check operations, hence, cull a small number of claims and try to validate those. As the rate of production of information increases, thwarting misinformation by checking all the claims seems implausibly expensive.

Rather than assess the claims directly, we can assess the process. Or, in particular, the residue of one part of the process of making the claim—sources. Except for claims based on private experience, e.g., religious experience, claims are based on sources. We can use the features of these sources to infer credibility. The first feature is the number of sources cited to make a claim. All else equal, the greater the number of sources saying the same thing, the greater the chances that the claim is true. None of this is to undercut a common observation: lots of people can be wrong about something. Hence a second, harder test for veracity: whether a diverse set of people say the same thing. The third test is checking the credibility of the sources.

Relying on the residue is not a panacea. People can simply lie about the source. We want the source to verify what they have been quoted as saying. And in the era of cheap data, this can be easily enabled. Quotes can be linked to video interviews or automatic transcriptions electronically signed by the interviewee. The same system can be scaled to institutions. The downside is that the system may prove onerous. On the other hand, commonly, the same source is cited by many people so a public repository of verified claims and evidence can mitigate much of the burden.
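A public repository of verified claims could be as simple as a lookup keyed on the quote's text. A minimal sketch (the scheme and all names here are hypothetical, not an existing system):

```python
import hashlib

# Hypothetical public repository: maps a hash of a quote's text to a
# provenance record, so anyone citing the quote can check whether the
# source has verified it.
repository = {}

def register_quote(quote, source, evidence_url):
    """The source (or a trusted intermediary) registers a verified quote."""
    digest = hashlib.sha256(quote.encode("utf-8")).hexdigest()
    repository[digest] = {"source": source, "evidence": evidence_url}
    return digest

def verify_quote(quote):
    """Returns the provenance record, or None if the quote is unverified."""
    digest = hashlib.sha256(quote.encode("utf-8")).hexdigest()
    return repository.get(digest)
```

A real deployment would use digital signatures from the source rather than a trusted registry, but the lookup logic is the same: even a single altered word changes the hash, so misquotes fail verification.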

But will this solve the problem? Likely not. For one, people can still commit sins of omission. For two, they can still draft things in misleading ways. For three, trust in sources may not be tied to correctness. All we have done is build a system for establishing provenance. And establishing provenance is not enough. Instead, we need a system that incentivizes both correctness and presentation that makes correct interpretation highly likely. It is a high bar. But it is the right bar—correct and liable to be correctly interpreted.

To create incentives for publishing correct claims, we need to either 1. educate the population, which brings me to the previous post, or 2. find ways to build products and recommendations that incentivize correct claims. We likely need both.

The (Mis)Information Age: Measuring and Improving ‘Digital Literacy’

31 Aug

The information age has brought both bounty and pestilence. Today, we are deluged with both correct and incorrect information. If we knew how to tell apart correct claims from incorrect ones, we would have inched that much closer to utopia. But the lack of nous in telling apart generally ‘obvious’ incorrect claims from correct claims has brought us close to the precipice of disarray. Thus, improving people’s ability to identify untrustworthy claims as such takes on urgency.

Before we find fixes, it is good to measure how bad things are and what things are bad. This is the task the following paper sets itself by creating a ‘digital literacy’ scale. (Digital literacy is an overloaded term. It means many different things, from the ability to find useful information, e.g., information about schools or government programs, to the ability to protect yourself against harm online (see here and here for how frequently people’s accounts are breached and how often they put themselves at risk of malware or phishing), to the ability to identify incorrect claims as such, which is how the paper uses it.)

Rather than build a skill-assessment kind of scale, the paper measures (really, predicts) skills indirectly using some other digital literacy scales, whose primary purpose is likely broader. The paper validates the importance of the various constituent items using variable importance and model-fit measures. There are a few dangers in doing that:

  1. Inference using surrogates is dangerous as the weakness of surrogates cannot be fully explored with one dataset. And they are liable not to generalize as underlying conditions change. We ideally want measures that directly measure the construct.
  2. Variable importance is not the same as important variables. For instance, it isn’t clear why “recognition of the term RSS,” the “highest-performing item by far” has much to do with skill in identifying untrustworthy claims.

Some other work builds uncalibrated measures of digital literacy (conceived as in the previous paper). As part of an effort to judge the efficacy of a particular way of educating people about how to judge untrustworthy claims, the paper provides measures of trust in claims. The topline is that educating people is not hard (see the appendix for the description of the treatment). A minor treatment (see below) is able to improve “discernment between mainstream and false news headlines.”

Understandably, the effects of this short treatment are ‘small.’ The ITT short-term effect in the US is: “a decrease of nearly 0.2 points on a 4-point scale.” Later in the manuscript, the authors give the substantive magnitude of the 0.2-point net swing using a binary indicator of perceived headline accuracy: “The proportion of respondents rating a false headline as “very accurate” or “somewhat accurate” decreased from 32% in the control condition to 24% among respondents who were assigned to the media literacy intervention in wave 1, a decrease of 7 percentage points.” A 0.2-point net swing on a 4-point scale leading to a 7-percentage-point difference is quite remarkable and suggests that there is a lot of ‘reverse’ intra-category movement that the crude dichotomization elides. But even if we take the crude categories as the quantity of interest, a month later in the US, the 7-percentage-point swing is down to 4 percentage points:

“…the intervention reduced the proportion of people endorsing false headlines as accurate from 33 to 29%, a 4-percentage-point effect. By contrast, the proportion of respondents who classified mainstream news as not very accurate or not at all accurate rather than somewhat or very accurate decreased only from 57 to 55% in wave 1 and 59 to 57% in wave 2.”

Guess et al. 2020
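To see how a small mean shift can coexist with a sizeable dichotomized swing, here is a toy sketch with made-up response distributions (not the paper's data) chosen to mirror the magnitudes above:

```python
# Hypothetical distributions over a 4-point accuracy scale:
# 1 = not at all accurate ... 4 = very accurate.
control = {1: 0.20, 2: 0.48, 3: 0.22, 4: 0.10}  # P(rating) under control
treated = {1: 0.28, 2: 0.48, 3: 0.16, 4: 0.08}  # P(rating) under treatment

def mean(dist):
    return sum(k * p for k, p in dist.items())

def share_accurate(dist):
    return dist[3] + dist[4]  # "somewhat" or "very" accurate

mean_shift = mean(control) - mean(treated)  # ~0.2 points on the 4-pt scale
dichotomized_drop = share_accurate(control) - share_accurate(treated)  # ~8 pp
```

With these (invented) numbers, a mean movement of under 0.2 points sits alongside an 8-percentage-point drop in the dichotomized share, because the movement concentrates around the 2/3 cut point.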

The opportunity to mount more ambitious treatments remains sizable. So does the opportunity to more precisely understand what aspects of the quality of evidence people find hard to discern. And how we could release products that make their job easier.

Another ANES Goof-em-up: VCF0731

30 Aug

By Rob Lytle

At this point, it’s well established that the ANES CDF’s codebook is not to be trusted (I’m repeating “not to be trusted” to include a second link!). Recently, I stumbled across another example of incorrect coding in the cumulative data file, this time in VCF0731 – Do you ever discuss politics with your family or friends?

The codebook reports 5 levels:

Do you ever discuss politics with your family or friends?

1. Yes
5. No

8. DK
9. NA

INAP. question not used

However, when we load the variable and examine the unique values:

# pulling the ANES CDF from a GitHub repository
cdf <- rio::import("https://github.com/RobLytle/intra-party-affect/raw/master/data/raw/cdf-raw-trim.rds")

unique(cdf$VCF0731)
## [1] NA  5  1  6  7

We see a completely different coding scheme. We are left adrift, wondering “What is 6? What is 7?” Do 1 and 5 really mean “yes” and “no”?

We may never know.
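At minimum, discrepancies like this are easy to flag mechanically. A quick sanity-check sketch (values transcribed from the output above; the check itself is mine, not part of the ANES tooling):

```python
# Compare observed values against the levels the codebook documents.
codebook_levels = {1, 5, 8, 9}        # levels listed for VCF0731
observed_values = {None, 5, 1, 6, 7}  # NA and unique values from the data

undocumented = {v for v in observed_values
                if v is not None and v not in codebook_levels}
# undocumented == {6, 7}
```

Running a check like this over every variable against a machine-readable codebook would catch these goof-ups before release.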

For a survey that costs several million dollars to conduct, you’d think we could expect a double-checked codebook (or at least some kind of version control to easily fix these things as they’re identified).

AFib: Apple Watch Did Not Increase Atrial Fibrillation Diagnoses

28 Aug

A new paper purportedly shows that the release of the 2018 Apple Watch, which supported an ECG app, did not cause an increase in AFib diagnoses (mean = −0.008).

They make the claim based on 60M visits from 1,270 practices across 2 years.

Here are some things to think about:

  1. Expected effect size. Say the base AF rate is .41%. Let’s say 10% have the ECG app + Apple Watch. (You have to make some assumptions about how quickly people downloaded the app. I am making the generous assumption that 10% do it the day of release.) For the 10%, say the rate is .51%. Additional diagnoses expected = .001 * 3M ≈ 3k.
  2. Time trend. The 2018-19 line is significantly higher (given the baseline) than the 2016-17 line. That is unlikely to be explained by the aging of the population. Is there a time trend? What explains it? More acutely, diff.-in-diff. doesn’t account for that.
  3. Choice of the time period. When you have observations over multiple time periods pre-treatment and post-treatment, the inference depends on which time period you use. For instance,  if I do an “ocular distortion test”, the diff. in diff. with observations from Aug./Sep. would suggest a large positive impact. For a more transparent account of assumptions, see diff.healthpolicydatascience.org (h/t Kyle Foreman).
  4. Clustering of s.e.s. There is some correlation in diagnoses within a facility (doctor) that is unaccounted for.
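The expected-effect-size arithmetic in point 1 can be written out explicitly (all of these are my assumed numbers, as stated above, not the paper's):

```python
# Back-of-the-envelope expected effect of the Watch release on AFib diagnoses.
post_period_visits = 30_000_000  # roughly half of the 60M visits fall post-release
watch_share = 0.10               # assumed share with the ECG app + Watch
base_rate = 0.0041               # baseline AFib diagnosis rate (0.41%)
watch_rate = 0.0051              # assumed rate among Watch owners (0.51%)

expected_extra = post_period_visits * watch_share * (watch_rate - base_rate)
# ≈ 3,000 additional diagnoses
```

Against a baseline of roughly 123k diagnoses in 30M visits, an expected bump of ~3k is small, which is the point: the design needs to be powered for an effect of that size.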

Survey Experiments With Truth: Learning From Survey Experiments

27 Aug

Tools define science. Not only do they determine how science is practiced but also what questions are asked. Take survey experiments, for example. Since the advent of online survey platforms, which made conducting survey experiments trivial, the lure of convenience and internal validity has persuaded legions of researchers to use survey experiments to understand the world.

Conventional survey experiments are modest tools. Paul Sniderman writes,

“These three limitations of survey experiments—modesty of treatment, modesty of scale, and modesty of measurement—need constantly to be borne in mind when brandishing term experiment as a prestige enhancer.”

Paul Sniderman

Note: We can collapse these three concerns into two— treatment (which includes ‘scale’ as Paul defines it— the amount of time) and measurement.

But skillful artisans have used this modest tool to great effect. Famously, Kahneman and Tversky used survey experiments, e.g., the Asian Disease Problem, to shed light on how people decide. More recently, Paul Sniderman and Tom Piazza have used survey experiments to shed light on an unsavory aspect of human decision making: discrimination. Aside from shedding light on human decision making, researchers have also used survey experiments to understand what survey measures mean, e.g., Ahler and Sood.

The good, however, has come with the bad; insight has often come with irreflection. In particular, Paul Sniderman implicitly points to two common mistakes that people make:

  1. Not Learning From the Control Group. The focus on differences in means means that we sometimes fail to reflect on what the data in the Control Group tells us about the world. Take the paper on partisan expressive responding, for instance. The topline from the paper is that expressive responding explains half of the partisan gap. But it misses the bigger story—the partisan differences in the Control Group are much smaller than what people expect, just about 6.5% (see here). (Here’s what I wrote in 2016.)
  2. Not Putting the Effect Size in Context. A focus on significance testing means that we sometimes fail to reflect on the modesty of effect sizes. For instance, providing people $1 for a correct answer within the context of an online survey interview is a large premium. And if providing a dollar each on 12 (included) questions nudges people from an average of 4.5 correct responses to 5, it suggests that people are resistant to learning or impressively confident that what they know is right. Leaving $7 on the table tells us more than the .5, around which the paper is written. 

    More broadly, researchers sometimes fail to note that what the results show is how impressively modest the movement is when you ratchet up the dosage. For instance, if an overwhelming number of African Americans favor White students who have scored just a few points more than a Black student, it is a telling testament to their endorsement of meritocracy.

Amartya Sen on Keynes, Robinson, Smith, and the Bengal Famine

17 Aug

Sen in conversation with Angus Deaton and Tim Besley: pdf and video.

Excerpts:

On Joan Robinson

“She took a position—which has actually become very popular in India now, not coming from the left these days, but from the right—that what you have to concentrate on is simply maximizing economic growth. Once you have grown and become rich, then you can do health care, education, and all this other stuff. Which I think is one of the more profound errors that you can make in development planning. Somehow Joan had a lot of sympathy for that position. In fact, she strongly criticized Sri Lanka for offering highly subsidized food to everyone on nutritional grounds. I remember the phrase she used: “Sri Lanka is trying to taste the fruit of the tree without growing it.””

Amartya Sen

On Keynes:

“On the unemployment issue I may well be, but if I compare an economist like Keynes, who never took a serious interest in inequality, in poverty, in the environment, with Pigou, who took an interest in all of them, I don’t think I would be able to say exactly what you are asking me to say.”

Amartya Sen

On the 1943 Bengal Famine, the last big famine in India in which ~ 3M people perished:

“Basically I had figured out on the basis of the little information I had (that indeed everyone had) that the problem was not that the British had the wrong data, but that their theory of famine was completely wrong. The government was claiming that there was so much food in Bengal that there couldn’t be a famine. Bengal, as a whole, did indeed have a lot of food—that’s true. But that’s supply; there’s also demand, which was going up and up rapidly, pushing prices sky-high. Those left behind in a boom economy—a boom generated by the war—lost out in the competition for buying food.”

“I learned also—which I knew as a child—that you could have a famine with a lot of food around. And how the country is governed made a difference. The British did not want rebellion in Calcutta. I believe no one of Calcutta died in the famine. People died in Calcutta, but they were not of Calcutta. They came from elsewhere, because what little charity there was came from Indian businessmen based in Calcutta. The starving people kept coming into Calcutta in search of free food, but there was really not much of that. The Calcutta people were entirely protected by the Raj to prevent discontent of established people during the war. Three million people in Calcutta had ration cards, which entailed that at least six million people were being fed at a very subsidized price of food. What the government did was to buy rice at whatever price necessary to purchase it in the rural areas, making the rural prices shoot up. The price of rationed food in Calcutta for established residents was very low and highly subsidized, though the market price in Calcutta—outside the rationing network—rose with the rural price increase.”

Amartya Sen

On Adam Smith

“He discussed why you have to think pragmatically about the different institutions to be combined together, paying close attention to how they respectively work. There’s a passage where he’s asking himself the question, Why do we strongly want a good political economy? Why is it important? One answer—not the only one—is that it will lead to high economic growth (this is my language, not Smith’s). I’m not quoting his words, but he talks about the importance of high growth, high rate of progress. But why is that important? He says it’s important for two distinct reasons. First, it gives the individual more income, which in turn helps people to do what they would value doing. Smith is talking here about people having more capability. He doesn’t use the word capability, but that’s what he is talking about here. More income helps you to choose the kind of life that you’d like to lead. Second, it gives the state (which he greatly valued as an institution when properly used) more revenue, allowing it to do those things which only the state can do well. As an example, he talks about the state being able to provide free school education.”

Amartya Sen

Nothing to See Here: Statistical Power and “Oversight”

13 Aug

“Thus, when we calculate the net degree of expressive responding by subtracting the acceptance effect from the rejection effect—essentially differencing off the baseline effect of the incentive from the reduction in rumor acceptance with payment—we find that the net expressive effect is negative 0.5%—the opposite sign of what we would expect if there was expressive responding. However, the substantive size of the estimate of the expressive effect is trivial. Moreover, the standard error on this estimate is 10.6, meaning the estimate of expressive responding is essentially zero.”

https://journals.uchicago.edu/doi/abs/10.1086/694258

(Note: This is not a full review of all the claims in the paper. There is more data in the paper than in the quote above. I am merely using the quote to clarify a couple of statistical points.)

There are two main points:

  1. The fact that the estimate is close to zero and the fact that the s.e. is super fat are technically unrelated. The last line of the quote, however, seems to draw a relationship between the two.
  2. The estimated effect sizes of expressive responding in the literature are much smaller than the s.e. Bullock et al. (Table 2) estimate the effect of expressive responding at about 4% and Prior et al. (Figure 1) at about ~ 5.5% (“Figure 1(a) shows, the model recovers the raw means from Table 1, indicating a drop in bias from 11.8 to 6.3.”). Thus, one reasonable inference is that the study is underpowered to reasonably detect expected effect sizes.
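The underpowering claim in point 2 can be checked with a rough normal-approximation power sketch (my calculation, not the paper's; it assumes the estimate is approximately normal with the reported s.e. of 10.6):

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power(effect, se, z_crit=1.96):
    # P(|estimate/se| > z_crit) when the true effect is `effect`.
    z = effect / se
    return (1 - normal_cdf(z_crit - z)) + normal_cdf(-z_crit - z)

power_4 = power(4.0, 10.6)    # true effect ~4 pts (Bullock et al.)
power_55 = power(5.5, 10.6)   # true effect ~5.5 pts (Prior et al.)
```

With these inputs, power comes out under 10% for either literature-based effect size, barely above the 5% false-positive rate: a null result was close to guaranteed regardless of whether expressive responding exists.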

Casual Inference: Errors in Everyday Causal Inference

12 Aug

Why are things the way they are? What is the effect of something? Both of these reverse and forward causation questions are vital.

When I was at Stanford, I took a class with a pugnacious psychometrician, David Rogosa. David had two pet peeves, one of which was people making causal claims with observational data. And it is in David’s class that I learned the pejorative for such claims. With great relish, David referred to such claims as ‘casual inference.’ (Since then, I have come up with another pejorative phrase for such claims—cosal inference—as in merely dressing up as causal inference.)

It turns out that despite its limitations, casual inference is quite common. Here are some fashionable costumes:

  1. 7 Habits of Successful People: We have all seen business books with such titles. The underlying message of these books is: adopt these habits, and you will be successful too! Let’s follow the reasoning and see where it falls apart. One stereotype about successful people is that they wake up early. And the implication is that if you wake up early, you can be successful too. It *seems* right. It agrees with folk wisdom that discomfort causes success. But can we reliably draw inferences about what less successful people should do based on what successful people do? No. For one, we know nothing about the habits of less successful people. It could be that less successful people wake up *earlier* than the more successful people. Certainly, growing up in India, I recall daily laborers waking up much earlier than people living in bungalows. And when you think of it, the claim that servants wake up before masters seems uncontroversial. It may even be routine enough to be canonized as a law—the Downton Abbey law. The upshot is that when you select on the dependent variable, i.e., only look at cases where the variable takes certain values, e.g., only look at the habits of financially successful people, even correlation is not guaranteed. This means that you don’t even get to mock the claim with the jibe that “correlation is not causation.”

    Let’s go back to Goji’s delivery service for another example. One of the ‘tricks’ we had discussed was to sample failures. If you do that, you are selecting on the dependent variable. And while it is a good heuristic, it can lead you astray. For instance, let’s say that most of the late deliveries are early morning deliveries. You may infer that delivering at another time may improve outcomes. Except, when you look at the data, you find that the bulk of your deliveries are in the morning. And the rate at which deliveries run late is *lower* in the early morning than during other times.

    There is a yet more famous example of things going awry when you select on the dependent variable. During World War II, statisticians were asked where armor should be added on planes. On the aircraft that returned, the damage was concentrated in a few areas, like the wings. The top-of-head answer is to reinforce the areas hit most often. But if you think about the planes that didn’t return, you get to the right answer: reinforce the areas that weren’t hit. In the literature, people call this kind of error survivorship bias. But it is at heart a problem of selecting on the dependent variable—here, whether or not a plane returned.

  2. More frequent system crashes cause people to renew their software license. It is a mistake to treat correlation as causation. There are many reasons why doing so can lead you astray. The rarest reason is that lots of odd things are correlated in the world by luck alone. The point is hilariously illustrated by a set of graphs showing a large correlation between conceptually unrelated things, e.g., between total worldwide non-commercial space launches and the number of sociology doctorates awarded each year.

    A more common scenario is illustrated by the example in the title of this point. Commonly, there is a ‘lurking’ or ‘confounding’ variable that explains both sides. In our case, the more frequently a person uses a system, the more the number of crashes. And it makes sense that people who use the system most frequently also need the software the most and renew the license most often.

    Another common but more subtle reason is Simpson’s paradox. Sometimes the correlation you see is “wrong.” You may see a correlation in the aggregate, but the correlation runs the opposite way when you break the data down by group. Gender bias in U.C. Berkeley admissions provides a famous example. In 1973, 44% of the men who applied to graduate programs were admitted, whereas only 35% of the women were. But when you split by department—which actually controlled admissions—women generally had a higher batting average than men. The reason for the reversal was that women applied more often to more competitive departments, like (wait for it) English, while men were more likely to apply to less competitive departments, like engineering. None of this is to say that there isn’t bias against women. It is merely to point out that the pattern in aggregated data may not hold when you split the data into relevant chunks.

    It is also important to keep in mind the flip side of ‘correlation is not causation’: lack of correlation does not imply a lack of causation.

  3. Mayor Giuliani brought the NYC crime rate down. There are two potential errors here:
    • Forgetting about ecological trends. Crime rates in other big US cities went down at the same time as they did in NY, sometimes more steeply. When faced with a causal claim, it is good to check how ‘similar’ units fared. The Difference-in-Differences estimator builds on this intuition.
    • Treating temporally proximate as causal. Say you had a headache, you took some medicine and your headache went away. It could be the case that your headache went away by itself, as headaches often do.

  4. I took this homeopathic medication and my headache went away. If the ailments are real, placebo effects are a bit mysterious. And mysterious they may be but they are real enough. Not accounting for placebo effects misleads us to ascribe the total effect to the medicine. 

  5. Shallow causation. We ascribe too much weight to immediate causes than to causes that are a few layers deeper.

  6.  Monocausation: In everyday conversations, it is common for people to speak as if x is the only cause of y.

  7.  Big Causation: Another common pitfall is reading x causes y as x causes y to change a lot. This is partly a consequence of mistaking statistical significance with substantive significance, and partly a consequence of us not paying close enough attention to numbers.

  8. Same Effect: Lastly, many people take causal claims to mean that the effect is the same across people. 
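The Simpson's paradox pattern from the Berkeley example above is easy to reproduce with toy numbers (these counts are invented for illustration, not the 1973 data):

```python
# Hypothetical admissions counts: dept -> (applicants, admitted).
men = {"competitive": (100, 20), "easy": (400, 240)}
women = {"competitive": (400, 100), "easy": (100, 65)}

def admit_rate(group, dept=None):
    # Admission rate for one department, or overall when dept is None.
    depts = [dept] if dept else list(group)
    applicants = sum(group[d][0] for d in depts)
    admitted = sum(group[d][1] for d in depts)
    return admitted / applicants
```

Here women beat men within each department (25% vs. 20% in the competitive one, 65% vs. 60% in the easy one) yet trail in the aggregate (33% vs. 52%), purely because they apply disproportionately to the competitive department.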

Routine Maintenance: How to Build Habits

11 Aug

With Mark Paluta

Building a habit means trying to maximize the probability of doing something at some regular cadence.

max [P(do the thing)]

This is difficult because we have time-inconsistent preferences. When asked if we would prefer to run or watch TV next Wednesday afternoon, we are more likely to say run. Arrive Wednesday, and we are more likely to say TV.

Willpower is a weak tool for most of us, so we are better served thinking systematically about what conditions maximize the probability of doing the thing we plan to do. The probability of doing something can be modeled as a function of accountability, external motivation, friction, and awareness of other mental tricks:

P(do the thing) ~ f(accountability, external motivation, friction, other mental tricks)

Accountability: To hold ourselves accountable, at the minimum, we need to record data transparently. Without an auditable record of performance, we are liable to either turn a blind eye to failures or rationalize them away. There are a couple of ways to amplify accountability pressures:

  • Social Pressure: We do not want to embarrass ourselves in front of people we know. This pressures us to do the right thing. So record your commitments and how you follow up on them publicly. Or make a social commitment. “Burn the boats” and tell all your friends you are training for a marathon.
  • Feel the Pain: Donate to an organization you dislike whenever you fail.
  • Enjoy the Rewards: The flip side of feeling the pain is making success sweeter. One way to do that is to give yourself a nice treat if you finish X days of Y.
  • Others Are Counting on You: If you have a workout partner, you are more likely to go because you want to come through for your friend (besides it is more enjoyable to do the activity with someone you like). 
  • Redundant Observation Systems: You can’t just rely on yourself to catch yourself cheating (or just failing). If you have a shared fitness worksheet, others will notice that you missed a day. They can text you a reminder. Automated systems like what we have on the phone are great as well.

External Motivation: We can also rely on our friends to motivate us. One way to capitalize on that is to create a group fitness spreadsheet and encourage each other. For instance, if your friend did not fill in yesterday’s workout, you can text them a reminder or a motivational message.

Nudge: Reduce friction in doing the planned activity. For example, place your phone outside your bedroom before bed or sleep in your running clothes.

Other Mental Tricks: There are two other helpful mental models for building habits. One is momentum, and the other is error correction. 

Momentum: P(do the thing_t+1 | do the thing_t)

Error correction: P(do the thing_t+1 | !do the thing_t)

The best way to build momentum is to track streaks (famously used by Jerry Seinfeld). Not only do you get a reward every time you successfully complete the task, but the longer your streak, the less you want to break it.

Error correction, on the other hand, is turning failure into motivation. Don’t miss two days in a row. Failure is part of the process, but do not let it compound. View the failure as step 0 of the next streak.
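One way to see why both momentum and error correction matter is to treat daily adherence as a two-state Markov chain. If p_m = P(do the thing_t+1 | do the thing_t) and p_e = P(do the thing_t+1 | !do the thing_t), the long-run share of days you do the thing is p_e / (1 - p_m + p_e). A minimal sketch (the probabilities are made up):

```python
def long_run_adherence(p_momentum: float, p_error_correct: float) -> float:
    """Stationary probability of 'doing the thing' in a two-state
    Markov chain with transition probabilities:
      P(do tomorrow | did today)    = p_momentum
      P(do tomorrow | missed today) = p_error_correct
    """
    return p_error_correct / (1 - p_momentum + p_error_correct)

# Hypothetical numbers: streak-tracking alone vs. streak-tracking
# plus a "don't miss two days in a row" rule.
print(long_run_adherence(0.9, 0.3))  # momentum only
print(long_run_adherence(0.9, 0.7))  # momentum + error correction
```

With these made-up numbers, raising the error-correction probability lifts long-run adherence from 75% to 87.5%, even though the momentum term is unchanged.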

What Academics Can Learn From Industry

9 Aug

At its best, industry focuses people. It demands that people use everything at their disposal to solve a problem. It puts a premium on being lean, humble, agnostic, creative, and rigorous. Industry data scientists use qualitative methods, e.g., directly observe processes and people, do lean experimentation, build novel instrumentation, explore relationships between variables, and “dive deep” to learn about the problem. As a result, at any moment, they have a numerical account of the problem space, an idea about the blind spots, the next five places they want to dig, the next five ideas they want to test, and the next five things they want the company to build—things that they know work.

The social science research economy also focuses its participants. Except the focus is on producing broad, novel insights (which may or may not be true) and demonstrating intellectual heft, not on producing cost-effective solutions to urgent problems. The result is a surfeit of poor theories, a misunderstanding of how much those theories explain and how widely they apply, a poor understanding of core social problems, and very few working solutions.

The tide is slowly turning. Don Green, Jens Hainmueller, Abhijit Banerjee, and Esther Duflo, among others, form the avant-garde. Poor Economics by Banerjee and Duflo, in particular, comes closest in spirit to how industry works. It reminds me of how the best start-ups iterate to product-market fit.

Self-Diagnosis

Ask yourself the following questions:

  1. Do you have in your mind a small set of numbers that explain your current understanding of the scale of the problem and some of its solutions?
  2. If you were to get a large sum of money, could you give a principled account of how you would spend it on research?
  3. Do you know what you are excited to learn about the problem (or potential solutions) in the next three months, year, …?

If you are committed to solving a problem, the answer to all the questions would be an unhesitant yes. Why? A numerical understanding of the problem is needed to make judgments about where you need to invest your time and money. It also guides what you would do if you had more money. And a focus on the problem means you have broken down the problem into solved and unsolved portions and know which unsolved portions of the problem you want to solve next. 

How to Solve Problems

Here are some rules of thumb (inspired by Abhijit Banerjee and Esther Duflo):

  1. What Problems to Solve? Work on important problems. The world is full of urgent social problems. Pick one. Calling whatever you are working on important when it has only a vague, multi-hop relation to an important problem doesn’t make it so. This decision isn’t without trade-offs. It is reasonable to fear the consequences of trading endless breadth for some focus. But we have tried the other way, and it is probably as good a time as any to try something else.
  2. Learn About The Problem: Social scientists seem to have more elaborate theories and “original” experiments than descriptions of data. It is time to switch that around. Take, for instance, malnutrition. Before you propose selling cut-rate rice, take a moment to learn whether the key problem the poor face is that they can’t afford the necessary calories or that they don’t get enough calories because they prefer tastier, more expensive calories over a full quota of cheap ones. (This is an example from Poor Economics.)
  3. Learn Theories in the Field: Judging by the output—books and articles—the production of social science seems to be fueled mostly by the flash of insight. But there is only so much you can learn sitting in an armchair. Many key insights will go undiscovered if you don’t go to the field and closely listen and think. Abhijit Banerjee writes: “We then ran a similar experiment across several hundred villages where the goal was now to increase the number of immunized children. We found that gossips convince twice as many additional parents to vaccinate their children as random seeds or “trusted” people. They are about as effective as giving parents a small incentive (in the form of cell-phone minutes) for each immunized child and thus end up costing the government much less. Even though gossips proved incredibly successful at improving immunization rates, it is hard to imagine a policy of informing gossips emerging from conventional policy analysis. First, because the basic model of the decision to get one’s children immunized focuses on the costs and benefits to the family (Becker 1981) and is typically not integrated with models of social learning.”
  4. Solve Small Problems and Earn the Right to Say Big General Things: The mechanism for deriving big theories in academia is the opposite of that used in industry. In much of social science, insights are declared and understood as “general,” and important contextual dependencies are discovered over the years with further research. In industry, a solution is first tested in a narrow area. And then another. And if it works, we scale. The underlying hunch is that coming up with successful applications teaches us more about theory than the current model: come up with the theory first, then produce post hoc rationalizations and add nuances when faced with failed predictions and applications. Going yet further, you could think that the purpose of social science is to find ways to fix problems, and that progress on understanding the problem and on theory is a positive externality.

Suggested Reading + Sites

  1. Poor Economics by Abhijit Banerjee and Esther Duflo
  2. The Economist as Plumber by Esther Duflo
  3. Immigration Lab, which asks, among other questions, why immigrants who are eligible for citizenship do not get citizenship, especially when there are so many economic benefits to it.
  4. Get Out the Vote by Don Green and Alan Gerber
  5. Cronbach (1975) highlights the importance of observation and context. A few memorable quotes:

    “From Occam to Lloyd Morgan, the canon has referred to parsimony in theorizing, not in observing. The theorist performs a dramatist’s function; if a plot with a few characters will tell the story, it is more satisfying than one with a crowded stage. But the observer should be a journalist, not a dramatist. To suppress a variation that might not recur is bad observing.”

    “Social scientists generally, and psychologists, in particular, have modeled their work on physical science, aspiring to amass empirical generalizations, to restructure them into more general laws, and to weld scattered laws into coherent theory. That lofty aspiration is far from realization. A nomothetic theory would ideally tell us the necessary and sufficient conditions for a particular result. Supplied the situational parameters A, B, and C, a theory would forecast outcome Y with a modest margin of error. But parameters D, E, F, and so on, also influence results, and hence a prediction from A, B, and C alone cannot be strong when D, E, and F vary freely.”

    “Though enduring systematic theories about man in society are not likely to be achieved, systematic inquiry can realistically hope to make two contributions. One reasonable aspiration is to assess local events accurately, to improve short-run control (Glass, 1972). The other reasonable aspiration is to develop explanatory concepts, concepts that will help people use their heads.”

Unsighted: Why Some Important Findings Remain Uncited

1 Aug

Poring over the first 500 of the more than 900 Google Scholar citations for Fear and Loathing across Party Lines (as of 7/31/2020), I could not find a single study citing the paper for racial discrimination. You may think the reason is obvious—the paper is about partisan prejudice, not racial prejudice. But a more accurate description is that the paper is best known for describing partisan prejudice but also has powerful evidence on the lack of racial discrimination among white Americans—in fact, there is reasonable evidence of positive discrimination in one study. (I exclude the IAT results, which are weaker than Banaji’s results, showing Cohen’s d ~ .22, because they don’t speak directly to discrimination.)

There are two independent pieces of evidence in the paper about racial discrimination.

Candidate Selection Experiment

“Unlike partisanship where ingroup preferences dominate selection, only African Americans showed a consistent preference for the ingroup candidate. Asked to choose between two equally qualified candidates, the probability of an African American selecting an ingroup winner was .78 (95% confidence interval [.66, .87]), which was no different than their support for the more qualified ingroup candidate—.76 (95% confidence interval [.59, .87]). Compared to these conditions, the probability of African Americans selecting an outgroup winner was at its highest—.45—when the European American was most qualified (95% confidence interval [.26, .66]). The probability of a European American selecting an ingroup winner was only .42 (95% confidence interval [.34, .50]), and further decreased to .29 (95% confidence interval [.20, .40]) when the ingroup candidate was less qualified. The only condition in which a majority of European Americans selected their ingroup candidate was when the candidate was more qualified, with a probability of ingroup selection at .64 (95% confidence interval [.53, .74]).”

Evidence from Dictator and Trust Games

“From Figure 8, it is clear that in comparison with party, the effects of racial similarity proved negligible and not significant—coethnics were treated more generously (by eight cents, 95% confidence interval [–.11, .27]) in the dictator game, but incurred a loss (seven cents, 95% confidence interval [–.34, .20]) in the trust game. There was no interaction between partisan and racial similarity; playing with both a copartisan and coethnic did not elicit additional trust over and above the effects of copartisanship.”

There are two plausible explanations for the lack of citations. Both are easily ruled out. The first is that the quality of evidence for racial discrimination is worse than that for partisan discrimination. Given both claims rest on the same data and research design, that explanation doesn’t work. The second is a difference in the base rates of research production on racial and partisan discrimination. A quick Google search debunks that theory: between 2015 and 2020, I get 135k results for racial discrimination and 17k for partisan polarization. The comparison isn’t exact, but it is good enough to rule out base rates as an explanation. That likely leaves us with two other explanations: a) researchers hesitate to cite results that run counter to their priors or their own results, and b) people are simply unaware of these results.

Addendum (9/26/2021): Why may people be unaware of the results? Here are some lay conjectures (these are general and NOT about the paper I use as an example above; I use that paper only because I am familiar with it—see the p.s. for the reason):

  1. Papers, but especially paper titles and abstracts, are written around a single point because …
    1. Authors believe that this is a more effective way to write papers.
    2. Editors/reviewers recommend that the paper focus on one key finding or not focus on some findings (via Dean Eckles; see the p.s. as well). The reason some of the key results didn’t make the abstract in the paper I use as an example is, as Sean shares, that reviewers thought the results were not strong.
  2. Authors may be especially reluctant to weave in ‘controversial’ supplementary findings in the abstract because …
    1. Sharing certain controversial results may cause reputational harm.
    2. Say the authors want to instill belief in A > B. Say a vast majority of readers have strong priors about: A > B and C > D. Say a method finds A > B and D > C. There are two ways to frame the paper. Talk about A > B and bury D > C. Or start with D > C and then show A > B. Which paper’s findings would be more widely believed?
  3. Papers are read far less often than paper titles and abstracts. And even when people read a paper, they are often doing a ‘motivated search’—looking for the relevant portion of the paper. (Good widely available within article search should principally help here.)

p.s. All of the above is about cases where papers have important supplementary results. But as Dean Eckles points out, sometimes the supplementary results are dropped at reviewers’ request, and sometimes (and this has happened to me), authors never find the energy to publish them elsewhere.

Gaming Measurement: Using Economic Games to Measure Discrimination

31 Jul

Prejudice is the bane of humanity. Measurement of prejudice, in turn, is a bane of social scientists. Self-reports are unsatisfactory. Like talk, they are cheap and thus biased and noisy. Implicit measures don’t even pass the basic hurdle of measurement—reliability. Against this grim background, economic games as measures of prejudice seem promising—they are realistic and capture costly behavior. Habyarimana et al. (HHPW for short), for instance, use the dictator game (they also have a neat variation of it, which they call the ‘discrimination game’) to measure ethnic discrimination. Since then, many others have used the design, including, prominently, Iyengar and Westwood (IW for short). But there are some issues with how economic games have been set up, analyzed, and interpreted:

  1. Revealing identity upfront gives you a ‘no personal information’ estimand: One common aspect of how economic games are set up is that the party/tribe is revealed upfront. Revealing the trait upfront, however, may be sub-optimal. The likelier sequence of interaction and discovery of party/tribe in the world, especially as we move online, is regular interaction followed by discovery. To that end, a game where players interact for a few cycles before an ‘irrelevant’ trait is revealed about them is plausibly more generalizable. What we learn from such games can be provocative—discrimination after a history of fair economic transactions seems dire.
  2. Using data from subsequent movers can bias estimates. “For example, Burnham et al. (2000) reports that 68% of second movers primed by the word “partner” and 33% of second movers primed by the word “opponent” returned money in a single-shot trust game. Taken at face value, the experiment seems to show that the priming treatment increased by 35 percentage-points the rate at which second movers returned money. But this calculation ignores the fact that second movers were exposed to two stimuli, the partner/opponent prime and the move of the first player. The former is randomly assigned, but the latter is not under experimental control and may introduce bias.” (Green and Tusicisny) IW smartly sidestep the concern: “In both games, participants only took the role of Player 1. To minimize round-ordering concerns, there was no feedback offered at the end of each round; participants were told all results would be provided at the end of the study.”
  3. AMCE of conjoint experiments is subtle and subject to assumptions. The experiment in IW is a conjoint experiment: “For each round of the game, players were provided a capsule description of the second player, including information about the player’s age, gender, income, race/ethnicity, and party affiliation. Age was randomly assigned to range between 32 and 38, income varied between $39,000 and $42,300, and gender was fixed as male. Player 2’s partisanship was limited to Democrat or Republican, so there are two pairings of partisan similarity (Democrats and Republicans playing with Democrats and Republicans). The race of Player 2 was limited to white or African American. Race and partisanship were crossed in a 2 × 2, within-subjects design totaling four rounds/Player 2s.” The first subtlety is that the AMCE for partisanship is identified against the distribution of gender, age, race, etc. For generalizability, we may want a distribution close to the real world. As Hainmueller et al. write: “…use the real-world distribution (e.g., the distribution of the attributes of actual politicians) to improve external validity. The fact that the analyst can control how the effects are averaged can also be viewed as a potential drawback, however. In some applied settings, it is not necessarily clear what distribution of the treatment components analysts should use to anchor inferences. In the worst-case scenario, researchers may intentionally or unintentionally misrepresent their empirical findings by using weights that exaggerate particular attribute combinations so as to produce effects in the desired direction.” Second, there is always a chance that it is a particular higher-order combination, e.g., race–PID, that ‘explains’ the main effect.
  4. Skew in outcome variables means that the mean is not a good summary statistic. As you can see in the last line of the first panel of Table 4 (Republican—Republican Dictator Game), if you take out the 20% of people who give $0, the average allocation from the rest is $4.2. HHPW handle this with a variable called ‘egoist’; IW handle it with a separate column tallying the people giving precisely $0.
  5. The presence of ‘white foreigners’ can make people behave more generously. As Dube et al. find, “the presence of a white foreigner increases player contributions by 19 percent.” The point is more general, of course. 
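Point 4 can be illustrated with a quick sketch: with a spike of $0 allocations, the overall mean blends two very different behaviors, so reporting the zero share and the conditional mean separately is more informative (all allocations below are made up):

```python
# Hypothetical allocations in a $10 dictator game: a 20% spike at $0
# and the rest clustered around $4-5 (numbers are made up).
allocations = [0, 0] + [4, 4, 5, 4, 5, 4, 4, 5]

n = len(allocations)
zero_share = sum(a == 0 for a in allocations) / n
nonzero = [a for a in allocations if a > 0]
cond_mean = sum(nonzero) / len(nonzero)
overall_mean = sum(allocations) / n

# The pair (zero_share, cond_mean) describes the two behaviors;
# the overall mean alone hides the point mass at zero.
print(zero_share, cond_mean, overall_mean)
```

Here the overall mean ($3.50) sits well below what any non-egoist actually gives, which is why HHPW and IW report the $0 givers separately.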

With that, here are some things we can learn from economic games in HHPW and IW:

  1. People are very altruistic. In HHPW: “The modal strategy, employed in 25% of the rounds, was to retain 400 USh and to allocate 300 USh to each of the other players. The next most common strategy was to keep 600 USh and to allocate 200 USh to each of the other players (21% of rounds). In the vast majority of allocations, subjects appeared to adhere to the norm that the two receivers should be treated equally. On average, subjects retained 540 shillings and allocated 230 shillings to each of the other players. The modal strategy in the 500 USh denomination game (played in 73% of rounds) was to keep one 500 USh coin and allocate the other to another player. Nonetheless, in 23% of the rounds, subjects allocated both coins to the other players.” In IW, “[of the $10, players allocated] nontrivial amounts of their endowment—a mean of $4.17 (95% confidence interval [3.91, 4.43]) in the trust game, and a mean of $2.88 (95% confidence interval [2.66, 3.10])” (Note: These numbers are hard to reconcile with the numbers in Table 4. One plausible explanation is that these numbers are over the entire population, Table 4 numbers are a subset of partisans, and independents are somewhat less generous than partisans.)
  2. There is no co-ethnic bias. Both HHPW and IW find this. HHPW: “we find no evidence that this altruism was directed more at in-group members than at out-group members. [Table 2]” IW: “From Figure 8, it is clear that in comparison with party, the effects of racial similarity proved negligible and not significant—coethnics were treated more generously (by eight cents, 95% confidence interval [–.11, .27]) in the dictator game, but incurred a loss (seven cents, 95% confidence interval [–.34, .20]) in the trust game.”
  3. A modest proportion of people discriminate against out-partisans. IW: “The average amount allocated to copartisans in the trust game was $4.58 (95% confidence interval [4.33, 4.83]), representing a “bonus” of some 10% over the average allocation of $4.17. In the dictator game, copartisans were awarded 24% over the average allocation.” But it is less dramatic than that. The key change in the dictator game is in the number of people giving $0; the change in the percentage of people giving $0 is 7 percentage points among Democrats. The average amount given to R and D by people who didn’t give $0 is $4.1 and $4.4, respectively, which is a ~7% difference.
  4. More Republicans than Democrats act like ‘homo-economicus.’ I am just going by the proportion of respondents giving $0 in dictator games.
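A useful way to organize the arithmetic in point 3 is to decompose the overall mean into an extensive margin and an intensive margin: E[give] = P(give > 0) x E[give | give > 0]. A sketch with hypothetical shares (not the paper's actual figures):

```python
def mean_allocation(p_zero: float, cond_mean: float) -> float:
    """Overall mean = P(give > 0) * E[give | give > 0]."""
    return (1 - p_zero) * cond_mean

# Hypothetical shares: copartisans face fewer $0 allocations and a
# slightly higher conditional mean. Numbers are illustrative, chosen
# only to mimic the structure of the figures quoted above.
out_party = mean_allocation(p_zero=0.27, cond_mean=4.1)
co_party = mean_allocation(p_zero=0.20, cond_mean=4.4)

print(out_party, co_party, co_party / out_party - 1)
```

With these made-up inputs, the headline gap in overall means is about 18% even though the conditional means differ by only ~7%: most of the gap comes from the extensive margin, i.e., who gives $0 at all.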

p.s. I was surprised that there are no replication scripts or even a codebook for IW. The data had been downloaded 275 times when I checked.

Predicting Reliable Respondents

23 Jul

Setting aside concerns about sampling, the quality of survey responses on popular survey platforms is abysmal (see here and here). Both insincere and inattentive respondents are at issue. A common strategy for identifying inattentive respondents is to use attention checks. However, many of these attention checks stick out like sore thumbs. The upshot is that an experienced respondent can easily spot them. A parallel worry is that inexperienced respondents may be confused by them. To address these concerns, we need a new way to identify inattentive respondents. One way is to measure twice. More precisely, measure immutable or slowly changing traits, e.g., sex, education, etc., twice across closely spaced survey waves. Then, code cases where people switch answers on such traits across the waves as problematic. Finally, use survey items, e.g., self-reports, and metadata, e.g., survey response time, IP addresses, etc., from the first survey to predict problematic switches using modern ML techniques that allow variable selection, like LASSO (space is at a premium). Assuming the predictive relationship holds in new samples, future survey creators can use the variables identified by LASSO to flag likely inattentive respondents.
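A minimal sketch of the switch-coding step described above, assuming two waves keyed by respondent ID (the IDs, traits, and values are hypothetical; the LASSO stage is only indicated in a comment):

```python
# Two closely spaced waves ask the same slow-moving traits.
# A respondent who "switches" sex or education between waves is
# flagged as likely inattentive. (Toy data.)
wave1 = {
    "r1": {"sex": "F", "education": "BA"},
    "r2": {"sex": "M", "education": "HS"},
    "r3": {"sex": "F", "education": "MA"},
}
wave2 = {
    "r1": {"sex": "F", "education": "BA"},
    "r2": {"sex": "F", "education": "HS"},  # switched sex -> flag
    "r3": {"sex": "F", "education": "BA"},  # switched education -> flag
}

def flag_switchers(w1, w2, traits=("sex", "education")):
    """Code a respondent as problematic if any immutable or
    slow-moving trait changes across the two waves."""
    return {
        rid: any(w1[rid][t] != w2[rid][t] for t in traits)
        for rid in w1
    }

flags = flag_switchers(wave1, wave2)
print(flags)
# These flags would then serve as the outcome in a penalized
# regression (e.g., LASSO) on wave-1 items and metadata, so future
# surveys can screen respondents without fielding a second wave.
```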

Self-Recommending: The Origins of Personalization

6 Jul

Recommendation systems are ubiquitous. They determine what videos and news you see, what books and products are ‘suggested’ to you, and much more. If asked about the origins of personalization, my hunch is that some of us would pin it to the advent of the Netflix Prize. Wikipedia goes further back—it puts the first use of the term ‘recommender system’ in 1990. But the history of personalization is much older. It is at least as old as heterogeneous treatment effects (though latent variable models might be a yet more apt starting point). I don’t know how long we have known about heterogeneous treatment effects, but it can be no later than 1957 (Cronbach and Gleser, 1957).

Here’s Ed Haertel:

“I remember some years ago when NetFlix founder Reed Hastings sponsored a contest (with a cash prize) for data analysts to come up with improvements to their algorithm for suggesting movies subscribers might like, based on prior viewings. (I don’t remember the details.) A primitive version of the same problem, maybe just a seed of the idea, might be discerned in the old push in educational research to identify “aptitude-treatment interactions” (ATIs). ATI research was predicated on the notion that to make further progress in educational improvement, we needed to stop looking for uniformly better ways to teach, and instead focus on the question of what worked for whom (and under what conditions). Aptitudes were conceived as individual differences in preparation to profit from future learning (of a given sort). The largely debunked notion of “learning styles” like a visual learner, auditory learner, etc., was a naïve example. Treatments referred to alternative ways of delivering instruction. If one could find a disordinal interaction, such that one treatment was optimum for learners in one part of an aptitude continuum and a different treatment was optimum in another region of that continuum, then one would have a basis for differentiating instruction. There are risks with this logic, and there were missteps and misapplications of the idea, of course. Prescribing different courses of instruction for different students based on test scores can easily lead to a tracking system where high performing students are exposed to more content and simply get further and further ahead, for example, leading to a pernicious, self-fulfilling prophecy of failure for those starting out behind. There’s a lot of history behind these ideas. Lee Cronbach proposed the ATI research paradigm in a (to my mind) brilliant presidential address to the American Psychological Association, in 1957. 
In 1974, he once again addressed the American Psychological Association, on the occasion of receiving a Distinguished Contributions Award, and in effect said the ATI paradigm was worth a try but didn’t work as it had been conceived. (That address was published in 1975.)
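The disordinal interaction Haertel describes can be sketched with hypothetical numbers: if the expected outcome under each treatment is a different linear function of aptitude and the lines cross, the optimal treatment flips at the crossover point, which is exactly the structure a personalization system exploits:

```python
# Hypothetical ATI setup: expected learning outcome as a linear
# function of aptitude (on a 0-1 scale) under two treatments.
def outcome_a(aptitude: float) -> float:
    return 0.2 + 0.8 * aptitude  # treatment A: better for high aptitude

def outcome_b(aptitude: float) -> float:
    return 0.6 + 0.2 * aptitude  # treatment B: better for low aptitude

# Disordinal (crossover) interaction: the lines cross, so the optimal
# treatment depends on where the learner sits on the aptitude scale.
crossover = (0.6 - 0.2) / (0.8 - 0.2)  # aptitude where A and B tie

def assign(aptitude: float) -> str:
    """Personalized assignment rule implied by the interaction."""
    return "A" if outcome_a(aptitude) > outcome_b(aptitude) else "B"

print(crossover, assign(0.2), assign(0.9))
```

If the interaction were ordinal (one line always above the other), personalization would buy nothing; it is the crossover that makes "what works for whom" a meaningful question.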

This episode reminded me of the “longstanding principle in statistics, which is that, whatever you do, somebody in psychometrics already did it long before. I’ve noticed this a few times.”

Reading Cronbach today is also sobering in a way. It shows how ad hoc the investigation of theories and the search for the right policy interventions used to be.

Interacting With Human Decisions

29 Jun

In sport, as in life, luck plays a role. For instance, in cricket, there is a toss at the start of the game, and the team that wins the toss wins the game 3% more often. The 3% estimate, however, likely understates the maximum potential benefit of winning the toss. The team that wins the toss gets to decide whether to bat or bowl first, and 3% equals the maximum benefit only if the toss-winning team always chooses optimally.

The same point applies to estimates of heterogeneity. Say you estimate how the probability of winning varies with the decision to bowl or bat first after winning the toss. (The decision to bowl or bat first is made before the toss.) And say that 75% of the time, the team that wins the toss chooses to bat first and wins these games 55% of the time, while 25% of the time, teams decide to bowl first and win about 47% of those games. The winning rates of 55% and 47% would likely be yet higher if the teams chose optimally.

In the absence of other data, heterogeneous treatment effects give clear guidance on where the payoffs are higher. For instance, if you find that showing an ad on Chrome has a larger treatment effect, barring other information (and concerns), you may want to show ads only to people who use Chrome. But the decision to bowl or bat first is not a traditional “covariate.” It is a dummy that captures human judgment about pre-match observables. The interpretation of the interaction term thus needs care. For instance, in the example above, the 47% winning percentage for teams that decide to bowl first looks ‘wrong’—how can the team that wins the toss lose more often than it wins in some cases? Easy. It can happen because teams decide to bowl precisely in the cases where the probability of winning is lower than 47%. Or it can be that teams are making a bad decision when opting to bowl first.
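The selection logic above can be simulated. In the made-up model below, teams observe a latent win probability and choose to bowl first only in unfavorable matches, so the bowl-first win rate falls below 50% even though every team chooses optimally:

```python
import random

random.seed(0)

# Hypothetical model: each match has a latent win probability p for
# the toss winner. Teams bat first when the match looks favorable and
# bowl first otherwise (a made-up decision rule).
n = 100_000
bat, bowl = [], []
for _ in range(n):
    p = random.random()        # latent P(win) for the toss winner
    win = random.random() < p
    if p > 0.25:               # favorable match: bat first
        bat.append(win)
    else:                      # unfavorable match: bowl first
        bowl.append(win)

print(sum(bat) / len(bat))    # batting-first teams look strong
print(sum(bowl) / len(bowl))  # bowling-first teams look 'wrong' (< 50%)
```

The conditional win rates here reflect which matches each decision selects, not the causal effect of batting versus bowling first.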

Solving Problem Solving: Meta Skills For Problem Solving

21 Jun

Each problem is new in different ways. And mechanically applying specialized tools often doesn’t take you far. So beyond specialized tools, you need meta-skills.

The top meta-skill is learning. Immersing yourself in the area you are thinking about will help you solve problems better and quicker. Learning more broadly helps as well—it enables you to connect dots arrayed in unusual patterns.

Only second to learning is writing. Writing works because it is an excellent tool for thinking. Humans have limited memories, finite processing capacity, are overconfident, and are subject to ‘passions’ of the moment that occlude thinking. Writing reduces the malefic effects of these deficiencies.

By incrementally writing things down, you no longer have to store everything in your head. Having a written copy also means that you can repeatedly go over the contents, which makes focusing on each of the points easier. And having something written means you can ‘scan’ more quickly. Writing things down, thus, also allows you to mix and match and form new combinations more easily.

Just as writing overcomes some of the limitations of our memory, it also improves our computational power. Writing allows us to overcome our finite processing capacity by spreading the computation over time—run an Intel 8088 for long enough, and it can solve reasonably complex problems.

Not all writing, however, will reduce overconfidence or overcome fuzzy thinking. For that, you need to write with the aim of genuine understanding and have enough humility, skepticism, motivation, and patience to see what you don’t know, learn what you don’t know, and apply what you have learned.

To make the most of writing, spread it over time. By distancing yourself from the ‘passions’ of the moment—egoism, being enamored with an idea, etc.—you can see more clearly. Returning to your words later lets you see them with a ‘fresh pair of eyes.’

The third meta-skill is talking. Like writing is not transcribing, talking is not recitation. If you don’t speak, some things will remain unthought. So speak to people. And there is no better set of people to talk to than a diverse set of others, people who challenge your implicit assumptions and give you new ways to think about a problem.

There are tricks to making discussions more productive. The first is separating discussions of problems from discussions of solutions, and separating discussions about alternate solutions from discussions about which solution is better. There are compelling reasons behind the suggestion. If you mix discussions of problems with solutions, people are liable to dismiss real problems because the proposed solutions seem unworkable. The second is getting opinions from the least powerful first—otherwise, they are liable to defer to the more powerful. The third is keeping the tenor of the discussion an “intellectual pursuit of truth,” where getting it right is the only aim.

The fourth meta-skill, implicit in the third but separate from it, is relying on others. We overcome our limitations by relying on others. Knowing how to ask for help is an important skill. Find ways to get help—ask people to read what you have written, offer comments, explain why you are wrong, suggest how they would solve the problem, or point you to relevant literature and other people.

99 Problems: How to Solve Problems

7 Jun

“Data is the new oil,” according to Clive Humby. But we have yet to build an engine that uses the oil efficiently and doesn’t produce a ton of soot. Using data to discover and triage problems is especially polluting. Working with data for well over a decade, I have learned some tricks that produce less soot and more light. Here’s a synopsis of a few of them.

  1. Is the Problem Worth Solving? There is nothing worse than solving the wrong problem. You spend time and money and get less than nothing in return—you squander the opportunity to solve the right problem. So before you turn to solutions, find out if the problem is worth solving.

    To illustrate the point, let’s follow Goji. Goji runs a delivery business. Goji’s business has an apparent problem: the company’s couriers have a habit of delivering late. At first blush, it seems like a big problem. But is it? To answer that, one good place to start is by quantifying how late the couriers arrive. Let’s say that most couriers arrive within 30 minutes of the appointment time. That seems promising, but we still can’t tell whether it is good or bad. To find out, we could ask the customers. But asking customers directly is a bad idea. Even if the customers don’t care about their deliveries running late, it doesn’t cost them a dime to say that they care. Finding out how much they care is better. Find out the least amount of money the customers will happily accept in lieu of a delivery that runs 30 minutes late. It may turn out that most customers don’t care—they will happily accept some trivial amount in lieu of a late delivery. Or it may turn out that customers only care when you deliver frozen or hot food. This still doesn’t give you the full picture. To get yet more clarity on the size of the problem, check how your price-adjusted quality compares to that of other companies.

    Misestimating what customers will pay for something is just one of the ways to end up with the wrong problem. Often, the apparent problem is merely an artifact of measurement error. For instance, it may be that we think the couriers arrive late because our mechanism for capturing arrival is imperfect—couriers deliver on time but forget to tap the button acknowledging they have delivered. Automated check-in based on geolocation may solve the problem. Or incentivizing couriers to be prompt may solve it. But either way, the true problem is not late arrivals but mismeasurement.

    Wrong problems can be found in all parts of problem-solving. During software development, for instance, “[p]rogrammers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs,” according to Donald Knuth. (Knuth called the tendency “premature optimization.”) Worse, Knuth claims that “these attempts at efficiency actually ha[d] a strong negative impact” on how maintainable the code is.

    Often, however, you are not solving the wrong problem. You are just solving it at the wrong time. The conventional workflow of problem-solving is discovery, estimating opportunity, estimating investment, prioritizing, execution, and post-execution discovery, where you begin again. To find out what to focus on now, you need to get through prioritization. There are some rules of thumb, however, that can help you triage. 1. Fix upstream problems before downstream problems; the fixes upstream may make the downstream improvements moot. 2. Estimate the investment and returns against the optimal future workflow; if you don’t, you are committing to scrapping later a lot of what you build today. 3. Even on the best day, estimating the return on investment is a one-significant-digit science. 4. You may find that there is no way to solve the problem with the people you have.
  2. MECE: Mutually exclusive, collectively exhaustive. Management consultants swear by it, so it can’t be a good idea. Right? It turns out that it is. Relentlessly paring down a problem into independent parts is among the most important tricks of the trade. Let’s see it in action. After looking at the data, Goji finds that arriving late is a big problem. So you know that it is the right problem, but you don’t know why your couriers are failing. You apply MECE. You reason that it could be because you have ‘bad’ couriers. Or because you are setting good couriers up for failure. These mutually exclusive, collectively exhaustive parts can be broken down further. In fact, I think there is a law: the number of independent parts into which a problem can be pared down is always one more than you think. Here, for instance, you may be setting couriers up to fail by giving them too little lead time or by not providing them precise directions. If you go down yet another layer, the short lead time may be a result of you taking too long to start looking for a courier or of it taking a long time to find the right courier. So on and so forth. There is no magic to this. But there is no science to it either. MECE tells you what to do but not how to do it. We discuss the how in subsequent points.

  3. Funnel or the Plinko: The layered approach to MECE reminds most data scientists of the ‘funnel.’ Start with 100% and draw your Sankey diagram, a form popularized by Minard’s map of Napoleon’s march on Russia.

    Funnels are powerful tools, capturing two important aspects: how much we lose in each step and where the losses come from. There is, however, one limitation of funnels—the need for categorical variables. When you have continuous variables, you need to be smart about how you discretize. Following the example we have been using, the lead time we give our couriers to pick up an order and deliver it to the customer is one such continuous variable. Rather than break it into arbitrarily granular chunks, it is better to plot how lateness varies by lead time and then cut at the places where the slope changes dramatically.

    There are three things to be cautious about when building and using funnels. The first is that funnels treat correlation as causation. The second is Simpson’s paradox, which deals with issues of aggregation in observational data. And the third is how the coarseness of the funnel can lead to mistaken inferences. For instance, you may not see the true impact of having too little time to find a courier because you raise prices when you have too little time.
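The basic funnel computation needs nothing fancy. Here is a minimal sketch, with made-up stage names and toy data, that computes the share of deliveries surviving to each stage:

```python
from collections import Counter

# Hypothetical stage names; each record lists the stages a delivery completed, in order.
STAGES = ["ordered", "courier_found", "picked_up", "on_time"]

def funnel(records):
    """Return, for each stage, the share of deliveries that reached it."""
    reached = Counter(stage for rec in records for stage in rec)
    return {stage: reached[stage] / len(records) for stage in STAGES}

# Toy data: two on-time deliveries, one late delivery, one never picked up.
records = [
    STAGES,
    ["ordered", "courier_found", "picked_up"],
    ["ordered", "courier_found"],
    STAGES,
]
print(funnel(records))
# {'ordered': 1.0, 'courier_found': 1.0, 'picked_up': 0.75, 'on_time': 0.5}
```

The steep drop between stages tells you where to dig; a real funnel would also split each drop by its causes.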

  4. Systemic Thinking: It pays to know how the cookie is baked. Learn how the data flows through the system and what decisions we make at what point, with what data, under what assumptions, and to what purpose. The conventional tools are flow charts and process tracing. Keeping with our example, say we have a system that lets customers know when we are running late. And let’s assume that not only do we struggle to arrive on time, we also struggle to let people know when we are running late. An engineer may split the problem into an issue with detection or an issue with communication. The detection system may be made up of measuring where the courier is and estimating the time it takes to get to the destination. Either may be broken. And communication issues may stem from problems with sending emails or with delivery, e.g., the email being flagged as spam.

  5. Sample Failures: One way to diagnose problems is to look at a few examples closely. This is a good way to understand what could go wrong. For instance, it may allow you to discover that the locations you are getting from the couriers are wrong because locations received a minute apart are hundreds of miles apart. This can then lead you to the diagnosis that your application is installed on multiple devices, and you cannot distinguish between data emitted by the various devices.
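A minimal sketch of this kind of sanity check, with a hypothetical ping format and a made-up speed threshold, flags consecutive location reports that imply an impossible speed:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    r = 3958.8  # mean earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def implausible_jumps(pings, max_mph=100):
    """Return pairs of timestamps whose implied speed exceeds max_mph.

    pings: list of (timestamp_seconds, lat, lon), sorted by time.
    """
    flags = []
    for (t1, la1, lo1), (t2, la2, lo2) in zip(pings, pings[1:]):
        hours = max(t2 - t1, 1) / 3600  # guard against zero gaps
        if haversine_miles(la1, lo1, la2, lo2) / hours > max_mph:
            flags.append((t1, t2))
    return flags

# Toy pings: New York and Los Angeles a minute apart, clearly impossible.
pings = [(0, 40.71, -74.01), (60, 34.05, -118.24)]
print(implausible_jumps(pings))  # [(0, 60)]
```

Flagged pairs are exactly the kind of example worth reading closely, since they often point to a root cause like multiple devices sharing one account.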

  6. Worst Case: When looking at examples, start with the worst errors. The intuition is simple: the worst errors are often the sites of the most obvious problems.

  7. Correlation Is Causation: To gain more traction, compare the worst with the best. Doing that allows you to see what is different between the two. The underlying idea is, of course, treating correlation as causation, which we are famously warned against. But often enough, correlation points in the right direction.
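A sketch of the worst-versus-best comparison, using made-up delivery records and hypothetical field names, computes how the two groups differ on each candidate driver of lateness:

```python
from statistics import mean

# Hypothetical per-delivery records: minutes late plus candidate drivers of lateness.
deliveries = [
    {"late_min": 55, "lead_time_min": 8,  "distance_mi": 9.0},
    {"late_min": 40, "lead_time_min": 10, "distance_mi": 7.5},
    {"late_min": 5,  "lead_time_min": 45, "distance_mi": 3.0},
    {"late_min": 0,  "lead_time_min": 60, "distance_mi": 2.5},
]

# Compare the worst half with the best half on each candidate driver.
ranked = sorted(deliveries, key=lambda d: d["late_min"], reverse=True)
half = len(ranked) // 2
worst, best = ranked[:half], ranked[half:]

gaps = {
    field: mean(d[field] for d in worst) - mean(d[field] for d in best)
    for field in ("lead_time_min", "distance_mi")
}
print(gaps)  # {'lead_time_min': -43.5, 'distance_mi': 5.5}
```

Here the worst deliveries had far shorter lead times and longer distances, which is suggestive, not proof, of what to fix first.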

  8. Exploit the Skew: The Pareto principle—the 80/20 rule—holds in many places. Look for it. Rather than tackle the entire pie, check whether the opportunity is concentrated in a few small places. It often is. Pursuing our example above, it could be that a small proportion of our couriers account for a majority of the late deliveries. Or it could be that a small number of incorrect addresses are causing most of our late deliveries by waylaying couriers.
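A sketch of the check, with a made-up log of which courier was responsible for each late delivery, sorts couriers by late deliveries and computes the cumulative share each accounts for:

```python
from collections import Counter

# Hypothetical log: the courier responsible for each late delivery.
late_by = ["c7", "c7", "c7", "c7", "c2", "c7", "c2", "c9", "c7", "c2"]

counts = Counter(late_by).most_common()  # couriers sorted by late-delivery count
total = sum(n for _, n in counts)

cumulative, running = [], 0
for courier, n in counts:
    running += n
    cumulative.append((courier, running / total))
print(cumulative)  # [('c7', 0.6), ('c2', 0.9), ('c9', 1.0)]
```

In this toy log, one courier accounts for 60% of late deliveries, so fixing that one courier's route or incentives buys most of the win.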

  9. Under Good Conditions, How Often Do We Fail? How do you know how big an issue a particular problem is? Say, for instance, you want to learn how big a role bad location data plays in your ability to notify customers. Filter to the cases where you have great location data and see how well you do. Then figure out the proportion of cases in which you have great location data.
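A minimal sketch of the computation, with made-up records and hypothetical field names:

```python
# Hypothetical records: was the location data good, and did the notification fail?
records = [
    {"good_location": True,  "notify_failed": False},
    {"good_location": True,  "notify_failed": False},
    {"good_location": True,  "notify_failed": True},
    {"good_location": False, "notify_failed": True},
    {"good_location": False, "notify_failed": True},
    {"good_location": False, "notify_failed": False},
]

good = [r for r in records if r["good_location"]]
fail_given_good = sum(r["notify_failed"] for r in good) / len(good)
share_good = len(good) / len(records)
print(f"failure rate with good data: {fail_given_good:.0%}")  # 33%
print(f"share of cases with good data: {share_good:.0%}")     # 50%
```

The failure rate under good conditions bounds how much fixing location data alone can buy you; the share of good-data cases tells you how much of the pie that bound applies to.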

  10. Dr. House: The good doctor was a big believer in differential diagnosis. Dr. House often eliminated potential options by evaluating how patients responded to different treatment regimens. For instance, he would put people on an antibiotic course to eliminate infection as an option. The more general strategy is experimentation: learn by doing something.

    Experimentation is a sine qua non where people are involved. The impact of code is easy to simulate. But we cannot simulate how much paying $10 per on-time delivery will increase on-time deliveries. We need to experiment.
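One common way to analyze such an experiment is a pooled two-proportion z-test. A sketch with made-up numbers for a hypothetical $10-bonus arm versus control:

```python
import math

def two_prop_ztest(success_a, n_a, success_b, n_b):
    """z statistic for the difference between two proportions (pooled variance)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)  # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical experiment: on-time deliveries without and with the $10 bonus.
z = two_prop_ztest(success_a=700, n_a=1000, success_b=760, n_b=1000)
print(round(z, 2))  # 3.02
```

A z around 3 would be strong evidence the bonus moved on-time rates; whether a 6-point lift justifies $10 per delivery is a separate return-on-investment question.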

Trump Trumps All: Coverage of Presidents on Network Television News

4 May

With Daniel Weitzel.

The US government is a federal system, with substantial domains reserved for local and state governments. For instance, education, most parts of the criminal justice system, and a large chunk of regulation are under the purview of the states. Further, the national government has three co-equal branches: legislative, executive, and judicial. Given these facts, you would expect news coverage to be broad in its coverage of branches and the levels of government. But there is a sharp skew in news coverage of politicians, with members of the executive branch, especially national politicians (and especially the president), covered far more often than other politicians (see here). Exploiting data from the Vanderbilt Television News Archive (VTNA), the largest publicly available database of TV news—over 1M broadcast abstracts spanning 1968 to 2019—we add body to the observation. We searched for references to the president during their presidency and coded each hit as 1. As the figure below shows, references to the president are common. Excluding Trump, on average, a sixth of all abstracts contain a reference to the sitting president. But Trump is different: 60%(!) of abstracts refer to Trump.
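The tallying itself is simple. A toy sketch of the approach, using a handful of invented abstracts in place of the VTNA data:

```python
import re

# Invented stand-ins for broadcast abstracts; the real analysis searches each
# president's name in VTNA abstracts during that president's term.
abstracts = [
    "President Trump signed an executive order today.",
    "The Senate debated the farm bill.",
    "Trump criticized the network coverage.",
    "Local schools reopened after the storm.",
    "White House: Trump to meet advisers.",
]

pattern = re.compile(r"\btrump\b", re.IGNORECASE)
share = sum(bool(pattern.search(a)) for a in abstracts) / len(abstracts)
print(f"{share:.0%}")  # 60%
```

Each abstract counts at most once, matching the hit-coded-as-1 scheme described above.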

Data and scripts can be found here.