A Benchmark For Benchmarks

30 Dec

Benchmark datasets like MNIST, ImageNet, etc., abound in machine learning. Such datasets stimulate work on a problem by providing an agreed-upon mark to beat. Many benchmark datasets, however, are constructed in an ad hoc manner. As a result, it is hard to understand why the best-performing models vary across benchmark datasets (see here), hard to compare models, and hard to confidently prognosticate about performance on a new dataset. To address such issues, in the following paragraphs, we provide a framework for building a good benchmark dataset.

I am looking for feedback. Please let me know how I can improve this.

Time Will Tell

23 Nov

Part of empirical social science is about finding fundamental truths about people. It is a difficult enterprise partly because scientists only observe data in a particular context. Neither cross-sectional variation nor time series that go back, at best, a few tens of years are often enough to yield generalizable truths. Longer observation windows help clarify what is an essential truth and what is, at best, a contextual truth.

Support For Racially Justified and Targeted Affirmative Action

Sniderman and Carmines (1999) find that a large majority of Democrats and Republicans oppose racially justified and targeted affirmative action policies. They find that opposition to racially targeted affirmative action is not rooted in prejudice. Instead, they conjecture that it is rooted in adherence to the principle of equality. The authors don’t say it outright but the reader can surmise that in their view, opposition to racially justified and targeted affirmative action is likely to be continued and broad-based. It is a fair hypothesis. Except 20 years later, a majority of Democrats support racially targeted and racially justified affirmative action in education and hiring (see here).

What’s the Matter with “What’s the Matter with What’s the Matter with Kansas”?

It isn’t clear Bartels was right about Kansas even in 2004 (see here) (and that isn’t to say Thomas Frank was right) but the thesis around education has taken a nosedive. See below.

Split Ticket Voting For Moderation

On the back of record split-ticket voting, Fiorina (and others) theorized that “divided government is the result of a conscious attempt by the voters to achieve moderate policy.” Except that split-ticket voting declined very quickly (with, of course, no commensurate radicalization of the population) (see here).

Effect of Daughters on Legislator Ideology

Having daughters was thought to lead politicians to vote more liberally (see here) but more data suggested that this stopped in the polarized era (see here). Yet more data suggested that there was no trend for legislators with daughters to vote liberally before the era covered by the first study (see here).

Why Social Scientists Fail to Predict Dramatic Social Changes

19 Nov

Soviet specialists are often derided for their inability to see the coming collapse of the Soviet Union. But they were not unique. If you look around, social scientists have had very little handle on many of the big social changes of the past 70 or so years.

  1. Dramatic decline in smoking. “The percentage of adults who smoke tobacco has declined from 42% in 1965 (the first year the CDC measured this), to 12.5% in 2020.” (see here.)
  2. Large infrastructure successes in a corrupt, divided developing nation. Over the last 20 or so years, India has pulled off Aadhaar, UPI, FASTag, etc., dramatically increased the number of electrified villages, the number of people with access to toilets, the length of highways, etc.
  3. Dramatic reductions in prejudice against Italians, the Irish, Asians, Women, African Americans, LGBT, etc. (see here, here, etc.)
  4. Dramatic decline in religion, e.g., church-going, etc., in the West.
  5. Dramatic decline in marriage. “According to the study, the marriage rate in 1970 was at 76.5%, and today, it stands at just over 31%.” (see here.)
  6. Obama or Trump. Not many would have given good odds on America electing a black president in 2006. Or on it electing Trump in 2016.

The list could probably be extended to span all the big social changes. How many would have bet on the success of China? Or, for that matter, Bangladesh, whose HDI is at par with or ahead of its more established South Asian neighbors? Or the dramatic liberalization that is underway in Saudi Arabia? After all, the conventional argument before MBS was that the Saudi monarchy had made a deal with the mullahs and that any change would be met with a strong backlash.

All of that raises the question: why? One reason social scientists fail to predict dramatic social change may be that they think the present reflects the equilibrium. For instance, take racial attitudes. The theories about racial prejudice have mostly been defined by the idea that prejudice is irreducible. The second reason may be that most data social scientists have are cross-sectional or collected over short periods, and there isn’t much you can see (especially about change) from small portholes. The primary evidence they have is about the lack of change, even though the world, looked at over longer time spans, is defined by astounding change on many dimensions. The third reason may be that social scientists suffer from negativity bias. They are focused on explaining what’s wrong with the world and on interpreting data in ways that highlight conventional anxieties. This means that they end up interrogating progress (which is a fine endeavor) but spend too little time acknowledging and explaining real progress. Ideology also likely plays a role. For instance, few notice the long-standing progressive racial bias in television; see here for a fun example of the interpretation gymnastics.

p.s. Often, social scientists not only fail to predict dramatic changes but also struggle to explain what underlies them years later. Worse, social scientists do not seem to change their mental models based on the changes.

p.p.s. So what changes do I predict? I predict a dramatic decline in caste prejudice in India for the following reasons: 1. dramatic generational turnover, 2. urbanization, 3. uninformative last names (outside of local contexts and excluding at most 20% of last names; e.g., the last name ‘Kumar’, which means ‘boy’, is exceedingly common), 4. high intra-group variance in physical features, 5. the preferred strategy of a prominent political party is to minimize intra-Hindu religious differences, 6. the current media and religious elites are mostly against caste prejudice. I also expect fairly rapid declines in prejudice against women (though far less steep than for caste) for some of the same reasons.

Missing Market for Academics

16 Nov

There are a few different options for buying time with industry experts, e.g., https://officehours.com/, https://intro.co/, etc. However, there is no marketplace for buying academics’ time. Some surplus is likely lost as a result. For one, some academics want advice on what they write. To get advice, they have three choices—academic friends, reviewers, or interested academics at conferences and talks. All three have their problems. Or they have to resort to informal arrangements, as Kahneman did.

“He called a young psychologist he knew well and asked him to find four experts in the field of judgment and decision-making, and offer them $2,000 each to read his book and tell him if he should quit writing it. “I wanted to know, basically, whether it would destroy my reputation,” he says. He wanted his reviewers to remain anonymous, so they might trash his book without fear of retribution.”

https://www.vanityfair.com/news/2011/12/michael-lewis-201112

For what it’s worth, Kahneman’s book still had major errors. And that may be the point. Had he had access to a better market, with ratings on reviewers’ ability to vet quantitative material, he may have avoided the errors. A fully fledged market could also let sellers price discriminate based on whether the author is a graduate student or a tenured professor at a top-ranked private university. Such a market may also prove a useful revenue stream for academics with time and talent who want additional money.

Reviewing is but one example. Advice on navigating the academic job market, research design, etc., can all be sold.

Cracking the Code: Addressing Some of the Challenges in Research Software

2 Jul

Macro Concerns

  1. Lack of Incentives for Producing High-Quality Software. Software’s role in enabling and accelerating research cannot be overstated. But the incentives for producing software in academia are still very thin. One reason is that people do not cite the software they use; the academic currency is still citations.
  2. Lack of Ways to Track the Consequences of Software Bugs (Errors). (Quantitative) Research outputs are a function of the code researchers write themselves and the third-party software they use. Let’s assume that the peer review process vets the code written by the researcher. This leaves code written by third-party developers. What precludes errors in third-party code? Not much. The code is generally not peer-reviewed, though there are efforts underway. Conditional on errors being present, there is no easy way to track bugs and their impact on research outputs.
  3. Developers Lack Data on How the Software is Being (Mis)Used. The modern software revolution hasn’t caught up with the open-source research software community. Most open-source research software is still distributed as a binary and emits no logs that can be analyzed by the developer. The only way a developer becomes aware of an issue is when a user reports it. This leaves out errors that don’t cause alerts or failures, e.g., when a user passes data that is inconsistent with the assumptions made when designing the software, as well as other insights about how to improve the software based on usage.

Conventional Reference Lists Are the Wrong Long-Term Solution for #1 and #2

Unlike ideas, which need to be explicitly cited, software dependencies are naturally made explicit in the code. Thus, there is no need for conventional reference lists (~ a bad database). If all the research code is committed to a system like Github (Dataverse lacks the tools for #2) with enough meta information about (the precise version of the) third-party software being used, e.g., import statements in R, etc., we can create a system like the Github dependency graph to calculate the number of times software has been used (and these metrics can be shown on Google Scholar, etc.) and also create systems that trigger warnings to authors when consequential updates to underlying software are made. (See also https://gojiberries.io/2019/03/22/countpy-incentivizing-more-and-better-software/).
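To make the dependency-graph idea concrete, here is a minimal sketch that scans a folder of replication files for import/library statements and tallies package use. The folder name and regexes are illustrative assumptions, not an existing tool:

```python
# A minimal sketch of the dependency-graph idea: scan replication files for
# import/library statements and tally how often each package appears.
# The folder name and the regexes are illustrative assumptions.
import re
from collections import Counter
from pathlib import Path

PATTERNS = {
    ".py": re.compile(r"^\s*(?:import|from)\s+([A-Za-z_]\w*)", re.MULTILINE),
    ".R": re.compile(r"(?:library|require)\(([\w.]+)\)"),
}

counts = Counter()
for path in Path("replication_files").rglob("*"):
    pattern = PATTERNS.get(path.suffix)
    if pattern and path.is_file():
        counts.update(pattern.findall(path.read_text(errors="ignore")))

for package, n in counts.most_common(10):
    print(f"{package}: used in {n} places")
```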

Conventional reference lists may, however, be the right short-term solution. But then the goalpost moves to how to drive citations. One reason researchers do not cite software is that they don’t see others doing it. One way to cue that software should be cited is to show a message when the software is loaded — please cite the software. Such a message can also serve as a reminder for people who merely forget to cite the software. For instance, my hunch is that one of the reasons stargazer has been cited more than 1,000 times (as of June 2023) is that the package produces a message via .onAttach to remind the user to cite it. (See more here.)
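For developers who want to set up the same nudge, the R mechanism referenced above is .onAttach; a rough Python analog (the package name and citation text below are made up) is to print the reminder at import time:

```python
# mypackage/__init__.py -- a rough Python analog of R's .onAttach reminder.
# The package name and citation text are made-up placeholders.
import sys

CITATION = (
    "If you use mypackage in your research, please cite:\n"
    "  Doe, J. (2023). mypackage: Tools for X. Version 1.0."
)

def citation() -> str:
    """Return the citation text so users can pull it up again later."""
    return CITATION

# Printed once at import time, to stderr, so it does not pollute program output.
print(CITATION, file=sys.stderr)
```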

Solution for #3

Spin up a server that open source developers can use to collect logs. Provide tools to collect remote logs. (Sample code.)
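Here is roughly what the client side of that could look like. The endpoint URL, payload fields, and opt-in flag below are assumptions for illustration, not an existing service:

```python
# A sketch of the remote-logging idea: a small client that, with the user's
# consent, posts anonymized usage events to a server the developer runs.
# The endpoint and payload fields are hypothetical.
import json
import platform
import urllib.request

LOG_ENDPOINT = "https://logs.example.org/v1/events"  # hypothetical server

def log_usage(package: str, version: str, event: str, opt_in: bool = False) -> None:
    """Send a minimal, anonymized usage event; do nothing unless the user opted in."""
    if not opt_in:
        return
    payload = {
        "package": package,
        "version": version,
        "event": event,  # e.g., "input_violates_assumed_data_format"
        "python": platform.python_version(),
    }
    request = urllib.request.Request(
        LOG_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(request, timeout=2)
    except OSError:
        pass  # telemetry must never break the user's analysis
```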

p.s. Here’s code for deriving software citation statistics from replication files.

Interpreting Data

26 Sep

It is a myth that data speaks for itself. The analyst speaks for the data. The analyst chooses what questions to ask, what analyses to run, and how they are interpreted and summarized.

I use excerpts from a paper by Gilliam et al. on the media portrayal of crime to walk through one set of choices made by a group of analysts. (The excerpts also highlight the need for reading a paper fully rather than relying on the abstract alone.)

Abstract

From Gilliam et al.; Abstract.

White Violent Criminals Are Overrepresented

From Gilliam et al.; Bottom of page 10.

White Nonviolent Criminals Are Overrepresented

From Gilliam et al.; First paragraph on page 12.

Relative Underrepresentation Between Violent and Nonviolent Crime is a Problem

From Gilliam et al.; Last paragraph on page 12.
From Gilliam et al.; First paragraph on page 13.

Compare the above with the following figure and interpretation from Reaching Beyond Race by Sniderman and Carmines. Rather than focus on the middle two peaks (28 vs. 43), Sniderman and Carmines write: “we were struck by the relative absence of racial polarization.” (Added on 10/4/2023)

The Story of Science: Storytelling Bias in Science

7 Jun

Often enough, scientists are left with the unenviable task of conducting an orchestra with out-of-tune instruments. They are charged with telling a coherent story about noisy results. Scientists defer to the demand partly because there is a widespread belief that a journal article is the appropriate grouping variable at which results should ‘make sense.’

To tell coherent stories with noisy data, scientists resort to a variety of underhanded methods. The first is simply squashing the inconvenient results—never reporting them, leaving them to the appendix, or couching them in the language of the trade, e.g., “the result is only marginally significant” or “the result is marginally significant” or “tight confidence bounds” (without ever talking about the expected effect size). The second is, if good statistics show uncongenial results, to drown the data in bad statistics, e.g., by reporting the difference between a significant and an insignificant effect as itself significant. The third trick is overfitting. A sin in machine learning is a virtue in scientific storytelling. Come up with fanciful theories that could explain the result and make that the explanation. The fourth is to practice the “have your cake and eat it too” method of writing. Proclaim big results at the top and offer a thick word soup in the main text. The fifth is to practice abstinence—abstain from interpreting ‘inconsistent’ results as coming from a lack of power, bad theorizing, or heterogeneous effects.
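On that second move, the underlying statistical point deserves to be spelled out: whether one estimate clears the significance threshold and another does not says little about whether the two estimates differ. A minimal sketch, with made-up numbers, of the test that should be run instead:

```python
# A minimal sketch of why "significant vs. insignificant" is not itself a test.
# The estimates and standard errors below are made up for illustration.
from math import sqrt
from statistics import NormalDist

b1, se1 = 0.50, 0.20   # group A: z = 2.5, p < .05 ("significant")
b2, se2 = 0.20, 0.20   # group B: z = 1.0, p > .05 ("not significant")

# The correct comparison is a test on the difference between the two estimates.
z = (b1 - b2) / sqrt(se1**2 + se2**2)
p = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z for the difference = {z:.2f}, p = {p:.2f}")  # z ~ 1.06, p ~ 0.29
```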

The worst outcome of all of this malaise is that many (expectedly) become better at what they practice—bad science and nimble storytelling.

94.5% Certain That Covid Vaccine Will Be Less Than 94.5% Effective

16 Nov

“On Sunday, an independent monitoring board broke the code to examine 95 infections that were recorded starting two weeks after volunteers’ second dose — and discovered all but five illnesses occurred in participants who got the placebo.”

Moderna Says Its COVID-19 Vaccine Is 94.5% Effective In Early Tests

The data: 90 cases out of ~15k in the control (placebo) group and 5 cases out of ~15k in the treatment group. The base rate (control group) is .6%. When the base rate is so low, it is generally hard to be confident about the ratio (1 – (5/90)). But noise is not the same as bias. One reason to think 94.5% is an overestimate is simply that 94.5% is pretty close to the maximum point on the scale.
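To get a sense of how noisy the ratio is with so few cases, here is a rough simulation sketch. It is not the trial’s actual analysis; the arm sizes and case counts are the interim numbers quoted above:

```python
# A rough sketch of the sampling noise in the efficacy estimate
# 1 - (cases in vaccine arm / cases in placebo arm) when cases are this rare.
# This is an illustration, not the trial's actual statistical analysis.
import numpy as np

rng = np.random.default_rng(0)
n_per_arm = 15_000
p_vaccine, p_placebo = 5 / n_per_arm, 90 / n_per_arm  # observed case rates

vaccine_cases = rng.binomial(n_per_arm, p_vaccine, size=100_000)
placebo_cases = rng.binomial(n_per_arm, p_placebo, size=100_000)
keep = placebo_cases > 0
efficacy = 1 - vaccine_cases[keep] / placebo_cases[keep]

low, median, high = np.percentile(efficacy, [2.5, 50, 97.5])
print(f"Simulated efficacy: median {median:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
# With only five cases in the vaccine arm, the interval is noticeably wide,
# especially on the low side.
```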

The other reason to worry about 94.5% is that the efficacy of the flu vaccine is dramatically lower. (There is a difference in the time horizons over which effectiveness is measured for the flu and for Covid, with Covid being much shorter, but it is a useful caveat to keep in mind when trying to project the effectiveness of the Covid vaccine.)

STEMing the Rot: Does Relative Deprivation Explain Low STEM Graduation Rates at Top Schools?

26 Sep

The following few paragraphs are from Sociation Today:


Using the work of Elliot (et al. 1996), Gladwell compares the proportion of each class which gets a STEM degree compared to the math SAT at Hartwick College and Harvard University.  Here is what he presents for Hartwick:

Students at Hartwick College

STEM Majors      Top Third    Middle Third    Bottom Third
Math SAT         569          472             407
STEM degrees     55.0%        27.1%           17.8%

So the top third of students with the Math SAT as the measure earn over half the science degrees. 

    What about Harvard?   It would be expected that Harvard students would have much higher Math SAT scores and thus the distribution would be quite different.  Here are the data for Harvard:

Students at Harvard University

STEM Majors      Top Third    Middle Third    Bottom Third
Math SAT         753          674             581
STEM degrees     53.4%        31.2%           15.4%

     Gladwell states the obvious, in italics, “Harvard has the same distribution of science degrees as Hartwick,” p. 83. 

    Using his reference theory of being a big fish in a small pond, Gladwell asked Ms. Sacks what would have happened if she had gone to the University of Maryland and not Brown. She replied, “I’d still be in science,” p. 94.


Gladwell focuses on the fact that the bottom-third at Harvard is the same as the top third at Hartwick. And points to the fact that they graduate at very different rates. It is a fine point. But there is more to the data. The top-third at Harvard have much higher SAT scores than the top-third at Hartwick. Why is it the case that they graduate with a STEM degree at the same rate as the top-third at Hartwick? One answer to that is that STEM degrees at Harvard are harder. So harder coursework at Harvard (vis-a-vis Hartwick) is another explanation for the pattern we see in the data and, in fact, fits the data better as it explains the performance of the top-third at Harvard.

Here’s another way to put the point: If preferences for graduating in STEM were solely and almost deterministically explained by Math SAT scores, as Gladwell implicitly assumes, and the major headwinds came from relative standing, then we should see a much higher STEM graduation rate for the top third at Harvard. We should ideally see an intercept shift across schools along with a common differential between the top and the bottom third; instead, we see only the common differential.
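One quick way to see this with the numbers from the two tables above is to ask how well absolute Math SAT alone predicts the STEM share. The pooled linear fit below is an assumption made purely for illustration:

```python
# A sketch using the six data points from the tables above: if absolute Math
# SAT alone drove STEM graduation, SAT should predict the rates across schools.
# The pooled linear fit is an illustrative assumption, not Gladwell's model.
import numpy as np

sat = np.array([569, 472, 407, 753, 674, 581])          # Hartwick, then Harvard
share = np.array([55.0, 27.1, 17.8, 53.4, 31.2, 15.4])  # % earning STEM degrees

slope, intercept = np.polyfit(sat, share, 1)
for s, obs in zip(sat, share):
    pred = slope * s + intercept
    print(f"Math SAT {s}: observed {obs:5.1f}%, predicted from SAT alone {pred:5.1f}%")
# Hartwick's top third (569) and Harvard's bottom third (581) get nearly
# identical predictions yet graduate in STEM at 55.0% and 15.4%, respectively.
```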

The (Mis)Information Age: Measuring and Improving ‘Digital Literacy’

31 Aug

The information age has brought both bounty and pestilence. Today, we are deluged with both correct and incorrect information. If we knew how to tell apart correct claims from incorrect ones, we would have inched that much closer to utopia. But the lack of nous in telling apart generally ‘obvious’ incorrect claims from correct claims has brought us close to the precipice of disarray. Thus, improving people’s ability to identify untrustworthy claims as such takes on urgency.

Before we find fixes, it is good to measure how bad things are and what things are bad. This is the task the following paper sets itself by creating a ‘digital literacy’ scale. (Digital literacy is an overloaded term. It means many different things, from the ability to find useful information, e.g., information about schools or government programs, to the ability to protect yourself against harm online (see here and here for how frequently people’s accounts are breached and how often they put themselves at risk of malware or phishing), to the ability to identify incorrect claims as such, which is how the paper uses it.)

Rather than build a skill assessment kind of a scale, the paper measures (really predicts) skills indirectly using some other digital literacy scales, whose primary purpose is likely broader. The paper validates the importance of various constituent items using variable importance and model fit kinds of measures. There are a few dangers of doing that:

  1. Inference using surrogates is dangerous as the weakness of surrogates cannot be fully explored with one dataset. And they are liable not to generalize as underlying conditions change. We ideally want measures that directly measure the construct.
  2. Variable importance is not the same as important variables. For instance, it isn’t clear why “recognition of the term RSS,” the “highest-performing item by far” has much to do with skill in identifying untrustworthy claims.

Some other work builds uncalibrated measures of digital literacy (conceived as in the previous paper). As part of an effort to judge the efficacy of a particular way of educating people about how to judge untrustworthy claims, the paper provides measures of trust in claims. The topline is that educating people is not hard (see the appendix for the description of the treatment). A minor treatment (see below) is able to improve “discernment between mainstream and false news headlines.”

Understandably, the effects of this short treatment are ‘small.’ The ITT short-term effect in the US is: “a decrease of nearly 0.2 points on a 4-point scale.” Later in the manuscript, the authors provide the substantive magnitude of the .2 pt net swing using a binary indicator of perceived headline accuracy: “The proportion of respondents rating a false headline as “very accurate” or “somewhat accurate” decreased from 32% in the control condition to 24% among respondents who were assigned to the media literacy intervention in wave 1, a decrease of 7 percentage points.” The .2 pt. net swing on a 4-point scale leading to a 7-percentage-point difference is quite remarkable and generally suggests that there is a lot of ‘reverse’ intra-category movement that the crude dichotomization elides. But even if we take the crude categories as the quantity of interest, a month later in the US, the 7-point swing is down to 4 points:

“…the intervention reduced the proportion of people endorsing false headlines as accurate from 33 to 29%, a 4-percentage-point effect. By contrast, the proportion of respondents who classified mainstream news as not very accurate or not at all accurate rather than somewhat or very accurate decreased only from 57 to 55% in wave 1 and 59 to 57% in wave 2.”

Guess et al. 2020

The opportunity to mount more ambitious treatments remains sizable. So does the opportunity to more precisely understand what aspects of the quality of evidence people find hard to discern. And how we could release products that make their job easier.

Survey Experiments With Truth: Learning From Survey Experiments

27 Aug

Tools define science. Not only do they determine how science is practiced but also what questions are asked. Take survey experiments, for example. Since the advent of online survey platforms, which made conducting survey experiments trivial, the lure of convenience and internal validity has persuaded legions of researchers to use survey experiments to understand the world.

Conventional survey experiments are modest tools. Paul Sniderman writes,

“These three limitations of survey experiments—modesty of treatment, modesty of scale, and modesty of measurement—need constantly to be borne in mind when brandishing term experiment as a prestige enhancer.”

Paul Sniderman

Note: We can collapse these three concerns into two— treatment (which includes ‘scale’ as Paul defines it— the amount of time) and measurement.

But skillful artisans have used this modest tool to great effect. Famously, Kahneman and Tversky used survey experiments, e.g., the Asian Disease Problem, to shed light on how people decide. More recently, Paul Sniderman and Tom Piazza have used survey experiments to shed light on an unsavory aspect of human decision making: discrimination. Aside from shedding light on human decision making, researchers have also used survey experiments to understand what survey measures mean, e.g., Ahler and Sood.

The good, however, has come with the bad; insight has often come with irreflection. In particular, Paul Sniderman implicitly points to two common mistakes that people make:

  1. Not Learning From the Control Group. The focus on differences in means means that we sometimes fail to reflect on what the data in the Control Group tells us about the world. Take the paper on partisan expressive responding, for instance. The topline from the paper is that expressive responding explains half of the partisan gap. But it misses the bigger story—the partisan differences in the Control Group are much smaller than what people expect, just about 6.5% (see here). (Here’s what I wrote in 2016.)
  2. Not Putting the Effect Size in Context. A focus on significance testing means that we sometimes fail to reflect on the modesty of effect sizes. For instance, providing people $1 for a correct answer within the context of an online survey interview is a large premium. And if providing a dollar each on 12 (included) questions nudges people from an average of 4.5 correct responses to 5, it suggests that people are resistant to learning or impressively confident that what they know is right. Leaving $7 on the table tells us more than the .5, around which the paper is written. 

    More broadly, researchers often miss the point that sometimes what the results show is how impressively modest the movement is when you ratchet up the dosage. For instance, if an overwhelming number of African Americans favor White candidates who have scored just a few points more than a Black candidate, it is a telling testament to their endorsement of meritocracy.

Nothing to See Here: Statistical Power and “Oversight”

13 Aug

“Thus, when we calculate the net degree of expressive responding by subtracting the acceptance effect from the rejection effect—essentially differencing off the baseline effect of the incentive from the reduction in rumor acceptance with payment—we find that the net expressive effect is negative 0.5%—the opposite sign of what we would expect if there was expressive responding. However, the substantive size of the estimate of the expressive effect is trivial. Moreover, the standard error on this estimate is 10.6, meaning the estimate of expressive responding is essentially zero.”

https://journals.uchicago.edu/doi/abs/10.1086/694258

(Note: This is not a full review of all the claims in the paper. There is more data in the paper than in the quote above. I am merely using the quote to clarify a couple of statistical points.)

There are two main points:

  1. The fact that the estimate is close to zero and the fact that the standard error is very large are technically unrelated. The last line of the quote, however, seems to draw a relationship between the two.
  2. The estimated effect sizes of expressive responding in the literature are much smaller than the standard error. Bullock et al. (Table 2) estimate the effect of expressive responding at about 4% and Prior et al. (Figure 1) at about 5.5% (“Figure 1(a) shows, the model recovers the raw means from Table 1, indicating a drop in bias from 11.8 to 6.3.”). Thus, one reasonable inference is that the study is underpowered to detect expected effect sizes (see the sketch below).
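A rough power calculation makes the second point concrete. Assuming the reported standard error of 10.6 is in the same percentage-point units as the effect sizes from the literature, a sketch:

```python
# A rough power sketch (an approximation, assuming the reported s.e. of 10.6
# is in percentage points, like the effect sizes from the literature).
from statistics import NormalDist

z = NormalDist()
se = 10.6                      # reported standard error
alpha, target_power = 0.05, 0.80

# Minimum detectable effect at 80% power with a two-sided 5% test.
mde = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(target_power)) * se
print(f"Minimum detectable effect: about {mde:.0f} percentage points")  # ~30

# Power to detect the effect sizes reported by Bullock et al. and Prior et al.
for effect in (4.0, 5.5):
    power = z.cdf(effect / se - z.inv_cdf(1 - alpha / 2))
    print(f"Power to detect a {effect}-point effect: about {power:.0%}")  # ~6-7%
```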

What Academics Can Learn From Industry

9 Aug

At its best, industry focuses people. It demands that people use everything at their disposal to solve a problem. It puts a premium on being lean, humble, agnostic, creative, and rigorous. Industry data scientists use qualitative methods, e.g., directly observe processes and people, do lean experimentation, build novel instrumentation, explore relationships between variables, and “dive deep” to learn about the problem. As a result, at any moment, they have a numerical account of the problem space, an idea about the blind spots, the next five places they want to dig, the next five ideas they want to test, and the next five things they want the company to build—things that they know work.

The social science research economy also focuses its participants. Except the focus is on producing broad, novel insights (which may or may not be true) and demonstrating intellectual heft and not on producing cost-effective solutions to urgent problems. The result is a surfeit of poor theories, a misunderstanding of how much the theories explain the issue at hand, and how widely they apply, a poor understanding of core social problems, and very few working solutions. 

The tide is slowly turning. Don Green, Jens Hainmueller, Abhijit Banerjee, and Esther Duflo, among others, form the avant-garde. Poor Economics by Banerjee and Duflo, in particular, comes the closest in spirit to how the industry works. It reminds me of how the best start-ups iterate to product-market fit.

Self-Diagnosis

Ask yourself the following questions:

  1. Do you have in your mind a small set of numbers that explain your current understanding of the scale of the problem and some of its solutions?
  2. If you were to get a large sum of money, could you give a principled account of how you would spend it on research?
  3. Do you know what you are excited to learn about the problem (or potential solutions) in the next three months, year, …?

If you are committed to solving a problem, the answer to all the questions would be an unhesitant yes. Why? A numerical understanding of the problem is needed to make judgments about where you need to invest your time and money. It also guides what you would do if you had more money. And a focus on the problem means you have broken down the problem into solved and unsolved portions and know which unsolved portions of the problem you want to solve next. 

How to Solve Problems

Here are some rules of thumb (inspired by Abhijit Banerjee and Esther Duflo):

  1. What Problems to Solve? Work on Important Problems. The world is full of urgent social problems. Pick one. Calling whatever you are working on important when it has a vague, multi-hop relation to an important problem doesn’t make it so. This decision isn’t without trade-offs. It is reasonable to fear the consequences when we trade endless breadth for some focus. But we have tried it that way, and it is probably as good a time as any to try something else.
  2. Learn About The Problem: Social scientists seem to have more elaborate theory and “original” experiments than descriptions of data. It is time to switch that around. Take, for instance, malnutrition. Before you propose selling cut-rate rice, take a moment to learn whether the key problem the poor face is that they can’t afford the necessary calories or that they don’t get enough calories because they prefer tastier, more expensive calories to a full quota of calories. (This is an example from Poor Economics.)
  3. Learn Theories in the Field: Judging by the output—books, and articles—the production of social science seems to be fueled mostly by the flash of insight. But there is only so much you can learn sitting in an armchair. Many key insights will go undiscovered if you don’t go to the field and closely listen and think. Abhijit Banerjee writes: “We then ran a similar experiment across several hundred villages where the goal was now to increase the number of immunized children. We found that gossips convince twice as many additional parents to vaccinate their children as random seeds or “trusted” people. They are about as effective as giving parents a small incentive (in the form of cell-phone minutes) for each immunized child and thus end up costing the government much less. Even though gossips proved incredibly successful at improving immunization rates, it is hard to imagine a policy of informing gossips emerging from conventional policy analysis. First, because the basic model of the decision to get one’s children immunized focuses on the costs and benefits to the family (Becker 1981) and is typically not integrated with models of social learning.”
  4. Solve Small Problems And Earn the Right to Say Big General Things: The mechanism for deriving big theories in academia is the opposite of that used in the industry. In much of social science, insights are declared and understood as “general.” And important contextual dependencies are discovered over the years with research. In the industry, a solution is first tested in a narrow area. And then another. And if it works, we scale. The underlying hunch is that coming up with successful applications teaches us more about theory than the current model: come up with theory first, and produce post hoc rationalizations and add nuances when faced with failed predictions and applications. Going yet further, you could think that the purpose of social science is to find ways to fix a problem, which leads to more progress on understanding the problem, with theory as a positive externality.

Suggested Reading + Sites

  1. Poor Economics by Abhijit Banerjee and Esther Duflo
  2. The Economist as Plumber by Esther Duflo
  3. Immigration Lab that asks, among other questions, why immigrants who are eligible for citizenship do not get citizenship especially when there are so many economic benefits to it. 
  4. Get Out the Vote by Don Green and Alan Gerber
  5. Cronbach (1975) highlights the importance of observation and context. A couple of memorable quotes:

    “From Occam to Lloyd Morgan, the canon has referred to parsimony in theorizing, not in observing. The theorist performs a dramatist’s function; if a plot with a few characters will tell the story, it is more satisfying than one with a crowded stage. But the observer should be a journalist, not a dramatist. To suppress a variation that might not recur is bad observing.”

    “Social scientists generally, and psychologists, in particular, have modeled their work on physical science, aspiring to amass empirical generalizations, to restructure them into more general laws, and to weld scattered laws into coherent theory. That lofty aspiration is far from realization. A nomothetic theory would ideally tell us the necessary and sufficient conditions for a particular result. Supplied the situational parameters A, B, and C, a theory would forecast outcome Y with a modest margin of error. But parameters D, E, F, and so on, also influence results, and hence a prediction from A, B, and C alone cannot be strong when D, E, and F vary freely.”

    “Though enduring systematic theories about man in society are not likely to be achieved, systematic inquiry can realistically hope to make two contributions. One reasonable aspiration is to assess local events accurately, to improve short-run control (Glass, 1972). The other reasonable aspiration is to develop explanatory concepts, concepts that will help people use their heads.”

Unsighted: Why Some Important Findings Remain Uncited

1 Aug

Poring over the first 500 of the over 900 citations for Fear and Loathing across Party Lines on Google Scholar (7/31/2020), I could not find a single study citing the paper for racial discrimination. You may think the reason is obvious—the paper is about partisan prejudice, not racial prejudice. But a more accurate description is that the paper is best known for describing partisan prejudice but also has powerful evidence on the lack of racial discrimination among white Americans; in fact, there is reasonable evidence of positive discrimination in one study. (I exclude the IAT results, weaker than Banaji’s results, which show Cohen’s d ~ .22, because they don’t speak directly to discrimination.)

Here are the two independent pieces of evidence in the paper about racial discrimination.

Candidate Selection Experiment

“Unlike partisanship where ingroup preferences dominate selection, only African Americans showed a consistent preference for the ingroup candidate. Asked to choose between two equally qualified candidates, the probability of an African American selecting an ingroup winner was .78 (95% confidence interval [.66, .87]), which was no different than their support for the more qualified ingroup candidate—.76 (95% confidence interval [.59, .87]). Compared to these conditions, the probability of African Americans selecting an outgroup winner was at its highest—.45—when the European American was most qualified (95% confidence interval [.26, .66]). The probability of a European American selecting an ingroup winner was only .42 (95% confidence interval [.34, .50]), and further decreased to .29 (95% confidence interval [.20, .40]) when the ingroup candidate was less qualified. The only condition in which a majority of European Americans selected their ingroup candidate was when the candidate was more qualified, with a probability of ingroup selection at .64 (95% confidence interval [.53, .74]).”

Evidence from Dictator and Trust Games

“From Figure 8, it is clear that in comparison with party, the effects of racial similarity proved negligible and not significant—coethnics were treated more generously (by eight cents, 95% confidence interval [–.11, .27]) in the dictator game, but incurred a loss (seven cents, 95% confidence interval [–.34, .20]) in the trust game. There was no interaction between partisan and racial similarity; playing with both a copartisan and coethnic did not elicit additional trust over and above the effects of copartisanship.”

There are two plausible explanations for the lack of citations. Both are easily ruled out. The first is that the quality of evidence for racial discrimination is worse than that for partisan discrimination. Given both claims use the same data and research design, that explanation doesn’t work. The second is that it is a difference in base rates of production of research on racial and partisan discrimination. A quick Google search debunks that theory. Between 2015 and 2020, I get 135k results for racial discrimination and 17k for partisan polarization. It isn’t exact but good enough to rule it out as a possibility for the results I see. This likely leaves us with just two explanations: a) researchers hesitate to cite results that run counter to their priors or their results, b) people are simply unaware of these results.

Addendum (9/26/2021): Why may people be unaware of the results? Here are some lay conjectures (which are general and NOT about the paper I use as an example above; I only use the paper as an example because I am familiar with it. See below on the reason):

  1. Papers, but especially paper titles and abstracts, are written around a single point because …
    1. Authors believe that this is a more effective way to write papers.
    2. Editors/reviewers recommend that the paper focus on one key finding or not focus on some findings — via Dean Eckles (see the p.s. as well). The reason some of the key results didn’t make the abstract in the paper I use as an example is, as Sean shares, that reviewers thought the results were not strong.
  2. Authors may be especially reluctant to weave in ‘controversial’ supplementary findings in the abstract because …
    1. Sharing certain controversial results may cause reputational harm.
    2. Say the authors want to instill belief in A > B. Say a vast majority of readers have strong priors about: A > B and C > D. Say a method finds A > B and D > C. There are two ways to frame the paper. Talk about A > B and bury D > C. Or start with D > C and then show A > B. Which paper’s findings would be more widely believed?
  3. Papers are read far less often than paper titles and abstracts. And even when people read a paper, they are often doing a ‘motivated search’—looking for the relevant portion of the paper. (Good widely available within article search should principally help here.)

p.s. All of the above is about cases where papers have important supplementary results. But as Dean Eckles points out, sometimes the supplementary results are dropped at reviewers’ request, and sometimes (and this has happened to me), authors never find the energy to publish them elsewhere.

Gaming Measurement: Using Economic Games to Measure Discrimination

31 Jul

Prejudice is the bane of humanity. Measurement of prejudice, in turn, is a bane of social scientists. Self-reports are unsatisfactory. Like talk, they are cheap and thus biased and noisy. Implicit measures don’t even pass the basic hurdle of measurement—reliability. Against this grim background, economic games as measures of prejudice seem promising—they are realistic and capture costly behavior. Habyarimana et al. (HHPW for short), for instance, use the dictator game (they also have a neat variation of it, which they call the ‘discrimination game’) to measure ethnic discrimination. Since then, many others have used the design, including, prominently, Iyengar and Westwood (IW for short). But there are some issues with how economic games have been set up, analyzed, and interpreted:

  1. Revealing identity upfront gives you a ‘no personal information’ estimand: One common aspect of how economic games are set up is that the party/tribe is revealed upfront. Revealing the trait upfront, however, may be sub-optimal. The likelier sequence of interaction and discovery of party/tribe in the world, especially as we move online, is regular interaction followed by discovery. To that end, a game where players interact for a few cycles before an ‘irrelevant’ trait is revealed about them is plausibly more generalizable. What we learn from such games can be provocative—discrimination after a history of fair economic transactions seems dire.
  2. Using data from subsequent movers can bias estimates. “For example, Burnham et al. (2000) reports that 68% of second movers primed by the word “partner” and 33% of second movers primed by the word “opponent” returned money in a single-shot trust game. Taken at face value, the experiment seems to show that the priming treatment increased by 35 percentage-points the rate at which second movers returned money. But this calculation ignores the fact that second movers were exposed to two stimuli, the partner/opponent prime and the move of the first player. The former is randomly assigned, but the latter is not under experimental control and may introduce bias.” (Green and Tusicisny) IW smartly sidestep the concern: “In both games, participants only took the role of Player 1. To minimize round-ordering concerns, there was no feedback offered at the end of each round; participants were told all results would be provided at the end of the study.”
  3. AMCE of conjoint experiments is subtle and subject to assumptions. The experiment in IW is a conjoint experiment: “For each round of the game, players were provided a capsule description of the second player, including information about the player’s age, gender, income, race/ethnicity, and party affiliation. Age was randomly assigned to range between 32 and 38, income varied between $39,000 and $42,300, and gender was fixed as male. Player 2’s partisanship was limited to Democrat or Republican, so there are two pairings of partisan similarity (Democrats and Republicans playing with Democrats and Republicans). The race of Player 2 was limited to white or African American. Race and partisanship were crossed in a 2 × 2, within-subjects design totaling four rounds/Player 2s.” The first subtlety is that AMCE for partisanship is identified against the distribution of gender, age, race, etc. For generalizability, we may want a distribution close to the real world. As Hainmueller et al. write: “…use the real-world distribution (e.g., the distribution of the attributes of actual politicians) to improve external validity. The fact that the analyst can control how the effects are averaged can also be viewed as a potential drawback, however. In some applied settings, it is not necessarily clear what distribution of the treatment components analysts should use to anchor inferences. In the worst-case scenario, researchers may intentionally or unintentionally misrepresent their empirical findings by using weights that exaggerate particular attribute combinations so as to produce effects in the desired direction.” Second, there is always a chance that it is a particular higher-order combination, e.g., race–PID, that ‘explains’ the main effect.
  4. Skew in outcome variables means that the mean is not a good summary statistic. As you see in the last line of the first panel of Table 4 (Republican—Republican Dictator Game), if you take out the 20% of the people who give $0, the average allocation from the others is $4.2. HHPW handle this with a variable called ‘egoist’ and IW handle it with a separate column tallying the people who give precisely $0. (A small sketch of this point follows the list.)
  5. The presence of ‘white foreigners’ can make people behave more generously. As Dube et al. find, “the presence of a white foreigner increases player contributions by 19 percent.” The point is more general, of course. 
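On point 4 above, here is a small sketch of why a single mean conflates two margins, the share giving nothing and the generosity of those who give. The allocation vector is made up for illustration; it is not data from either paper.

```python
# A small sketch of point 4: with a spike at $0, one mean conflates two margins.
# The allocations below are made up for illustration, not data from HHPW or IW.
allocations = [0, 0, 0, 4, 5, 3, 6, 4, 5, 3]  # hypothetical dictator-game gifts

share_zero = sum(a == 0 for a in allocations) / len(allocations)
givers = [a for a in allocations if a > 0]

print(f"Share giving $0:    {share_zero:.0%}")                            # 30%
print(f"Mean over everyone: ${sum(allocations) / len(allocations):.2f}")  # $3.00
print(f"Mean among givers:  ${sum(givers) / len(givers):.2f}")            # $4.29
```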

With that, here are some things we can learn from economic games in HHPW and IW:

  1. People are very altruistic. In HHPW: “The modal strategy, employed in 25% of the rounds, was to retain 400 USh and to allocate 300 USh to each of the other players. The next most common strategy was to keep 600 USh and to allocate 200 USh to each of the other players (21% of rounds). In the vast majority of allocations, subjects appeared to adhere to the norm that the two receivers should be treated equally. On average, subjects retained 540 shillings and allocated 230 shillings to each of the other players. The modal strategy in the 500 USh denomination game (played in 73% of rounds) was to keep one 500 USh coin and allocate the other to another player. Nonetheless, in 23% of the rounds, subjects allocated both coins to the other players.” In IW, “[of the $10, players allocated] nontrivial amounts of their endowment—a mean of $4.17 (95% confidence interval [3.91, 4.43]) in the trust game, and a mean of $2.88 (95% confidence interval [2.66, 3.10])” (Note: These numbers are hard to reconcile with numbers in Table 4. One plausible explanation is that these numbers are over the entire population and Table 4 numbers are a subset of partisans, and independents are somewhat less generous than partisans.)
  2. There is no co-ethnic bias. Both HHPW and IW find this. HHPW: “we find no evidence that this altruism was directed more at in-group members than at out-group members. [Table 2]” IW: “From Figure 8, it is clear that in comparison with party, the effects of racial similarity proved negligible and not significant—coethnics were treated more generously (by eight cents, 95% confidence interval [–.11, .27]) in the dictator game, but incurred a loss (seven cents, 95% confidence interval [–.34, .20]) in the trust game.”
  3. A modest proportion of people discriminate against partisans. IW: “The average amount allocated to copartisans in the trust game was $4.58 (95% confidence interval [4.33, 4.83]), representing a “bonus” of some 10% over the average allocation of $4.17. In the dictator game, copartisans were awarded 24% over the average allocation.” But it is less dramatic than that. The key change in the dictator game is the number of people giving $0. The change in the percentage of people giving $0 is 7% among Democrats. So the average amount of money given to R and D by people who didn’t give $0 is $4.1 and $4.4 respectively which is a ~ 7% diff. 
  4. More Republicans than Democrats act like ‘homo-economicus.’ I am just going by the proportion of respondents giving $0 in dictator games.

p.s. I was surprised that there are no replication scripts or even a codebook for IW. The data had been downloaded 275 times when I checked.

Rocks and Scissors for Papers

17 Apr

Zach and Jack* write:

What sort of papers best serve their readers? We can enumerate desirable characteristics: these papers should

(i) provide intuition to aid the reader’s understanding, but clearly distinguish it from stronger conclusions supported by evidence;

(ii) describe empirical investigations that consider and rule out alternative hypotheses [62];

(iii) make clear the relationship between theoretical analysis and intuitive or empirical claims [64]; and

(iv) use language to empower the reader, choosing terminology to avoid misleading or unproven connotations, collisions with other definitions, or conflation with other related but distinct concepts [56].

Recent progress in machine learning comes despite frequent departures from these ideals. In this paper, we focus on the following four patterns that appear to us to be trending in ML scholarship:

1. Failure to distinguish between explanation and speculation.

2. Failure to identify the sources of empirical gains, e.g. emphasizing unnecessary modifications to neural architectures when gains actually stem from hyper-parameter tuning.

3. Mathiness: the use of mathematics that obfuscates or impresses rather than clarifies, e.g. by confusing technical and non-technical concepts.

4. Misuse of language, e.g. by choosing terms of art with colloquial connotations or by overloading established technical terms.

Funnily, Zach and Jack fail to take their own advice, forgetting to distinguish between anecdotal and systematic evidence (they claim a ‘troubling trend’ without presenting systematic evidence for it). But the points they make are compelling. The second and third points are especially applicable to economics, though they apply to a lot of scientific production.


* It is Zachary and Jacob.

Citing Working Papers

2 Apr

Public versions of working papers are increasingly the norm. So are citations to them. But there are three concerns with citing working papers:

  1. Peer review: Peer review improves the quality of papers, but often enough it doesn’t catch serious, basic issues. Thus, a lack of peer review is not as serious a problem as is often claimed.
  2. Versioning: Which version did you cite? Often, there is no canonical versioning system. The best we have is tracking which conference the paper was presented at. This is not good enough.
  3. Availability: Can I check the paper, code, and data for a version? Often enough, the answer is no.

The solution to the latter two is to increase transparency through the entire pipeline. For instance, people can check how my paper with Ken has evolved on Github, including any coding errors that have been fixed between versions. (Admittedly, the commit messages can be improved. Better commit messages—plus descriptions—can make it easier to track changes across versions.)

The first point doesn’t quite deserve addressing in that the current system draws an optimistic line on the quality of published papers. Peer review ought not to end when a paper is published in a journal. If we accept that, then all concerns flagged by peers and non-peers can be addressed in various commits or responses to issues and appropriately credited.

Stemming Link Rot

23 Mar

The Internet gives many things. But none that are permanent. That is about to change. Librarians got together and recently launched https://perma.cc/, which provides permanent links to content.

Why is link rot important?

Here’s an excerpt from a paper by Gertler and Bullock:

“more than one-fourth of links published in the APSR in 2013 were broken by the end of 2014”

If what you are citing evaporates, there is no way to check the veracity of the claim. Journal editors: pay attention!

Sometimes Scientists Spread Misinformation

24 Aug

To err is human. Good scientists are aware of that, painfully so. The model scientist obsessively checks everything twice over and still keeps eyes peeled for loose ends. So it is a shock to learn that some of us are culpable for spreading misinformation.

Ken and I find that articles with serious errors, even articles based on fraudulent data, continue to be approvingly cited—cited without any mention of any concern—long after the problems have been publicized. Using a novel database of over 3,000 retracted articles and over 74,000 citations to these articles, we find that at least 31% of the citations to retracted articles happen a year after the publication of the retraction notice. And that over 90% of these citations are approving.

What gives our findings particular teeth is the role citations play in science. Many, if not most, claims in a scientific article rely on work done by others. And scientists use citations to back such claims. The readers rely on scientists to note any concerns that impinge on the underlying evidence for the claim. And when scientists cite problematic articles without noting any concerns they very plausibly misinform their readers.

Though 74,000 is a large enough number to be deeply concerning, retractions are relatively infrequent. And that may lead some people to discount these results. Retractions may be infrequent but citations to retracted articles post-retraction are extremely revealing. Retractions are a low-low bar. Retractions are often a result of convincing evidence of serious malpractice, generally fraud or serious error. Anything else, for example, a serious error in data analysis, is usually allowed to self-correct. And if scientists are approvingly citing retracted articles after they have been retracted, it means that they have failed to hurdle the low-low bar. Such failure suggests a broader malaise.

To investigate the broader malaise, Ken and I exploited data from an article published in Nature that notes a statistical error in a series of articles published in prominent journals. Once again, we find that approving citations to erroneous articles persist after the error has been publicized. After the error has been publicized, the rate of citation to erroneous articles is, if anything, higher, and 98% of the citations are approving.

In all, it seems, we are failing.

The New Unit of Scientific Production

11 Aug

One fundamental principle of science is that there is no privileged observer. You get to question what people did. But to question, you first must know what people did. So part of good scientific practice is to make it easy for people to understand how the sausage was made—how the data were collected, transformed, and analyzed—and ideally, why you chose to make the sausage that particular way. Papers are ok places for describing all this, but we now have better tools: version controlled repositories with notebooks and readme files.

The barrier to understanding is not just lack of information, but also poorly organized information. There are three different arcs of information: cross-sectional (where everything is and how it relates to each other), temporal (how the pieces evolve over time), and inter-personal (who is making the changes). To be organized cross-sectionally, you need to be macro organized (where is the data, where are the scripts, what do each of the scripts do, how do I know what the data mean, etc.), and micro organized (have logic and organization to each script; this also means following good coding style). Temporal organization in version control simply requires you to have meaningful commit messages. And inter-personal organization requires no effort at all, beyond the logic of pull requests.

The obvious benefits of this new way are known. But what is less discussed is that this new way allows you to critique specific pull requests and decisions made in certain commits. This provides an entirely new way to make progress in science. The new unit of science also means that we don’t just dole out credit in a crude currency like journal articles; we can also provide lower denominations. We can credit each edit, each suggestion. And why not? The third big benefit is that we can build epistemological trees where the logic of disagreement is clear.

The dead tree edition is dead. It is also time to retire the e-version of the dead tree edition.