Bad Remedies for Bad Science

3 Sep

Lack of reproducibility is a symptom of science in crisis. An eye-catching symptom, to be sure, but hardly the only one vying for attention. Recent analyses suggest that nearly two-thirds of the (relevant set of) articles published in prominent political science journals condition on post-treatment variables (see here). Another analysis suggests that half of the relevant set of articles published in prominent neuroscience journals interpret the difference between a significant and a non-significant result as evidence that the difference between the two is itself significant (see here). What is behind this? My guess: poor understanding of statistics, poor editorial processes, and poor strategic incentives.

  1. Poor understanding of statistics among authors.
  2. Poor understanding of statistics among editors, reviewers, etc. This creates two problems:
    • Cannot catch inevitable mistakes: Whatever the failings of authors, they aren’t being caught during the review process. (It would be good to know how often reviewers are the source of bad recommendations.)
    • Creates Bad Incentives: If editors are misinformed, say, primed to look for significant results, authors will be motivated to deliver just that.
      • If you know what the right thing to do is but also know that there is a premium for doing the wrong thing (see the previous point), you may use a lack of transparency as a way to cater to those bad incentives.
  3.  Psychological biases:
    • Motivated Biases: Scientists are likely biased toward their own theories. They wish them to be true. This can lead to selectively applied skepticism and scrutiny. The same principle likely applies to reviewers, who catch on to the storytelling and give a wider pass to stories that jibe with their own views.
  4. Production Pressures: Given production pressures, there is likely sloppiness in what is produced. For instance, it is troubling how often retracted articles are cited after the publication of the retraction notice.
  5. Weak Penalties for Being Sloppy: Without easy ways for others to find mistakes, it is easier to be sloppy.

Given these problems, the big solution I can think of is improving training. Another would be programs that highlight some of the psychological biases and drive clarity on the purpose of science. The troubling part is that the most commonly proposed solution is transparency. As Gelman points out, transparency is neither necessary nor sufficient to prevent the “statistical and scientific problems” that underlie “the scientific crisis” because:

  1. Emphasis on transparency would merely mean transparent production of noise (last column on page 38).
  2. Transparency makes it a tad easier to spot errors but doesn’t provide incentives to learn from errors. And a rule of thumb is to fix upstream issues rather than downstream ones.

Gelman also points out the negative externalities of transparency as a be-all fix. When you focus on transparency, secrecy is conflated with dishonesty.

Unlisted False Negatives: Are 11% of Americans Unlisted?

21 Aug

A recent study by Simon Jackman and Bradley Spahn claims that 11% of Americans are ‘unlisted.’ (The paper has since been picked up by liberal media outlets like Think Progress.)

When I first came across the paper, I thought that the number was much too high for it to have any reasonable chance of being right. My suspicions were roused further by the fact that the paper provided no bounds on the number — no note about measurement error in matching people across imperfect lists. A galling omission when the finding hinges on the name matching procedure, details of which are left to another paper. What makes it to the paper is this incredibly vague line: “ANES collects …. bolstering our confidence in the matches of respondents to the lists.” I take that to mean that the matching procedure was done with the idea of reducing false positives. If so, the estimate is merely an upper bound on the percentage of Americans who could be unlisted. That isn’t a very useful number.

But reality is a bit worse. To my questions about false positive and negative rates, Bradley Spahn responded on Twitter, “I think all of the contentious cases were decided by me. What are my decision-theoretic properties? Hard to say.” That line covers one of the most essential details of the matching procedure, a detail the authors say readers can find “in a companion paper.” The primary issue is subjectivity. But the failure to take adequate account of how these ‘decision-theoretic’ properties bear on the results in the paper also grates.

Optimal Cost Function When the Cost of Misclassification is Higher for the Customer than for the Business

15 Apr

Consider a bank making decisions about loans. For the bank, making lending decisions optimally means minimizing the expected cost of prediction errors, the number of errors times the cost per error, plus the cost of making predictions (keeping things simple here). The cost of any one particular error, especially denial of a loan to an eligible applicant, is typically small for the bank but very consequential for the applicant. So the applicant may be willing to pay the bank to increase the accuracy of its decisions, say, by compensating the bank for the cost of having a person take a closer look at the file. If customers are willing to pay that cost, accuracy rates can increase without reducing profits. (Under some circumstances, a bank may well be able to increase profits.) Customers’ willingness to pay for increased accuracy is typically not exploited by lending institutions. It may be well worth exploring.
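
To make the asymmetry concrete, here is a minimal sketch with entirely made-up numbers (the default probabilities, costs, review cost, and the value of the loan to the applicant are all assumptions): for a borderline applicant, a closer review is worth little to the bank but a lot to the applicant, so a fee that covers the review cost can change the decision without hurting the bank.

```python
# A minimal sketch of the asymmetry, with made-up numbers throughout.
p_default_est = 0.18                 # model's default probability for this applicant
profit_if_good = 1_000               # bank's profit on a repaid loan
loss_if_default = 6_000              # bank's loss on a defaulted loan
value_of_loan_to_applicant = 20_000  # applicant's surplus from getting the loan
review_cost = 200                    # cost of a person taking a closer look

# Bank's expected profit from approving vs. denying at the model's estimate.
approve_ev = (1 - p_default_est) * profit_if_good - p_default_est * loss_if_default
deny_ev = 0.0
print(f"approve EV: {approve_ev:.0f}, deny EV: {deny_ev:.0f}")   # the bank denies

# Suppose a closer review reveals the true risk, which for this borderline file
# is 0.10 with probability 0.5 and 0.25 otherwise (again, assumed numbers).
outcomes = {0.10: 0.5, 0.25: 0.5}

def bank_value(p):
    """Bank's expected value of the better decision (approve or deny) at risk p."""
    return max((1 - p) * profit_if_good - p * loss_if_default, 0.0)

bank_gain = sum(prob * bank_value(p) for p, prob in outcomes.items()) - max(approve_ev, deny_ev)
applicant_gain = sum(prob * value_of_loan_to_applicant
                     for p, prob in outcomes.items() if bank_value(p) > 0)

print(f"a review is worth {bank_gain:.0f} to the bank but "
      f"{applicant_gain:.0f} to the applicant; it costs {review_cost}")
# The bank alone would skip the review; an applicant-paid fee covers it easily.
```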

The Human and the Machine: Semi-automated approaches to ML

12 Apr

For a class of problems, a combination of algorithms and human input makes for the best solution. For instance, the software for recreating shredded documents that won the DARPA challenge three years ago used “human[s] [to] verify what the computer was recommending.” The same insight is used in character recognition tasks. I have used it to create software for matching dirty data; the software was used to merge shapefiles with precinct-level electoral returns.

The class of problems for which human input proves useful has one essential attribute: humans produce unbiased, if error-prone, estimates for these problems. So, for instance, it would be unwise to use humans for making the ‘last mile’ of lending decisions, where human judgment is apt to be biased (see also this NYT article). (And whether human inputs are unbiased is something you may want to verify with training data.)
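
As an illustration of the pattern (not the author's actual software), here is a minimal sketch that scores candidate matches between two dirty lists with a crude string-similarity measure and routes only the uncertain middle band to a human; the threshold values are assumptions.

```python
# A minimal sketch of semi-automated matching: the algorithm handles the clear
# cases, and a human verifies only the uncertain ones.
from difflib import SequenceMatcher

def score(a, b):
    """Crude string-similarity score in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def triage(pairs, accept_at=0.92, reject_at=0.60):
    """Split candidate pairs into auto-accept, auto-reject, and human review."""
    auto_accept, auto_reject, needs_human = [], [], []
    for a, b in pairs:
        s = score(a, b)
        if s >= accept_at:
            auto_accept.append((a, b, s))
        elif s <= reject_at:
            auto_reject.append((a, b, s))
        else:
            needs_human.append((a, b, s))   # a human verifies these
    return auto_accept, auto_reject, needs_human

pairs = [("Precinct 12, Wake Cnty", "Precinct 12, Wake County"),
         ("Pct 7A Durham", "Precinct 7-A, Durham"),
         ("Precinct 3, Orange", "Precinct 30, Orange")]
accepted, rejected, review = triage(pairs)
print(len(accepted), "auto-accepted;", len(review), "sent to a human")
```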

Estimating Hillary’s Missing Emails

11 Apr

Note:

55,000/(365*4) ≈ 37.7 a day. That seems a touch low for a Secretary of State.

Caveats:
1. Clinton may have used more than one private server
2. Clinton may have sent emails from other servers to unofficial accounts of other state department employees

Lower bound for missing emails from Clinton:

  1. Take a small weighted random sample (weighting seniority more) of top state department employees.
  2. Go through their email accounts on the state dep. server and count # of emails from Clinton to their state dep. addresses.
  3. Compare that count to the number of emails to these employees in the released Clinton cache. (A rough sketch of the resulting estimator is below.)
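
A rough sketch of that estimator, with entirely hypothetical counts and seniority weights (employees are drawn with replacement, so the scaling is the standard probability-proportional-to-size estimator):

```python
# A rough sketch of the lower-bound estimator: sample senior State Department
# employees with seniority-weighted probabilities and compare server-side
# counts against the released cache. All numbers below are made up.
import random

random.seed(1)

# Hypothetical frame: (seniority weight, emails from Clinton on the State Dept.
# server, emails from Clinton to this person in the released cache).
frame = [(3, 120, 95), (3, 80, 70), (2, 40, 40), (2, 55, 50),
         (1, 10, 10), (1, 25, 22), (1, 5, 5), (1, 15, 13)]

weights = [w for w, _, _ in frame]
probs = [w / sum(weights) for w in weights]
n_draws = 4

# Hansen-Hurwitz estimator: draw employees with probability proportional to
# seniority (with replacement) and scale each observed shortfall by 1/p_i.
draws = random.choices(range(len(frame)), weights=weights, k=n_draws)
shortfalls = [frame[i][1] - frame[i][2] for i in draws]   # server minus cache
estimate = sum(d / probs[i] for i, d in zip(draws, shortfalls)) / n_draws
print(f"estimated missing emails (lower bound): {estimate:.0f}")
```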

To propose amendments, go to the Github gist

(No) Value Added Models

6 Jul

This note is in response to some of the points raised in the Angoff Lecture by Ed Haertel.

The lecture makes two big points:
1) Teacher effectiveness ratings based on current Value Added Models are ‘unreliable.’ They are actually much worse than just unreliable; see below.
2) Simulated counterfactuals of gains that can be got from ‘firing bad teachers’ are upwardly biased.

Three simple tricks (one discussed; two not) that may solve some of the issues:
1) Estimating teaching effectiveness: Where possible, random assignment of children to classes. I would only do within-school comparisons. Inference will still not be clean (SUTVA violations, though they can be dealt with), just cleaner.

2) Experiment with teachers. Teach some teachers some skills. Estimate the impact. Rather than teacher level VAM, do a skill level VAM. Teachers = sum of skills + idiosyncratic variation.

3) For current VAMs: To create better student-level counterfactuals, use modern ML techniques (SVMs, neural networks, etc.), lots of data (past student outcomes, past classmate outcomes, etc.), and cross-validation to tune. Know how good the predictions actually are. The strategy may be applicable in other settings. (A sketch follows.)
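
A minimal sketch of this trick on synthetic data (numpy and scikit-learn assumed available; the data-generating process, the gradient-boosting model, and the residual-averaging step are illustrative choices, not a recommendation of a particular estimator):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict, cross_val_score

rng = np.random.default_rng(0)
n, n_teachers = 2000, 40

# Synthetic students: prior achievement, classmates' prior achievement, teacher.
past_score = rng.normal(0, 1, n)
classmate_mean = rng.normal(0, 1, n)
teacher = rng.integers(0, n_teachers, n)
true_teacher_effect = rng.normal(0, 0.2, n_teachers)
current_score = (0.7 * past_score + 0.2 * classmate_mean
                 + true_teacher_effect[teacher] + rng.normal(0, 0.5, n))

# Counterfactual model: expected outcome from pre-treatment data only.
X = np.column_stack([past_score, classmate_mean])
model = GradientBoostingRegressor(random_state=0)

# Cross-validation tells us how good the counterfactual prediction is, which is
# the point: know the baseline's quality before crediting residuals to teachers.
print("cross-validated R^2:", cross_val_score(model, X, current_score, cv=5).mean().round(2))

# Out-of-fold predictions, then average residuals by teacher.
residual = current_score - cross_val_predict(model, X, current_score, cv=5)
vam = np.array([residual[teacher == t].mean() for t in range(n_teachers)])
print("corr(estimated, true teacher effect):",
      np.corrcoef(vam, true_teacher_effect)[0, 1].round(2))
```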

Other points:
1) Haertel says, “Obviously, teachers matter enormously. A classroom full of students with no teacher would probably not learn much — at least not much of the prescribed curriculum.” A better comparison perhaps would be to self-guided technology. My sense is that as technology evolves, teachers will come up short in a comparison between teachers and advanced learning tools. In most of the third world, I think it is already true.

2) It appears no model for calculating teacher effectiveness scores yields identified estimates. And it appears we have no clear understanding of the nature of the bias. Pooling biased estimates over multiple years doesn’t recommend itself to me as a natural fix for this situation. And I don’t think calling this situation ‘unreliability’ of scores is right. These scores aren’t valid. The fact that pooling across years ‘works’ may suggest the issues are smaller. But then again, bad things may be happening to some kinds of teachers, especially if people are doing cross-school comparisons.

3) The fade-out concern is important given the earlier 5*5 = 25 analysis. My suspicion would be that attenuation of effects varies with the timing of the shock. My hunch is that shocks at an earlier age matter more; they decay more slowly.

Impact of selection bias in experiments where people treat each other

20 Jun

Selection biases in the participant pool generally have limited impact on inference. One way to estimate the population treatment effect from effects estimated using biased samples is to check whether the treatment effect varies by ‘kinds of people’ and then weight the stratum-specific effects to the population marginals. So far so good. (A sketch of the reweighting is below.)
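
A minimal sketch of that reweighting, with made-up strata, shares, and effects:

```python
# Post-stratify the treatment effect: estimate it within strata and weight the
# stratum estimates to population marginals rather than sample marginals.
sample = {            # stratum: (share of sample, estimated treatment effect)
    "young": (0.60, 4.0),
    "old":   (0.40, 1.0),
}
population_share = {"young": 0.35, "old": 0.65}

sample_effect = sum(share * effect for share, effect in sample.values())
population_effect = sum(population_share[s] * effect for s, (_, effect) in sample.items())
print(f"effect in the (biased) sample: {sample_effect:.2f}")
print(f"effect reweighted to population marginals: {population_effect:.2f}")
```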

When people treat each other, selection biases in the participant pool change the nature of the treatment. For instance, in a Deliberative Poll, a portion of the treatment is other people. Naturally, then, the exact treatment depends on the pool of people. Biases in the initial pool of participants mean the treatment itself is different. For inference, one may exploit across-group variation in composition.

Sampling on M-Turk

13 Oct

In many of the studies that use M-Turk, there appears to be little strategy to sampling. A study is posted (and reposted) on M-Turk till a particular number of respondents take the study. If (1) the pool of respondents reflects true population proportions, (2) people arrive in no particular order, and (3) all kinds of people find the monetary incentive equally attractive, the method should work well. There is reasonable evidence to suggest that at least conditions 1 and 3 are violated. One costly but easy fix for violations of the third condition is to increase payment rates. We can likely do better.

If we are agnostic about the variables on which we want precision, here’s one way to sample: Start with a list of strata and their proportions in the population of interest. If the population of interest is US adults, the proportions are easily known. Set up screening questions, and recruit. Raise the price for cells that are running short. Take simple precautions. For one, to prevent gaming, do not change the recruitment prompt to let people know that you want X kinds of people. (A sketch of the bookkeeping is below.)
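
A minimal sketch of the bookkeeping, with hypothetical strata, targets, and thresholds:

```python
# Quota-based recruiting: track cell counts as respondents screen in, and flag
# cells that are lagging so their pay rate (not the prompt) can be raised.
TARGET_N = 1000
strata_share = {            # assumed population proportions for the strata
    ("18-34", "no college"): 0.18, ("18-34", "college"): 0.12,
    ("35-64", "no college"): 0.28, ("35-64", "college"): 0.22,
    ("65+",   "no college"): 0.12, ("65+",   "college"): 0.08,
}
targets = {cell: round(share * TARGET_N) for cell, share in strata_share.items()}
counts = {cell: 0 for cell in targets}

def screen(age_band, education):
    """Admit a respondent only if their cell still needs people."""
    cell = (age_band, education)
    if counts[cell] >= targets[cell]:
        return "quota full"
    counts[cell] += 1
    return "admit"

def cells_to_boost(progress_floor=0.5, fielded=500):
    """Cells whose fill rate lags the overall fielding rate; raise their pay."""
    overall = fielded / TARGET_N
    return [cell for cell in targets
            if counts[cell] / targets[cell] < progress_floor * overall]

print(screen("18-34", "college"))
print(cells_to_boost(fielded=500))
```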

Why Were the Polls so Accurate?

16 Nov

The Quant. Interwebs have overflowed with joy since the election. Poll aggregation works. And so indeed does polling, though you won’t hear as much about it on the news, which is likely biased towards celebrity intellects rather than the hardworking many. But why were the polls so accurate?

One potential explanation: because they do some things badly. For instance, most fail at collecting “random samples” these days, because of a fair bit of nonresponse bias. This nonresponse bias, if correlated with the propensity to vote, may actually push up the accuracy of the vote choice means. There are a few ways to check this theory.

One way to check this hypothesis: were the results from polls using Likely Voter screens different from those not using them? If not, why not? From the Political Science literature, we know that people who vote (not just those who say they vote) do vary a bit from those who do not vote, even on things like vote choice. For instance, there is just a larger proportion of ‘independents’ among them.

Other kinds of evidence will be in the form of failures to match population or other benchmarks. For instance, election polls would likely fare poorly when predicting how many people voted in each state, or when tallying up Spanish-language households or the number of registered voters. Another way of saying this is that the bias will vary by which parameter we aggregate from these polling data.

So let me reframe the question: how do polls get election numbers right even when they undercount Spanish speakers? One explanation is that there is a positive correlation between selection into polling, and propensity to vote, which makes vote choice means much more reflective of what we will see come election day.

The other possible explanation for all this: post-stratification or other post-hoc adjustment to the numbers, or innovations in how sampling is done (matching, stratification, etc.). Doing so uses additional knowledge about the population and can shrink standard errors and improve accuracy. One way to test for such non-randomness: overly tight confidence bounds. Many polls tend to do wonderfully on multiple uncorrelated variables, for instance, census region proportions, gender, etc., something random samples cannot regularly produce. (A simulated version of this check is below.)
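
A simulated version of the check (not real poll data; the sample size, the census share, and the "published" spread are all assumptions; numpy assumed available):

```python
# Under simple random sampling, demographic shares bounce around with binomial
# noise; polls that hit the census marginal almost exactly, poll after poll,
# are telling you the samples were adjusted, not random.
import numpy as np

rng = np.random.default_rng(0)
n, true_female_share, n_polls = 1000, 0.52, 50

srs_shares = rng.binomial(n, true_female_share, n_polls) / n
print("SRS: mean abs deviation from 52%:",
      round(np.abs(srs_shares - true_female_share).mean(), 4))

# Hypothetical "published" shares that all sit within half a point of the
# census figure -- much tighter than binomial noise at n = 1000 would allow.
published = rng.uniform(0.515, 0.525, n_polls)
print("published polls: mean abs deviation:",
      round(np.abs(published - true_female_share).mean(), 4))
```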

Randomly Redistricting More Efficiently

25 Sep

In a forthcoming article, Chen and Rodden estimate the effect of ‘unintentional gerrymandering’ on the number of seats that go to a particular party. To do so, they pick a precinct at random and then add (randomly chosen) adjacent precincts to it till the district reaches a certain size (decided by the total number of districts one wants to create). Then they create a new district in the same manner, starting from a randomly selected precinct bordering the first district. This goes on till all the precincts are assigned to a district. There are some additional details, but they are immaterial to the point of this note. A smarter way to do the same thing would be to create just one district over and over again (each time starting with a randomly chosen precinct). This would reduce the computational burden (memory for storing edges, differencing shapefiles, etc.) while leaving the estimates unchanged. (A toy version is below.)
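
A toy version of the single-district simulation on a made-up 3x3 precinct adjacency graph (the vote shares and the district size are arbitrary):

```python
# Grow a single contiguous district from a random seed precinct to the target
# size, record it, and repeat -- rather than tiling the whole state each draw.
import random

random.seed(0)

# Hypothetical precinct adjacency list (precinct id -> neighboring precincts).
adjacency = {
    0: [1, 3], 1: [0, 2, 4], 2: [1, 5], 3: [0, 4, 6], 4: [1, 3, 5, 7],
    5: [2, 4, 8], 6: [3, 7], 7: [4, 6, 8], 8: [5, 7],
}

def grow_one_district(size):
    """Randomly grow one contiguous district of `size` precincts."""
    district = {random.choice(list(adjacency))}
    while len(district) < size:
        frontier = [n for p in district for n in adjacency[p] if n not in district]
        if not frontier:          # boxed in; start over
            return grow_one_district(size)
        district.add(random.choice(frontier))
    return district

# Draw many single districts and summarize, e.g., how often a district is
# "won" by party A given hypothetical precinct-level vote shares.
vote_share_a = {p: random.uniform(0.3, 0.7) for p in adjacency}
draws = [grow_one_district(3) for _ in range(1000)]
wins = sum(sum(vote_share_a[p] for p in d) / 3 > 0.5 for d in draws)
print(f"party A wins {wins / 1000:.0%} of simulated districts")
```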

A Potential Source of Bias in Estimating the Impact of Televised Campaign Ads

16 Aug

Or, When Treatment is Strategic, ‘No-Intent-to-Treat’ Intent-to-Treat Effects Can Be Biased

One popular strategy for estimating the impact of televised campaign ads is by exploiting ‘accidental spillover’ (see Huber and Arceneaux 2007). The identification strategy builds on the following facts: Ads on local television can only be targeted at the DMA level. DMAs sometimes span multiple states. Where DMAs span battleground and non-battleground states, ads targeted for residents of battleground states are seen by those in non-battleground states. In short, people in non-battleground states are ‘inadvertently’ exposed to the ‘treatment’. Behavior/Attitudes etc. of the residents who were inadvertently exposed are then compared to those of other (unexposed) residents in those states. The benefit of this identification strategy is that it allows television ads to be decoupled from the ground campaign and other campaign activities, such as presidential visits (though people in the spillover region are exposed to television coverage of the visits). It also decouples ad exposure etc. from strategic targeting of the people based on characteristics of the battleground DMA etc. There is evidence that content, style, the volume, etc. of television ads is ‘context aware’ – varies depending on what ‘DMA’ they run in etc. (After accounting for cost of running ads in the DMA, some variation in volume/content etc. across DMAs within states can be explained by partisan profile of the DMA, etc.)

By decoupling strategic targeting from message volume and content, we only get an estimate of the effect of the ‘treatment’ when it is targeted dumbly. If one wants an estimate of the effect of ‘strategic treatment,’ such quasi-experimental designs relying on accidental spillover may be inappropriate. How, then, to estimate the impact of strategically targeted televised campaign ads: first, estimate how ads are targeted as a function of area and people characteristics (political interest moderates the impact of political ads; see, e.g., Ansolabehere and Iyengar 1995); next, estimate the effect of the messages using the Huber and Arceneaux strategy; and then re-weight the effect estimates using the estimates of how the ads are targeted. (A toy re-weighting is below.)
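
A toy version of the re-weighting step, with made-up subgroup effects and exposure shares:

```python
# Combine subgroup effect estimates from the accidental-spillover design with
# estimates of how heavily a strategic campaign targets each subgroup.
spillover_effect = {       # assumed effect of ad exposure, by political interest
    "low interest": 2.0, "medium interest": 1.0, "high interest": 0.3,
}
dumb_exposure = {"low interest": 1/3, "medium interest": 1/3, "high interest": 1/3}
strategic_exposure = {"low interest": 0.15, "medium interest": 0.35, "high interest": 0.50}

dumb = sum(spillover_effect[g] * w for g, w in dumb_exposure.items())
strategic = sum(spillover_effect[g] * w for g, w in strategic_exposure.items())
print(f"effect of 'dumbly' targeted ads: {dumb:.2f}")
print(f"effect re-weighted to strategic targeting: {strategic:.2f}")
```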

One can also try to estimate the effect of ‘strategy’ by comparing adjusted treatment effect estimates (adjusted by regressing out other campaign activity) in DMAs where the treatment was strategically targeted and in DMAs where it wasn’t.

Sample This

1 Aug

What do single shot evaluations of MT (replace it with anything else) samples (vis-a-vis census figures) tell us? I am afraid very little. Inference rests upon knowledge of the data (here – respondent) generating process. Without a model of the data generating process, all such research reverts to modest tautology – sample A was closer to census figures than sample B on parameters X,Y, and Z. This kind of comparison has a limited utility: as a companion for substantive research. However, it is not particularly useful if we want to understand the characteristics of the data generating process. For even if respondent generation process is random, any one draw (here – sample) can be far from the true population parameter(s).

Even with lots of samples (respondents), we may not be able to say much if the data generation process is variable. Where there is little expectation that the data generation process will be constant, we cannot generalize, and it is hard to see why the MT respondent generation process for political surveys would be constant (it likely depends on the pool of respondents, which in turn perhaps depends on the economy, the incentives offered, the changing lure of those incentives, the content of the survey, etc.). Of course, one way to correct for all of that is to model the variation in the data generating process, but that will require longer observation spans and more attention to potential sources of variation.

Representativeness Heuristic, Base Rates, and Bayes

23 Apr

From the Introduction of their edited volume:
Tversky and Kahneman used the following experiment to test the ‘representativeness heuristic’:

Subjects are shown a brief personality description of several individuals, sampled at random from 100 professionals – engineers and lawyers.
Subjects are asked to assess whether the description is of an engineer or a lawyer.
In one condition, subjects are told the group contains 70 engineers and 30 lawyers; in the other condition, the reverse (70 lawyers and 30 engineers).

Results –
Both conditions produced the same mean probability judgments.

Discussion:
Tversky and Kahneman call this result a ‘sharp violation’ of Bayes Rule.

Counterpoint:
I am not sure the experiment shows any such thing. The mathematical formulation of the objection is simple and boring, so an example instead. Imagine there are red and black balls in an urn. Subjects are asked whether the ball in their hand is black or red under two alternate descriptions of the urn’s composition. When people are completely sure of the color, the urn composition obviously should have no effect. Just because there is only one black ball in the urn (out of, say, 100), it doesn’t mean that the person will start thinking that the black ball in her hand is actually red. So on and so forth. Bayes should be applied by accounting for the uncertainty in the signal. People are typically quite certain about the signal (there seems to be lots of evidence for this, even in the edited volume), and that certainty automatically discounts the urn composition. People may not be violating Bayes’ rule. They may just be feeding the formula incorrect data. (A worked example is below.)
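
A worked example of the counterpoint (the likelihoods are hypothetical): when the description is treated as near-certain evidence, the prior, i.e., the group composition, barely moves the posterior; when the evidence is weak, the prior matters a great deal.

```python
# Bayes' rule with a near-certain vs. a weakly diagnostic description.
def posterior_engineer(prior_engineer, p_desc_given_engineer, p_desc_given_lawyer):
    """P(engineer | description) via Bayes' rule."""
    num = p_desc_given_engineer * prior_engineer
    den = num + p_desc_given_lawyer * (1 - prior_engineer)
    return num / den

for prior in (0.70, 0.30):
    near_certain = posterior_engineer(prior, 0.99, 0.01)   # "sure it's an engineer"
    weak_signal = posterior_engineer(prior, 0.60, 0.40)    # mildly diagnostic description
    print(f"prior {prior:.2f}: near-certain signal -> {near_certain:.3f}, "
          f"weak signal -> {weak_signal:.3f}")
# near-certain: 0.996 vs 0.977 (almost no difference across conditions)
# weak signal:  0.778 vs 0.391 (the base rate matters a lot)
```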

Correcting for Differential Measurement Error in Experiments

14 Feb

Differential measurement error across control and treatment groups, or, in a within-subjects experiment, across pre- and post-treatment measurement waves, can vitiate estimates of the treatment effect. One reason for differential measurement error in surveys is differential motivation. For instance, if participants in the control group (pre-treatment survey) are less motivated to respond accurately than participants in the treatment group (post-treatment survey), the difference-in-means estimator will be a biased estimator of the treatment effect. For example, in Deliberative Polls, participants acquiesce more during the pre-treatment survey than the post-treatment survey (Weiksner, 2008). To correct for it, one may want to replace agree/disagree responses with construct-specific questions (Weiksner, 2008). Perhaps a better solution would be to incentivize all (or a random subset of) responses to the pre-treatment survey. Possible incentives include monetary rewards, adding a preface to the screens telling people how important accurate responses are to the research, etc. This is the same strategy that I advocate for dealing with satisficing more generally (see here): minimizing errors, rather than the more common, more suboptimal strategy of “balancing errors” by randomizing the response order. (A small simulation of the problem is below.)
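
A small simulation of the problem (all parameters made up; numpy assumed): differential acquiescence between waves shows up directly in the naive difference in means.

```python
# If the pre-treatment wave acquiesces more than the post-treatment wave, the
# difference in means picks up the measurement shift as well as the effect.
import numpy as np

rng = np.random.default_rng(0)
n, true_effect = 5000, 0.20
latent_pre = rng.normal(0, 1, n)
latent_post = latent_pre + true_effect
acquiescence_pre, acquiescence_post = 0.15, 0.05   # differential error (assumed)
observed_pre = latent_pre + acquiescence_pre
observed_post = latent_post + acquiescence_post
print("true effect:", true_effect,
      "| naive estimate:", round((observed_post - observed_pre).mean(), 3))
```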

Against Proxy Variables

23 Dec

Lacking direct measures of the theoretical variable of interest, some rely on “proxy variables.” For instance, some have used years of education as a proxy for cognitive ability. However, using “proxy variables” can be problematic for the following reasons — (1) proxy variables may not track the theoretical variable of interest very well, (2) they may track other confounding variables, outside the theoretical variable of interest. For instance, in the case of years of education as a proxy for cognitive ability, the concerns manifest themselves as follows:

  1. Cognitive ability causes, and is a consequence of, what courses you take and what school you go to, in addition to, of course, years of education. The GSS, for instance, contains more granular measures of education, such as whether the respondent took a science course in college. And nearly always such variables prove significant when predicting knowledge, etc. All this is somewhat surmountable, as it can be seen as measurement error.
  2. More problematically, years of education may track other confounding variables, such as diligence, parents’ education, economic stratum, etc. And education endows people with more than cognitive ability; it also causes potentially confounding traits such as civic engagement, knowledge, etc.

Conservatively, we can only attribute the effect of a variable to the variable itself. That is, we only learn about the variables we enter. If one does rely on proxy variables, one may want to address the two points mentioned above.

Impact of Menu on Choices: Choosing What You Want Or Deciding What You Should Want

24 Sep

In Predictably Irrational, Dan Ariely discusses the clever (ex-)subscription menu of The Economist that purportedly manipulates people into subscribing to a pricier plan. In an experiment based on the menu, Ariely shows that the addition of an item to the menu (one that very few choose) can cause a preference reversal over other items in the menu.

Let’s consider a minor variation of Ariely’s experiment. Assume there are two different menus that look as follows:
1. 400 cal, 500 cal.
2. 400 cal, 500 cal, 800 cal.

Assume that all items cost and taste the same. When given the first menu, say 20% choose the 500-calorie item. When selecting from the second menu, the percentage of respondents selecting the 500-calorie item is likely to be significantly greater.

Now, why may that be? One reason may be that people do not have absolute preferences, here for a specific number of calories, and that they make judgments about what a reasonable number of calories is based on the menu. For instance, they decide that they do not want the item with the maximum calorie count. And when presented with a menu with more than two distinct calorie choices, another consideration comes to mind: they do not want too little food either. More generally, they may let the options on the menu anchor what counts as ‘too much’ and what counts as ‘too little.’

If this is true, it can have potentially negative consequences. For instance, McDonald’s has on its menu a Bacon Angus Burger that is about 1360 calories (calories are now displayed on McDonald’s menus courtesy of Richard Thaler). It is possible that people choose higher-calorie items when they see this menu option than when they do not.

More generally, people’s reliance on the menu to discover their own preferences means that marketers can manipulate what is seen as the middle (and hence ‘reasonable’). This also translates to some degree to politics where what is considered the middle (in both social and economic policy) is sometimes exogenously shifted by the elites.

That is but one way a choice on the menu can impact the preference order over other choices. Separately, sometimes a choice can prime people about how to judge other choices. For instance, in a paper exploring the effect of Nader on preferences over Bush and Kerry, researchers find that “[W]hen Nader is in the choice set all voters’ choices are more sharply aligned with their spatial placements of the candidates.”

All this means that assumptions of IIA (independence of irrelevant alternatives) need to be rethought. Adverse conclusions about human rationality are best withheld (see Sen).

Further Reading:

1. R. Duncan Luce and Howard Raiffa. Games and Decisions. John Wiley and Sons, Inc., 1957.
2. Amartya Sen. Internal consistency of choice. Econometrica, 61(3):495–521, May 1993.
3. Amartya Sen. Is the idea of purely internal consistency of choice bizarre? In J.E.J. Altham and Ross Harrison, editors, World, Mind, and Ethics. Essays on the ethical philosophy of Bernard Williams. Cambridge University Press, 1995.

Reconceptualizing the Effect of the Deliberative Poll

6 Sep

A Deliberative Poll proceeds as follows: respondents are surveyed, provided ‘balanced’ briefing materials, randomly assigned to moderated small-group discussions, given the opportunity to quiz experts or politicians in plenary sessions, and re-interviewed at the end. The “effect” is conceptualized as the average of post minus pre across all participants.

The effect of the Deliberative Poll is contingent upon a particular random assignment to small groups. This isn’t an issue if small-group composition doesn’t matter. If it does, then the counterfactual ‘informed public’ that the Poll imagines is particular to one assignment. Under those circumstances, one may want to come up with a distribution of what opinion change would look like if the assignment of participants to small groups were different. One can do this by estimating the impact of small-group composition on the dependent variable of interest and then predicting the dependent variable under simulated alternate assignments.

See also: Adjusting for covariate imbalance in experiments with SUTVA violations

Weighting to Multiple Datasets

27 Aug

Say there are two datasets: one that carries attitudinal variables and demographic variables (dataset 1), and another that carries just demographic variables (dataset 2). Also assume that dataset 2 is the more accurate and larger dataset for demographics (e.g., the CPS). Our goal is to weight a third dataset (dataset 3) so that it is “closest” to the population at large on both socio-demographic characteristics and attitudinal variables. We can proceed in the following manner: weight dataset 1 to dataset 2, and then weight dataset 3 to dataset 1. This amounts to multiplying the weights. One may also impute attitudes for the larger dataset (dataset 2), using a prediction model built on dataset 1, and then use the larger dataset to generalize to the population. (A sketch of the first, weight-multiplication route is below.)
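
A minimal sketch of the two-step weighting on education-by-attitude cells, with made-up shares throughout (in practice one would rake over many margins; this is just the cell-level arithmetic):

```python
# Dataset 1 is weighted to dataset 2 on demographics; dataset 3 is then
# weighted to (weighted) dataset 1, which amounts to multiplying the weights:
# the target cell share is the population demographic share times the
# attitude-within-demographic share taken from dataset 1.

# Dataset 2 (large, accurate, demographics only): population education shares.
demo_pop = {"college": 0.35, "no college": 0.65}

# Dataset 1 (demographics + attitudes): share supporting a policy, by education.
support_in_d1 = {"college": 0.60, "no college": 0.50}

# Dataset 3 (the survey we want to weight): its own cell shares.
d3_cell_share = {("college", "supports"): 0.30, ("college", "opposes"): 0.20,
                 ("no college", "supports"): 0.25, ("no college", "opposes"): 0.25}

weights = {}
for (edu, att), share_d3 in d3_cell_share.items():
    support = support_in_d1[edu]
    target = demo_pop[edu] * (support if att == "supports" else 1 - support)
    weights[(edu, att)] = target / share_d3   # population cell share / sample share

print({cell: round(w, 2) for cell, w in weights.items()})
```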

Size Matters, Significantly

26 Aug

Achieving statistical significance is entirely a matter of sample size. In the frequentist world, we can always distinguish between two samples if we have enough data (except, of course, if the two populations are exactly the same). On the other hand, we may fail to reject even large differences when sample sizes are small. For example, over 13 Deliberative Polls (list at the end), the correlation between the proportion of attitude indices showing significant change and the size of the participant sample is .81 (the rank-ordered correlation is .71). This sharp correlation is suggestive evidence that the average effect is roughly equal across polls (and hence power matters). (A small simulation of the point is below.)
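
A small simulation of the point (a fixed, modest true shift on every index; numpy and scipy assumed available): the share of indices crossing p < .05 rises steeply with the participant sample size.

```python
# Same average effect everywhere; only the sample size changes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect, n_indices = 0.15, 40   # modest standardized shift, 40 attitude indices

for n in (100, 200, 400, 800):
    significant = 0
    for _ in range(n_indices):
        pre = rng.normal(0, 1, n)
        post = pre + rng.normal(true_effect, 1, n)
        significant += stats.ttest_rel(post, pre).pvalue < 0.05
    print(f"n = {n}: {significant / n_indices:.0%} of indices significant")
```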

When the conservative thing to do is to reject the null, for example, in “representativeness” analyses designed to see if the experimental sample is different from the control, one may want to go for large sample sizes, say something about the substantive size of the differences, or ‘adjust’ the results for the differences. If we don’t do that, samples can look more ‘representative’ simply because the sample size is small. So, for instance, the rank-ordered correlation between the proportion of significant differences between non-participants and participants and the size of the smaller (participant) sample, across the 13 polls, is .5. The somewhat low correlation is slightly surprising. It is partly a result of the negative correlation between the size of the participant pool and the average size of the differences.

Polls included: Texas Utilities (CPL, WTU, SWEPCO, HLP, Entergy, SPS, TU, EPE), Europolis 2009, China Zeguo, UK Crime, Australia Referendum, and NIC.

Adjusting for Covariate Imbalance in Experiments with SUTVA Violations

25 Aug

Consider the following scenario: control group is 50% female while the participant sample is 60% female. Also, assume that this discrepancy is solely a matter of chance and that the effect of the experiment varies by gender. To estimate the effect of the experiment, one needs to adjust for the discrepancy, which can be done via matching, regression, etc.

If the effect of the experiment depends on the nature of the participant pool, such adjustments won’t be enough. Part of the effect of Deliberative Polls is a consequence of the pool of respondents. It is expected that the pool matters only in small-group deliberation. Given that people are randomly assigned to small groups, one can exploit the natural variation across groups to estimate how, say, the proportion of females in a group impacts attitudes (the dependent variable of interest). If that relationship is minimal, no adjustments beyond the usual are needed. If, however, there is a strong relationship, one may want to adjust as follows: predict attitudes under simulated groups drawn from a weighted sample, with the probability of selection proportional to the weight. This will give us a distribution, which is correct, as women may be allocated to small groups in a variety of ways. (A sketch is below.)
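
A minimal sketch of that adjustment on simulated data (the pool's composition, the weights, and the composition coefficient are all assumptions; in practice the coefficient comes from the across-group estimation just described; numpy assumed):

```python
# Predict the outcome under many simulated small-group assignments drawn with
# probability proportional to the survey weights.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical participant pool: 60% female (population target: 50%); weights
# therefore favor men. Assumed composition effect: each 10-point rise in a
# group's proportion female shifts the group outcome by 0.5.
n, n_groups, comp_coef = 300, 20, 5.0
female = rng.random(n) < 0.60
weight = np.where(female, 0.5 / 0.6, 0.5 / 0.4)

def simulate_once():
    # Resample a pool with probability proportional to weight, assign to groups.
    pool = rng.choice(n, size=n, replace=True, p=weight / weight.sum())
    groups = rng.permutation(pool).reshape(n_groups, -1)
    prop_female = female[groups].mean(axis=1)
    baseline = 50.0                                      # outcome in an all-male group
    return (baseline + comp_coef * prop_female).mean()   # average predicted outcome

draws = [simulate_once() for _ in range(2000)]
print(f"predicted outcome under re-weighted assignments: {np.mean(draws):.2f} "
      f"(2.5%-97.5%: {np.percentile(draws, 2.5):.2f}, {np.percentile(draws, 97.5):.2f})")
```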

There are many caveats, beginning with limitations of data in estimating the impact of group characteristics on individual attitudes, especially if effects are heterogeneous. Where proportions of subgroups are somewhat small, inadequate variation across small groups can result.

This procedure can be generalized to a variety of cases where the effect is determined by the participant pool except where each participant interacts with the entire sample (or a large proportion of it). Reliability of the generalization will depend on getting good estimates.