Results May Vary


For the past several years, https://i4replication.org/ has been helping replicate papers in prominent economics and political science journals (and two general-interest journals). To find out how well the studies replicate, I scraped the abstracts of the roughly 192 discussion papers based on the replication attempts (see here; a sketch of that kind of scrape follows the list below). I excluded replication attempts that were not targeted at a particular paper, as well as authors' responses. I coded each abstract on three dimensions:

  1. Could we reproduce the results reported in the paper with the code and data provided (excluding notable data and coding errors)?
  2. Are the results "robust," by whatever criteria the replicators used?
    The robustness tests varied considerably across studies. Most commonly, replicators tried alternate specifications (as chosen by the reviewers) and different ways of clustering the standard errors. A few tried replicating on different data, dropping "outliers," etc.
  3. Were the posted data and code complete enough for replicators to replicate independently?
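
As a minimal sketch of the scrape mentioned above, assuming the discussion papers sit behind an index page that links out to pages carrying the abstracts; the URL path, link filter, and meta tag below are hypothetical placeholders, not the site's actual layout:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical index URL; the real site's paths and markup may differ.
INDEX_URL = "https://i4replication.org/discussion-papers"

def fetch_abstracts(index_url: str = INDEX_URL) -> list[dict]:
    """Collect title/abstract pairs from a listing of discussion papers."""
    index = BeautifulSoup(requests.get(index_url, timeout=30).text, "html.parser")
    papers = []
    for link in index.find_all("a", href=True):
        href = urljoin(index_url, link["href"])
        if "discussion" not in href:  # crude filter for paper pages; adjust to the real markup
            continue
        page = BeautifulSoup(requests.get(href, timeout=30).text, "html.parser")
        meta = page.find("meta", attrs={"name": "description"})  # abstracts often live in a meta tag
        papers.append({
            "title": link.get_text(strip=True),
            "abstract": meta["content"].strip() if meta and meta.get("content") else "",
        })
    return papers
```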

I was generous in my coding. For instance, I didn't penalize minor errors that had a "slight" impact on the results. (The coded data is posted here.) Nearly 96% of the results were "computationally reproducible." The robustness tests, as noted above, spanned a wide spectrum, from replicating on new data to checking alternate specifications; about 70% of the results are "robust." Lastly, about 95% of the studies seemed to have data and code in good enough order that replicators could get to the other side without too much hand-holding.
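
The summary numbers are just column means over the coded file. Here's a sketch, with a placeholder file name and column names rather than the ones in the posted data:

```python
import pandas as pd

# Placeholder file and column names; the posted coded data may label these differently.
coded = pd.read_csv("i4r_coded_abstracts.csv")

# Each dimension is coded 1/0 per discussion paper, so the share is just the mean.
for col in ["reproducible", "robust", "data_code_complete"]:
    print(f"{col}: {coded[col].mean():.0%}")
```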

For form's sake, I had to ask ChatGPT to do the same. The corresponding numbers are 88%, 74%, and 95%.
