Hardly a day passes without a major company announcing the release of a new scientific paper or code around a powerful technique. But why do so many companies open source (via papers and code) so many impactful technologies almost as soon as they are invented? The traditional answers—to attract talent, and to generate hype—are not compelling. Let’s start with the size of the pie. Stability AI, based solely on an open-source model quickly raised money at a valuation of $1B. Assuming valuations bake in competitors, lots of money was left on the table in this one case. Next, come to the credit side — literally. What is the value of headlines (credit) during a news cycle, which usually lasts less than a day? As for talent, the price for the pain of not publishing ought not to be that high. And the peculiar thing is that not all companies seem to ooze valuable IP. For instance, prominent technology companies like Apple, Amazon, Netflix, etc. don’t ooze much at all. All that suggests that this is a consequence of poor management. But let’s assume for a second that the tendency was ubiquitous. There could be three reasons for it. First, it could be the case that companies are open-sourcing things they know others will release tomorrow to undercut others or to call dibs on the hype cycle. Another reason could be that they release things for the developer ecosystem on their platform. Except, this just happens not to be true. Another plausible answer is that when technology moves at a really fast pace — what is hard today is easy tomorrow— the window for monetization is small and companies forfeit these small benefits and just skim the hype. (But then, why invest in it in the first place?)
One common variety of empirical papers in finance uses staggered adoption of a policy to identify the impact of the policy on firms registered in the state, e.g., profits, etc. There are a host of problems with such empirical strategies, including, prominently, 1. policy adoption and timing are generally endogenous to firm activity within and across states and 2. it is hard to account for the impact of other policies that may have been adopted after a policy and may explain the outcomes. Recent econometric literature adds to the list of issues by pointing out a serious problem with using two-way fixed effects to analyze such data. I add to the pile. After the first state has adopted a policy, any firm registrations are endogenous to the policy environment—firms choose to locate (relocate) based on state policies. So when analyzing data from such a research design, we may want to fix the cohort of firms to firm-registration status to before the time the first state changed its policy and treat registration changes, etc., as non-compliance.
Say that people can be easily identified by characteristic C. Say that the average tip left by people of group C_A is smaller than !C_A with a wide variance in tipped amounts within each group. Let’s assume that the quality of service (two levels: high or low) is pro-rated by the expected tip amount. Let’s assume that the tip left by a customer is explained by the quality of service. And let’s also assume that the expected tip amount from C_A is low enough to motivate low-quality service. The tip is provided after the service. Assume no-repeat visitation. The optimal strategy for the customer is to not tip. But the service provider notices the departure from rationality from customers and serves accordingly. If the server had complete information about what each person would tip, then the service would be perfectly calibrated by the tipped amount. However, the server can only rely on crude surface cues, like C, and estimate the expected value of the tip. Given that, the optimal strategy for the server is to provide low-quality service to C_A, which would lead to a negative spiral.
It used to be that retail prices of generic products like coffee mugs, soap, etc., moved slowly. Not anymore. On major web retailers like Amazon, for a range of generic household products, the variation in prices over short periods of time is immense. For instance, on 12-Piece Porcelain, 12 Oz. Coffee Mug Set, the price ranged between $20.50 and $35.71 over the last year or so, with a hefty day-to-day variation.
On PCPartPicker, the variation in prices for Samsung SSD is equally impressive. Prices zig-zag on multiple sites (e.g., Dell, Adorama) by $100 over a matter of days multiple times over the last six months. (The cross-site variation—price dispersion—at a particular point in time is also impressive.)
Take another example. Softsoap Liquid Hand Soap, Fresh Breeze – 7.5 Fl Oz (Pack of 6) shows a very high-frequency change between $7.44 and $11. (See also Irish Spring Men’s Deodorant Bar Soap, Original Scent – 3.7 Ounce.)
What explains the within-site over-time variation? One reason could be supply and demand. There are three reasons I am skeptical of the explanation. First, on Amazon, the third-party new item price time series and Amazon price time series do not appear to be correlated (statistics by informal inspection or as one of my statistics professors used to call it—the ocular distortion test—so caveat emptor). On PCPartPicker, you see much the same thing: the cross-retailer price time series frequently crossover. Second, related to the first point, we should see a strong correlation in overtime price curves across substitutes. We do not. Third, the demand for generic household products should be readily forecastable, and the optimal dry good storage strategy is likely not storing just enough. Further, I am skeptical of strong non-linearities in the marginal cost of furnishing an item that is not in the inventory—much of it should be easily replenishable.
The other explanation is price exploration, with Amazon continuously exploring the profit-maximizing price. But this is also unpersuasive. The range over which the prices vary over short periods of time is too large, especially given substitutes and absent collusion. Presumably, companies have thought about the negative consequences of such wide price exploration bands. For instance, you cannot build a reputation as the ‘cheapest’ (unless there is coordination or structural reason for prices to move together.)
So I come empty when it comes to explanations. There is the crazy algorithm theory—as inventory dwindles, Amazon really hikes the price, and when it sees no sales, it brings the price right back down. It may explain the frequent sharp movements over a fixed band that you see in some places but plausibly doesn’t explain a lot of the other patterns we see.
Forget the explanations and let’s engage with the empirical fact. My hunch is that customers are unaware of the striking variation in the prices of many goods. Second, if customers become aware of this, their optimal strategy would be to use sites like CamelCamelCamel or PCPartPicker to pick the optimal time for purchasing a good. If retailers are somehow varying prices to explore profit-maximizing pricing (minus price discrimination based on location, etc.), and if all customers adopt the strategy of timing the purchase, then, in equilibrium, the retailer strategy would reduce to constant pricing.
p.s. I found it funny that there are ‘used product’ listings for soap.
p.p.s. I wrote about the puzzle of price dispersion on Amazon here.
In particular, in 521 villages in Haryana, we provided information on monthly immunization camps to either randomly selected individuals (in some villages) or to individuals nominated by villagers as people who would be good at transmitting information (in other villages). We find that the number of children vaccinated every month is 22% higher in villages in which nominees received the information.From Banerjee et al. 2019
The buildings, which are social units, were randomized to (1) targeting 20% of the women at random, (2) targeting friends of such randomly chosen women, (3) targeting pairs of people composed of randomly chosen women and a friend, or (4) no targeting. Both targeting algorithms, friendship nomination and pair targeting, enhanced adoption of a public health intervention related to the use of iron-fortified salt for anemia.
Coupon redemption reports showed that unadjusted adoption rates were 13.6% (SE = 1.5%) in the friend-targeted clusters, 11.2% (SE = 1.4%) in pair-targeted clusters, 9.1% (SE = 1.3%) in the randomly targeted clusters, and 0% in the control clusters receiving no intervention.From Alexander et al. 2022
Here’s a Twitter thread on the topic by Nicholas Christakis.
Targeting “structurally influential individuals,” e.g., people with lots of friends, people who are well regarded, etc., can lead to larger returns per ‘contact.’ This can be a useful thing. And as the studies demonstrate, finding these influential people is not hard—just ask a few people. There are, however, a few concerns:
- One of the concerns with any targeting strategy is that it can change who is treated. When you use network-based targeting, it biases the treated sample toward those who are more connected. That could be a good thing, especially if returns are the highest on those with the most friends, like in the case of curbing contagious diseases, or it could be a bad thing if the returns are the greatest on the least connected people. The more general point here is that most ROI calculations for network targeting have only accounted for costs of contact and assumed the benefits to be either constant or increasing in network size. One can easily rectify this by specifying the ROI function more fully or adding “fairness” or some kind of balance as a constraint.
- There is some stochasticity that stems from which person is targeted, and their idiosyncratic impact needs to be baked into standard error calculations for the ‘treatment,’ which is the joint of whatever the experimenters are doing and what the individual chooses to do with the experimenter’s directions (compliance needs a more careful definition). Interventions with targeting are liable to have thus more variable effects than without targeting and plausibly need to be reproduced more often before they used as policy.
This is a review of Noise, A Flaw in Human Judgment by Kahneman, Sibony, and Sunstein.
The phrase “noise in decision making” brings to mind “random” error. Scientists, however, shy away from random error. Science is mostly about systematic error, except, perhaps, quantum physics. So Kahneman et al. conceive of noise as seemingly random error that is a result of unmeasured biases. For instance, research suggests that heat causes bad mood. And bad mood may, in turn, cause people to judge more harshly. If this were to hold, the variability in judging stemming from the weather can end up being interpreted as noise. But, as is clear, there is no “random” error, merely bias. Kahneman et al. make a hash of this point. Early on, they give the conventional formula of total expected error as the sum of bias and variance (they don’t further decompose variance into irreducible error and ‘random’ error) with the aim of talking about the two separately, and naturally, never succeed in doing that.
The conceptual issues ought not detract us from the important point of the book. It is useful to think about human judgment systems as mathematical functions. We should expect the same inputs to map to the same output. It turns out that it isn’t even remotely true in most human decision-making systems. Take insurance underwriting, for instance. Given the same data (realistic but made-up information about cases), the median percentage difference between quotes between any pair of underwriters is an eye-watering 55% (which means that for half of the cases, it is worse than 55%), about five times as large as expected by the executives. There are a few interesting points that flow from this data. First, if you are a customer, your optimal strategy is to get multiple quotes. Second, what explains ignorance about the disagreement? There could be a few reasons. First, when people come across a quote from another underwriter, they may ‘anchor’ their estimate on the number they see, reducing the gap between the number and the counterfactual. Second, colleagues plausibly read to agree—less effort and optimizing for collegiality, asking, “Could this make sense?”, than read to evaluate, “Does this make sense?” (see my notes for a fuller set of potential explanations.)
Data from asylum reviews is yet starker. “A study of cases that were randomly allotted to different judges found that one judge admitted 5% of applicants, while another admitted 88%.” (Paper.)
Variability can stem from only two things. It could be that the data doesn’t allow for a unique judgment (irreducible error). (But even here, the final judgment should reflect the uncertainty in the data.) Or that at least one person is ‘wrong’ (has a different answer than others). Among other things, this can stem from:
- variation in skill, e.g., how to assess patent applications
- variation in effort, e.g., some people put more effort than others
- agency and preferences, e.g., I am a conservative judge, and I can deny an asylum application because I have the power to do so
- biases like using irrelevant information, e.g., weather, hypoglycemia, etc.
(Note: a lack of variability doesn’t mean we are on to the right answer.)
The list of proposed solutions is extensive—from selecting better judges to the wisdom of the crowds to using models to training people better to more elaborate schemes like dividing the decision task and asking people to make relative than absolute judgments. The evidence backing the solutions is not always hefty, which meshes with the ideolog-like approach to evidence present everywhere in the book. When I did a small audit of the citations, three things stood out (the overarching theme is adherence to the “No Congenial Result Scrutinized or Left Uncited Act”):
- Extremely small n studies cited without qualification. Software engineers.
Quote from the book: “when the same software developers were asked on two separate days to estimate the completion time for the same task, the hours they projected differed by 71%, on average.”
The underlying paper: “In this paper, we report from an experiment where seven experienced software professionals estimated the same sixty software development tasks over a period of three months. Six of the sixty tasks were estimated twice.
- Extremely small n studies cited without qualification. Israeli Judges.
Hypoglycemia and judgment: “Our data consist of 1,112 judicial rulings, collected over 50 d in a 10-mo period, by eight Jewish-Israeli judges (two females) who preside over two different parole boards that serve four major prisons in Israel.”
- Surprising but likely unreplicable results. “When calories are on the left, consumers receive that information first and evidently think “a lot of calories!” or “not so many calories!” before they see the item. Their initial positive or negative reaction greatly affects their choices. By contrast, when people see the food item first, they apparently think “delicious!” or “not so great!” before they see the calorie label. Here again, their initial reaction greatly affects their choices. This hypothesis is supported by the authors’ finding that for Hebrew speakers, who read right to left, the calorie label has a significantly larger impact..” (Paper.)
“We show that if the effect sizes in Dallas et al. (2019) are representative of the populations, a replication of the six studies (with the same sample sizes) has a probability of only 0.014 of producing uniformly significant outcomes.” (Paper.)
- Citations to HBR. Citations to think pieces in Harvard Business Review (10 citations in total based on a keyword search) and books like ‘Work Rules!’ for a fair many claims.
Here are my notes for the book.
One conventional definition of group fairness is that the ML algorithms produce predictions where the FPR (or FNR or both) is the same across groups. Fixating on equating FPR etc. can harm the very groups we are trying to help. So it may be useful to rethink how to solve the problem of reducing unfairness.
One big reason why the FPR may vary across groups is that, given the data, some groups’ outcomes are less predictable than others. This may be because of the limitations of the data itself or because of the limitations of algorithms. For instance, Kearns and Roth in their book bring up the example of college admissions. The training data for college admissions is the decisions made by college counselors. College counselors may well be worse at predicting the success of minority students because they are less familiar with their schools, groups, etc., and this, in turn, may lead to algorithms performing worse on minority students. (Assume the algorithm to be human decision-makers and the point becomes immediately clear.)
One way to address worse performance may be to estimate the uncertainty of the prediction. This allows us to deal with people with wider confidence bounds separately from people with narrower confidence bounds. The optimal strategy for people with wider confidence bounds people may be to collect additional data to become more confident in those predictions. For instance, Komiyama and Noda propose something similar (pdf) to help overcome a lack of information during hiring. Or we may need to figure out a way to compensate people based on their uncertainty interval.
The average width of the uncertainty interval across groups may also serve as a reasonable way to diagnose this particular problem.
One definition of a fair algorithm is an algorithm that yields the same FPR across groups (an example of classification parity). To achieve that, we often have to trade in some accuracy. The final model is thus less accurate but fair. There are two concerns with such models:
- Net Harm Over Relative Harm: Because of lower accuracy, the number of people from a minority group that are unfairly rejected (say for a loan application) may be a lot higher. (This is ignoring the harm done to other groups.)
- Mismeasuring Harm? Consider an algorithm used to approve or deny loans. Say we get the same FPR across groups but lower accuracy for loans with a fair algorithm. Using this algorithm, however, means that credit is more expensive for everyone. This, in turn, may cause fewer people of the vulnerable group to get loans as the bank factors in the cost of mistakes. Another way to think about the point is that using such an algorithm causes net interest paid per borrowed dollar to increase by some number. It seems this common scenario is not discussed in many of the papers on fair ML. One reason for that may be that people are fixated on who gets approved and not the interest rate or total approvals.
“To get roughly 70% of the planet’s population inoculated by April, the IMF calculates, would cost just $50bn. The cumulative economic benefit by 2025, in terms of increased global output, would be $9trn, to say nothing of the many lives that would be saved.”https://www.economist.com/leaders/2021/06/09/the-west-is-passing-up-the-opportunity-of-the-century
The Economist frames this as an opportunity for G7. And it is. But it is also an opportunity for third-world countries, which plausibly can borrow $50bn given the return on investment. The fact that money hasn’t already been allocated poses a puzzle. Is it because governments think about borrowing decisions based on whether or not a policy is tax revenue positive (which a 180x return ought to be even with low tax collection and assessment rates)? Or is it because we don’t have a marketplace where we can transact on this information? If so, it seems like an important hole.
Here’s another way to look at this point. Among countries where the profits mostly go to a few, why do the people at the top not come to invest together so that they can harvest profits later? Brunei is probably an ok example.
Two papers estimate the impact of having a daughter on Members of Congress’ (MC’s) position on women’s issues. Washington (2008) finds that each additional daughter (conditional on the number of children) causes about a 2 point increase in liberalism on women’s issues using data from the 105th to 108th Congress. Costa et. al 2019 use data from 110th to 114th Congress to find there is a noisily estimated small effect that cannot be distinguished from zero.
Same Number, Different Interpretation
Washington (2008) argues that a 2 point effect is substantive. But Costa et al. argue that a 2–3 point change is not substantively meaningful.
“In all five specifications, the score increases by about two points with each additional daughter parented. For all but the 106th Congress, the number of female children coefficient is significantly different from zero at conventional levels. While that two point increase may seem small relative to the standard deviations of these scores, note that the female legislators, on average, score a significant seven to ten points higher on these rating scores. In other words, an additional daughter has about 25% of the impact on women’s issues that one’s own gender has.”From Washington 2008
“The lower bound of the confidence interval for the first coefficient in ModelFrom Costa et. al 2019
1, the effect of having a daughter on AAUW rating, is −3.07 and the upper
bound is 2.01, meaning that the increase on the 100-point AAUW scale for
fathers of daughters could be as high as 2.01 at the 90% level, but that AAUW
score could also decrease by as much as 3.07 points for fathers of daughters,
which is in the opposite direction than previous literature and theory would
have us expect. In both directions, neither the increase nor the decrease is
substantively very meaningful.“
The two papers—Washington’s and Costa et al.—come to different conclusions. But why? Besides different data, there are fair many other differences in modeling choices including (p.s. this is not a comprehensive list):
- How the number of children are controlled for. Washington uses fixed effects for the number of children. This makes sense if you conceive the number of daughters as a random variable within people with the same number of children. Another way to think of it is as a block randomized experiment. Costa et al. write, “Following Washington (2008), we also include a control variable for the total number of children a legislator has.” But control for it linearly.
- Dummy Vs. Number of Daughters. Costa et al. have a ‘has daughter’ dummy that codes as 1 any MC with 1 or more daughter while Washington uses the number of daughters as the ‘treatment’ variable.
The primary dependent variable is votes chosen by an interest group. Doing so causes multiple issues. The first is incommensurability across time. The chosen votes are different because not only is the selection process in choosing the votes is likely different but also the selection process that goes into what things come to vote. So it could be the case that the effect hasn’t changed but the measurement instrument has. The second issue is that interest groups are incredibly strategic in choosing the votes. And that means they choose votes that don’t always have a strong, direct, unique, and obvious relationship to women’s welfare. For instance, AAUW chose the vote to confirm Neil Gorsuch as one of the votes. There are likely numerous considerations that go into voting for Neil Gorsuch, including conflicting considerations about women’s welfare. For instance, a senator who supports the women’s right to choose may vote for Neil Gorsuch even if there is concern that the judge will vote against it because they may think Gorsuch would support liberalizing the economy further which will have a beneficial impact on women’s economic status, which the senator may view as more important. Third, the number of votes chosen is tiny. For the 115th Congress, for the Senate, there are only 7 votes and only 6 for the House of Representatives. Fourth, it seems the papers treat the House of Representatives and Senate interchangeably when the votes are different. Fifth, one of the issues with imputing ideology from congressional votes is that the issues over which people get to express preferences is limited. So the implied differences are generally smaller than actual ideological differences. The point affects how we interpret the results.
“[Q]uotas are often thought of as temporary measures, used to improve the lot of particular groups of people until they can take care of themselves.”Bhavnani 2011
So how quickly can be withdraw the quota? The answer depends—plausibly on space, office, and time.
“In West Bengal …[i]n 1998, every third G[ram] P[anchayat] starting with number 1 on each list was reserved for a woman, and in 2003 every third GP starting with number 2 on each list was reserved” (Beaman et al. 2012). Beaman et al. exploit this random variation to estimate the effect of reservation in prior election cycles on women being elected in the subsequent elections. They find that 1. just 4.8% of the elected ward councillors in non-reserved wards, 2. this number doesn’t change if a GP has been reserved once before, and 3. shoots up to a still-low 10.1% if the GP has been reserved twice before (see the last column of Table 11 below).
In a 2009 article, Bhavnani, however, finds a much larger impact of reservation in Mumbai ward elections. He finds that a ward being reserved just once before causes a nearly 18 point jump (see the table below) starting from a lower base than above (3.7%).
p.s. Despite the differences, Beaman et al. footnote Bhavnani’s findings as: “Bhavnani (2008) reports similar findings for urban wards of Mumbai, where previous reservation for women improved future representation of women on unreserved seats.”
Beaman et al. also find that reservations reduce men’s biases. However, a 2018 article by Amanda Clayton finds that this doesn’t hold true (though the CI are fairly wide) in Lesotho, Kenya.
Look Ma, I Connected Some Dots!
In late 2019, in a lecture at the Watson Center at Brown University, Raghuram Rajan spoke about the challenges facing the Indian economy. While discussing the trends in growth in the Indian economy (I have linked to the relevant section in the video. see below for the relevant slide), Mr. Rajan notes:
“We were growing really fast before the great recession, and then 2009 was a year of very poor growth. We started climbing a little bit after it, but since then, since about 2012, we have had a steady upward movement in growth going back to the pre-2000, pre-financial crisis growth rates. And then since about mid-2016 (GS: a couple of years after Mr. Modi became the PM), we have seen a steady deceleration.”Raghuram Rajan at the Watson Center at Brown in 2019 explaining the graph below
The statement is supported by the red lines that connect the deepest valleys with the highest peak, eagerly eliding over the enormous variation in between (see below).
See Something, Say Some Other Thing
Not to be left behind, Mr. Rajan’s interlocutor Mr. Subramanian shares the following slide about investment collapse. Note the title of the slide and then look at the actual slide. The title says that the investment (tallied by the black line) collapses in 2010 (before Mr. Modi became PM).
Proposition 13 enacted two key changes: 1. it limited property tax to 1% of the cash value, and 2. limited annual increase of assessed value to 2%. The only way the assessed value can change by more than 2% is if the property changes hands (a loophole allows you to change hands without officially changing hands).
Take out the extremes, and the variation is still hefty. Property taxes of neighboring lots often vary by well over $20k. ) My back-of-the-envelope estimate of standard deviation based on ten properties chosen at random is $23k.)
Sample another from Stanford where the range is from ~$2k to nearly $59k.
Prop. 13 has a variety of more material perverse consequences. Property taxes are one reason by people move from their suburban houses near the city to other more remote, cheaper places. But Prop. 13 reduces the need to move out. This likely increases property prices, which in turn likely lowers economic growth as employers choose other places. And as Chaste, a long-time contributor to the blog points out, it also means that the currently employed often have to commute longer distances, which harms the environment in addition to harming the families of those who commute.
p.s. Looking at the property tax data, you see some very small amounts. For instance, $19 property tax. When Chaste dug in, he found that the property was last sold in 1990 for $220K but was assessed at $0 in 2009 when it passed on to the government. The property tax on government-owned properties and affordable housing in California is zero. And Chaste draws out the implication: “poor cities like Richmond, which are packed with affordable housing, not only are disproportionately burdened because these populations require more services, they also receive 0 in property taxes from which to provide those services.”
p.p.s. My hunch is that a political campaign that uses property taxes in CA as a targeting variable will be very successful.
p.p.p.s. Chaste adds: “Prop 13 also applies to commercial properties. Thus, big corps also get their property tax increases capped at 2%. As a result, the sales are often structured in ways that nominally preserve existing ownership.
There was a ballot proposition on the November 2020 ballot, which would have removed Prop 13 protections for commercial properties worth more than $3M. Residential properties over $3M would continue to enjoy the protection. Even this prop failed 52%-48%. People were perhaps scared that this would be the first step in removing Prop 13 protections for their own homes.”
A new paper uses financial transaction data to estimate customer churn in consumer-facing companies. The paper defines churn as follows:
There are three concerns with the definition:
- The definition doesn’t make clear what is the normalizing constant for calculating the share. Given that the value “can vary between zero and one,” presumably the normalizing constant is either a) total revenue in the same year in which customer buys products, b) total revenue in the year in which the firm revenue was greater.
- If the denominator when calculating s_fit is the total revenue in the same year in which the customer buys products from the company, it can create a problem. Consider a case where there is a customer that spends $10 in both year t and year t-k. And assume that the firm’s revenue in the same years is $10 and $20 respectively. In this case, the customer hasn’t changed his/her behavior but their share has gone from 1 to .5.
- Beyond this, there is a semantic point. Churn is generally used to refer to attrition. In this case, it covers both customer acquisition and attrition. It also covers both a reduction and an increase in customer spending.
A Fun Aside
“Netflix similarly was not in one of our focused consumer-facing industries according to our SIC classification (it is found with two-digit SIC of 78, which mostly contains movie producers)” — this tracks with my judgment of Netflix.
Doubly Robust (DR) estimators of ATE are all the rage. One popular DR estimator is Robins’ Augmented IPW (AIPW). The reason why Robins’ AIPW estimator is called doubly robust is that if either your IPW model or your y ~ x model is correctly specified, you get ATE. Great!
Calling something “doubly robust” makes you think that the estimator is robust to (common) violations of commonly made assumptions. But DR replaces one strong assumption with one marginally less strong assumption. It is common to assume that IPW or Y ~ X is right. But DR replaces either of these with the OR clause. So how common is it to get either of the models right? Basically never.
(There is one more reason to worry about the use of the word ‘robust.’ In statistics, it is used to convey robustness of to violations of distributional assumptions.)
Given the small advance in assumptions, it turns out that the results aren’t better either (and can be substantially worse):
- “None of the DR methods we tried … improved upon the performance of simple regression-based prediction of the missing values. (see here.)
- “The methods with by far the worst performance with regard to RSMSE are the Doubly Robust (DR) approaches, whose RSMSE is two or three times as large as the RSMSE for the other estimators.” (see here and the relevant table is included below.)
Some people prefer DR for efficiency. But the claim for efficiency is based on strong assumptions being met: “The local semiparametric efficiency property, which guarantees that the solution to (9) is the best estimator within its class, was derived under the assumption that both models are correct. This estimate is indeed highly efficient when the π-model is true and the y-model is highly predictive.”
p.s. When I went through some of the lecture notes posted online, I was surprised that the lecture notes explain DR as “if A or B hold, we get ATE” but do not discuss the modal case.
But What About DML?
DML is a version of DR. DML is often used for causal inference from observational data. The worries when doing causal inference from observational data remain the same with DML:
- Measurement error in variables
- Controlling for post-treatment variables
- Controlling for ‘collider’ variables
- Slim chances of y~x and AIPW (or y ~ d) being correctly specified
Here’s a paper that delves into some of the issues using DAGs. (Added 10/2/2021.)
In a new paper, Jon Mellon reviews 185 papers that use weather as an instrument and finds that researchers have linked 137 variables to weather. You can read it as each paper needing to contend with 136 violations of the exclusion restriction, but the situation is likely less dire. For one, weather as an instrument has many varietals. Some papers use local (both in time and space) fluctuations in the weather for identification. At the other end, some use long-range (both in time and space) variations in weather, e.g., those wrought upon by climate. And the variables affected by each are very different. For instance, we don’t expect long-term ‘dietary diversity’ to be affected by short-term fluctuations in the local weather. A lot of the other variables are like that. For two, the weather’s potential pathways to the dependent variable of interest are often limited. For instance, as Jon notes, it is hard to imagine how rain on election day would affect government spending any other way except its effect on the election outcome.
There are, however, some potential general mechanisms through which exclusion restriction could be violated. The first that Jon identifies is also among the oldest conjecture in social science research—weather’s effect on mood. Except that studies that purport to show the effect of weather on mood are themselves subject to selective response, e.g., when the weather is bad, more people are likely to be home, etc.
There are some other more fundamental concerns with using weather as an instrument. First, when there are no clear answers on how an instrument should be (ahem!) instrumented, the first stage of IV is ripe for specification search. In such cases, people probably pick up the formulation that gives the largest F-stat. Weather falls firmly in this camp. For instance, there is a measurement issue about how to measure rain. Should it be the amount of rain or the duration of rain, or something else? And then there is a crudeness issue of the instrument as ideally, we would like to measure rain over every small geographic unit (of time and space). To create a summary measure from crude observations, we often need to make judgments, and it is plausible that judgments that lead to a larger F-stat. are seen as ‘better.’
Second, for instruments that are correlated in time, we need to often make judgments to regress out longer-term correlations. For instance, as Jon points out, studies that estimate the effect of rain on voting on election day may control long-term weather but not ‘medium term.’ “However, even short-term studies will be vulnerable to other mechanisms acting at time periods not controlled for. For instance, many turnout IV studies control for the average weather on that day of the year over the previous decade. However, this does not account for the fact that the weather on election day will be correlated with the weather over the past week or month in that area. This means that medium-term weather effects will still potentially confound short-term studies.”
The concern is wider and includes some of the RD designs that measure the effect of ad exposure on voting, etc.
It has taken me a long time to realize that renting books is the way to go for most books. The frequency with which I go back to a book is so low that I don’t really see any returns on permanent possession that accrue from the ability to go back.
Renting also has the virtue of disciplining me: I rent when I am ready to read and it incentives me to finish the book (or graze and assess whether the book is worth finishing) before the rental period expires.
For e-books, my format of choice, buying a book is even less attractive. One reason why people buy a book is for the social returns from displaying the book on a bookshelf. E-books don’t provide that, though in time people may devise mechanisms to do just that. Another reason why people prefer buying books is that they want something ‘new.’ Once again, the concern doesn’t apply to e-books.
From a seller’s perspective, renting has the advantage of expanding the market. Sellers get money from people who would otherwise not buy the book. These people may, instead, substitute it by copying the book or borrowing it from a friend or a library or getting similar content elsewhere, e.g., Youtube or other (cheaper) books, or they may simply forgo reading the book.
The following few paragraphs are from Sociation Today:
Using the work of Elliot (et al. 1996), Gladwell compares the proportion of each class which gets a STEM degree compared to the math SAT at Hartwick College and Harvard University. Here is what he presents for Hartwick:
Students at Hartwick College
|STEM Majors||Top Third||Middle Third||Bottom Third|
So the top third of students with the Math SAT as the measure earn over half the science degrees.
What about Harvard? It would be expected that Harvard students would have much higher Math SAT scores and thus the distribution would be quite different. Here are the data for Harvard:
Students at Harvard University
|STEM Majors||Top Third||Middle Third||Bottom Third|
Gladwell states the obvious, in italics, “Harvard has the same distribution of science degrees as Hartwick,” p. 83.
Using his reference theory of being a big fish in a small pond, Gladwell asked Ms. Sacks what would have happened if she had gone to the University of Maryland and not Brown. She replied, “I’d still be in science,” p. 94.
Gladwell focuses on the fact that the bottom-third at Harvard is the same as the top third at Hartwick. And points to the fact that they graduate at very different rates. It is a fine point. But there is more to the data. The top-third at Harvard have much higher SAT scores than the top-third at Hartwick. Why is it the case that they graduate with a STEM degree at the same rate as the top-third at Hartwick? One answer to that is that STEM degrees at Harvard are harder. So harder coursework at Harvard (vis-a-vis Hartwick) is another explanation for the pattern we see in the data and, in fact, fits the data better as it explains the performance of the top-third at Harvard.
Here’s another way to put the point: If preferences for graduating in STEM are solely and almost deterministically explained by Math SAT scores, like Gladwell implicitly assumes, and the major headwinds are because of relative standing, then we should see a much higher STEM graduation rate for the top-third at Harvard. We should ideally see an intercept shift across schools, which we don’t see, but a common differential between the top and the bottom third.
“She took a position—which has actually become very popular in IndiaAmartya Sen
now, not coming from the left these days, but from the right—that what you have to concentrate on is simply maximizing economic growth. Once you have grown and become rich, then you can do health care, education, and all this other stuff. Which I think is one of the more profound errors that you can make in development planning. Somehow Joan had a lot of sympathy for that position. In fact, she strongly criticized Sri Lanka for offering highly subsidized food to everyone on nutritional grounds. I remember the phrase she used: “Sri Lanka is trying to taste the fruit of
the tree without growing it.”
“On the unemployment issue I may well be, but if I compare an economistAmartya Sen
like Keynes, who never took a serious interest in inequality, in poverty, in the environment, with Pigou, who took an interest in all of them, I don’t think I would be able to say exactly what you are asking me to say.”
On the 1943 Bengal Famine, the last big famine in India in which ~ 3M people perished:
“Basically I had figured out on the basis of the little information I had (that indeed
everyone had) that the problem was not that the British had the wrong data, but that their theory of famine was completely wrong. The government was claiming that there was so much food in Bengal that there couldn’t be a famine. Bengal, as a whole, did indeed have a lot of food—that’s true. But that’s supply; there’s also demand, which was going up and up rapidly, pushing prices sky-high. Those left behind in a boom economy—a boom generated by the war—lost out in the competition for buying food.”
“I learned also—which I knew as a child—that you could have a famine with a lot of food around. And how the country is governed made a difference. The British did not want rebellion in Calcutta. I believe no one of Calcutta died in the famine. People died in Calcutta, but they were not of Calcutta. They came from elsewhere, because what little charity there was came from Indian businessmen based in Calcutta. The starving peopleAmartya Sen
kept coming into Calcutta in search of free food, but there was really not much of that. The Calcutta people were entirely protected by the Raj to prevent discontent of established people during the war. Three million people in Calcutta had ration cards, which entailed that at least six million people were being fed at a very subsidized price of food. What the government did was to buy rice at whatever price necessary to purchase it in the rural areas, making the rural prices shoot up. The price of rationed food in Calcutta for established residents was very low and highly subsidized, though the market price in Calcutta—outside the rationing network—rose with the rural price increase.”
On John Smith
“He discussed why you have to think pragmatically about the different institutions to be combined together, paying close attention to how they respectively work. There’s a passage where he’s asking himself the question, Why do we strongly want a good political economy? Why is it important? One answer—not the only one—is that it will lead to high economic growth (this is my language, not Smith’s). I’m not quoting his words, but he talks about the importance of high growth, high rate of progress. But why is that important? He says it’s important for two distinct reasons. First, it gives the individual more income, which in turn helps people to do what they would value doing. Smith is talking here about people having more capability. He doesn’t use the word capability, but that’s what he is talking about here. More income helps you to choose the kind of life that you’d like to lead. Second, it gives the state (which he greatly valued as an institution when properly used) more revenue, allowing it to do those things which only the state can do well. As an example, he talks about the state being able to provide free school education.”Amartya Sen
In Thinking Fast and Slow, Kahneman recounts a time when Thaler, Amos, and he met a senior investment manager in 1984. Kahneman asked, “When you sell a stock, who buys it?”
“[The investor] answered with a wave in the vague direction of the window, indicating that he expected the buyer to be someone else very much like him. That was odd: What made one person buy, and the other person sell? What did the sellers think they knew that the buyers did not? [gs: and vice versa.]”
“… It is not unusual for more than 100M shares of a single stock to change hands in one day. Most of the buyers and sellers know that they have the same information; they exchange the stocks primarily because they have different opinions. The buyers think the price is too low and likely to rise, while the sellers think the price is high and likely to drop. The puzzle is why buyers and sellers alike think that the current price is wrong. What makes them believe they know more about what the price should be than the market does? For most of them, that belief is an illusion.”Thinking Fast and Slow. Daniel Kahneman
Note: Kahneman is not just saying that buyers and sellers have the same information but that they also know they have the same information.
There is a 1982 counterpart to Kahneman’s observation in the form of Paul Milgrom and Nancy Stokey’s paper on the No-Trade Theorem. “[If] [a]ll the traders in the market are rational, and thus they know that all the prices are rational/efficient; therefore, anyone who makes an offer to them must have special knowledge, else why would they be making the offer? Accepting the offer would make them a loser. All the traders will reason the same way, and thus will not accept any offers.”