Liberalizing Daughters: Do Daughters Cause MCs to be Slightly More Liberal on Women’s Issues?

25 Dec

Two papers estimate the impact of having a daughter on Members of Congress's (MCs') positions on women's issues. Washington (2008) finds that each additional daughter (conditional on the number of children) causes about a 2-point increase in liberalism on women's issues, using data from the 105th to 108th Congresses. Costa et al. (2019) use data from the 110th to 114th Congresses and find a noisily estimated small effect that cannot be distinguished from zero.

Same Number, Different Interpretation

Washington (2008) argues that a 2-point effect is substantive. Costa et al., by contrast, argue that a 2–3 point change is not substantively meaningful.

“In all five specifications, the score increases by about two points with each additional daughter parented. For all but the 106th Congress, the number of female children coefficient is significantly different from zero at conventional levels. While that two point increase may seem small relative to the standard deviations of these scores, note that the female legislators, on average, score a significant seven to ten points higher on these rating scores. In other words, an additional daughter has about 25% of the impact on women’s issues that one’s own gender has.”

From Washington 2008

“The lower bound of the confidence interval for the first coefficient in Model 1, the effect of having a daughter on AAUW rating, is −3.07 and the upper bound is 2.01, meaning that the increase on the 100-point AAUW scale for fathers of daughters could be as high as 2.01 at the 90% level, but that AAUW score could also decrease by as much as 3.07 points for fathers of daughters, which is in the opposite direction than previous literature and theory would have us expect. In both directions, neither the increase nor the decrease is substantively very meaningful.”

From Costa et al. 2019

Different Numbers

The two papers—Washington’s and Costa et al.’s—come to different conclusions. But why? Besides using different data, there are a fair number of other differences in modeling choices, including (this is not a comprehensive list):

  1. How the number of children is controlled for. Washington uses fixed effects for the number of children. This makes sense if you conceive of the number of daughters as a random variable among people with the same number of children. Another way to think of it is as a block-randomized experiment. Costa et al. write, “Following Washington (2008), we also include a control variable for the total number of children a legislator has.” But they control for it linearly.
  2. Dummy vs. number of daughters. Costa et al. use a ‘has daughter’ dummy that codes as 1 any MC with one or more daughters, while Washington uses the number of daughters as the ‘treatment’ variable.
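To make the first difference concrete, here is a minimal simulation, with made-up numbers, of the block-randomization logic: conditional on the number of children, the number of daughters is as-if random, so fixed effects for the number of children recover the (assumed) effect. All parameter values here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical setup: each MC has 1-4 children; conditional on that,
# each child is a daughter with p = 0.5, so daughters are as-if
# randomized within number-of-children blocks.
kids = rng.integers(1, 5, n)
daughters = rng.binomial(kids, 0.5)

# Assume a true 2-point effect of each daughter on an AAUW-style score,
# plus a confounding effect of family size.
score = 50 + 2 * daughters - 3 * kids + rng.normal(0, 10, n)

# Washington-style specification: fixed effects for number of children
# (the dummies span the intercept, so no separate constant is needed).
X = np.column_stack([daughters] + [(kids == k).astype(float) for k in range(1, 5)])
beta_fe, *_ = np.linalg.lstsq(X, score, rcond=None)

print(round(beta_fe[0], 2))  # close to the assumed effect of 2
```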

Common Issues

The primary dependent variable is a score based on votes chosen by an interest group. This causes multiple issues:

  1. Incommensurability across time. The chosen votes differ across Congresses: not only is the interest group’s process for selecting votes likely different, but so is the process that determines which things come to a vote at all. So it could be the case that the effect hasn’t changed but the measurement instrument has.
  2. Interest groups are incredibly strategic in choosing votes. That means they choose votes that don’t always have a strong, direct, unique, and obvious relationship to women’s welfare. For instance, the AAUW chose the vote to confirm Neil Gorsuch as one of the votes. Numerous considerations likely go into a vote on Gorsuch, including conflicting considerations about women’s welfare. A senator who supports women’s right to choose may vote for Gorsuch despite concerns that the judge will rule against it, reasoning that Gorsuch would support liberalizing the economy further, which would benefit women’s economic status, which the senator may view as more important.
  3. The number of votes chosen is tiny. For the 115th Congress, there are only 7 votes for the Senate and only 6 for the House of Representatives.
  4. The papers seem to treat the House of Representatives and the Senate interchangeably even though the votes are different.

It Depends! Effect of Quotas On Women’s Representation

25 Dec

“[Q]uotas are often thought of as temporary measures, used to improve the lot of particular groups of people until they can take care of themselves.”

Bhavnani 2011

So how quickly can we withdraw the quota? The answer depends—plausibly on space, office, and time.

“In West Bengal …[i]n 1998, every third G[ram] P[anchayat] starting with number 1 on each list was reserved for a woman, and in 2003 every third GP starting with number 2 on each list was reserved” (Beaman et al. 2012). Beaman et al. exploit this random variation to estimate the effect of reservation in prior election cycles on women being elected in subsequent elections. They find that: 1. just 4.8% of elected ward councillors in never-reserved wards are women, 2. this number doesn’t change if a GP has been reserved once before, and 3. it shoots up to a still-low 10.1% if the GP has been reserved twice before (see the last column of Table 11 below).

From Beaman et al. 2012

In a 2009 article, Bhavnani, however, finds a much larger impact of reservation in Mumbai ward elections. He finds that a ward being reserved just once before causes a nearly 18 point jump (see the table below) starting from a lower base than above (3.7%).

From Bhavnani 2009

p.s. Despite the differences, Beaman et al. footnote Bhavnani’s findings as: “Bhavnani (2008) reports similar findings for urban wards of Mumbai, where previous reservation for women improved future representation of women on unreserved seats.”

Beaman et al. also find that reservations reduce men’s biases. However, a 2018 article by Amanda Clayton finds that this doesn’t hold true in Lesotho.

From Clayton 2018

Political Macroeconomics

25 Dec

Look Ma, I Connected Some Dots!

In late 2019, in a lecture at the Watson Center at Brown University, Raghuram Rajan spoke about the challenges facing the Indian economy. While discussing the trends in growth in the Indian economy (I have linked to the relevant section of the video; see below for the relevant slide), Mr. Rajan notes:

“We were growing really fast before the great recession, and then 2009 was a year of very poor growth. We started climbing a little bit after it, but since then, since about 2012, we have had a steady upward movement in growth going back to the pre-2000, pre-financial crisis growth rates. And then since about mid-2016 (GS: a couple of years after Mr. Modi became the PM), we have seen a steady deceleration.”

Raghuram Rajan at the Watson Center at Brown in 2019 explaining the graph below

The statement is supported by the red lines that connect the deepest valleys with the highest peaks, eagerly eliding the enormous variation in between (see below).

See Something, Say Some Other Thing

Not to be left behind, Mr. Rajan’s interlocutor Mr. Subramanian shares the following slide about investment collapse. Note the title of the slide and then look at the actual slide. The title says that the investment (tallied by the black line) collapses in 2010 (before Mr. Modi became PM).

Epilogue

If you are looking to learn more about some of the common techniques people use to lie with charts, you can read How Charts Lie. (You can read my notes on the book here.)

No Props for Prop 13

14 Dec

Proposition 13 enacted two key changes: 1. it limited property tax to 1% of the cash value, and 2. limited annual increase of assessed value to 2%. The only way the assessed value can change by more than 2% is if the property changes hands (a loophole allows you to change hands without officially changing hands). 
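To see how the two rules compound over time, here is a back-of-the-envelope sketch with hypothetical numbers (the purchase price, holding period, and market appreciation rate are all made up):

```python
# Hypothetical illustration of how Prop 13's 2% assessment cap creates
# large tax gaps between identical neighboring homes. Numbers are made up.
purchase_price = 500_000  # both homes worth this at purchase
years = 30
market_growth = 0.07      # assumed annual market appreciation
cap = 0.02                # Prop 13 cap on assessed-value growth
tax_rate = 0.01           # Prop 13's 1% levy

market_value = purchase_price * (1 + market_growth) ** years
assessed_longtime = purchase_price * (1 + cap) ** years  # never sold
assessed_newbuyer = market_value                         # just changed hands

tax_longtime = tax_rate * assessed_longtime
tax_newbuyer = tax_rate * assessed_newbuyer
print(round(tax_longtime), round(tax_newbuyer))
```

Under these assumptions, the new buyer of an identical house pays more than four times the long-time owner's tax.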

One impressive result of the tax regime is the inequality in taxes. Sample this neighborhood in San Mateo, where taxes range from $67 to nearly $300k.

Take out the extremes, and the variation is still hefty. Property taxes of neighboring lots often vary by well over $20k. (My back-of-the-envelope estimate of the standard deviation, based on ten properties chosen at random, is $23k.)

Sample another from Stanford where the range is from -$2k (not clear what causes negative taxes) to nearly $59k.

Prop. 13 has a variety of more material perverse consequences. Property taxes are one reason why people move from their suburban houses near the city to more remote, cheaper places. But Prop. 13 reduces the need to move out. This likely increases property prices, which in turn likely lowers economic growth as employers choose other places. And as Chaste, a long-time contributor to the blog, points out, it also means that the currently employed often have to commute longer distances, which harms the environment in addition to harming the families of those who commute.

p.s. Looking at the property tax data, you see some very small amounts. For instance, $19 property tax. When Chaste dug in, he found that the property was last sold in 1990 for $220K but was assessed at $0 in 2009 when it passed on to the government. The property tax on government-owned properties and affordable housing in California is zero. And Chaste draws out the implication: “poor cities like Richmond, which are packed with affordable housing, not only are disproportionately burdened because these populations require more services, they also receive 0 in property taxes from which to provide those services.”

p.p.s. My hunch is that a political campaign that uses property taxes in CA as a targeting variable will be very successful.

p.p.p.s. Chaste adds: “Prop 13 also applies to commercial properties. Thus, big corps also get their property tax increases capped at 2%. As a result, the sales are often structured in ways that nominally preserves existing ownership.

There was a ballot proposition on the November 2020 ballot, which would have removed Prop 13 protections for commercial properties worth more than $3M. Residential properties over $3M would continue to enjoy the protection. Even this prop failed 52%-48%. People were perhaps scared that this would be the first step in removing Prop 13 protections for their own homes.”

Too Much Churn: Estimating Customer Churn

18 Nov

A new paper uses financial transaction data to estimate customer churn in consumer-facing companies. The paper defines churn as follows:

There are three concerns with the definition:

  1. The definition doesn’t make clear what the normalizing constant for calculating the share is. Given that the value “can vary between zero and one,” presumably the normalizing constant is either a) the firm’s total revenue in the same year in which the customer buys products, or b) the firm’s total revenue in the year in which it was greater.
  2. If the denominator for calculating s_fit is the firm’s total revenue in the same year in which the customer buys products from the company, it can create a problem. Consider a customer who spends $10 in both year t−k and year t. And assume that the firm’s revenue in those years is $10 and $20, respectively. The customer hasn’t changed their behavior, but their share has gone from 1 to .5.
  3. Beyond this, there is a semantic point. Churn is generally used to refer to attrition. Here, the measure covers both customer acquisition and attrition, and both reductions and increases in customer spending.
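The denominator concern in point 2 can be made concrete with a tiny sketch. The `share` function below is my guess at the paper's normalization (customer spend over same-year firm revenue), not its actual definition:

```python
# Sketch of the denominator concern: share is the customer's spend
# divided by the firm's total revenue in the same year -- one reading
# of the paper's normalizing constant.
def share(customer_spend, firm_revenue):
    return customer_spend / firm_revenue

# A customer spends $10 in both year t-k and year t...
s_then = share(10, 10)  # firm revenue was $10 in year t-k
s_now = share(10, 20)   # firm revenue doubled to $20 by year t

# ...their behavior is unchanged, but their share halves, which a
# share-based churn measure would read as churn.
print(s_then, s_now)  # 1.0 0.5
```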

A Fun Aside

“Netflix similarly was not in one of our focused consumer-facing industries according to our SIC classification (it is found with two-digit SIC of 78, which mostly contains movie producers)” — this tracks with my judgment of Netflix.

Not So Robust: The Limitations of “Doubly Robust” ATE Estimators

16 Nov

Doubly Robust (DR) estimators of ATE are all the rage. One popular DR estimator is Robins’ Augmented IPW (AIPW). The reason why Robins’ AIPW estimator is called doubly robust is that if either your IPW model or your y ~ x model is correctly specified, you get ATE. Great!

Calling something “doubly robust” (DR) makes you think that the estimator is robust to (common) violations of commonly made assumptions. But DR merely replaces one strong assumption with a marginally less strong one. It is common to assume that the IPW model or the y ~ x model is right. DR replaces either of these assumptions with their OR. So how common is it to get at least one of the models right? Basically never. And if neither model is right, the bias terms multiply, which ought to blow up the bias.
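For concreteness, here is a minimal numpy sketch of Robins' AIPW estimator on simulated data, with both nuisance models deliberately set to the truth (the best case, not the modal one). All data-generating choices are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Simulated data with known truth: ATE = 2.
x = rng.normal(0, 1, n)
e = 1 / (1 + np.exp(-x))          # true propensity score
t = rng.binomial(1, e)
y = 1 + 2 * t + x + rng.normal(0, 1, n)

# Plug in the *true* nuisance functions (both models "correct").
m1 = 1 + 2 + x                    # E[Y | T=1, X]
m0 = 1 + x                        # E[Y | T=0, X]

# Robins' AIPW estimator of the ATE: regression prediction plus
# inverse-propensity-weighted residual corrections.
aipw = np.mean(m1 - m0
               + t * (y - m1) / e
               - (1 - t) * (y - m0) / (1 - e))
print(round(aipw, 2))
```

When either plug-in model is misspecified, the correction terms no longer vanish in expectation, which is where the bias discussed above comes from.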

(There is one more reason to worry about the use of the word ‘robust.’ In statistics, it is typically used to convey robustness to violations of distributional assumptions.)

Given the small advance in assumptions, it turns out that the results aren’t better either (and can be substantially worse):

  1. “None of the DR methods we tried … improved upon the performance of simple regression-based prediction of the missing values.” (see here.)
  2. “The methods with by far the worst performance with regard to RSMSE are the Doubly Robust (DR) approaches, whose RSMSE is two or three times as large as the RSMSE for the other estimators.” (see here; the relevant table is included below.)

From Kern et al. 2016

Some people prefer DR for efficiency. But the claim for efficiency is based on strong assumptions being met: “The local semiparametric efficiency property, which guarantees that the solution to (9) is the best estimator within its class, was derived under the assumption that both models are correct. This estimate is indeed highly efficient when the π-model is true and the y-model is highly predictive.”

p.s. When I went through some of the lecture notes posted online, I was surprised that the lecture notes explain DR as “if A or B hold, we get ATE” but do not discuss the modal case.

Instrumental Music: When It Rains, It Pours

23 Oct

In a new paper, Jon Mellon reviews 185 papers that use weather as an instrument and finds that researchers have linked 137 variables to weather. You could read that as each paper needing to contend with 136 potential violations of the exclusion restriction, but the situation is likely less dire. For one, weather as an instrument comes in many varieties. Some papers use local (in both time and space) fluctuations in the weather for identification. At the other end, some use long-range (in both time and space) variations in weather, e.g., those wrought by climate. And the variables affected by each are very different. For instance, we don’t expect long-term ‘dietary diversity’ to be affected by short-term fluctuations in the local weather. A lot of the other variables are like that. For two, the weather’s potential pathways to the dependent variable of interest are often limited. For instance, as Jon notes, it is hard to imagine how rain on election day would affect government spending in any way other than via its effect on the election outcome.

There are, however, some potential general mechanisms through which the exclusion restriction could be violated. The first that Jon identifies is also among the oldest conjectures in social science research—weather’s effect on mood. Except that the studies that purport to show the effect of weather on mood are themselves subject to selective response, e.g., when the weather is bad, more people are likely to be home, etc.

There are some other, more fundamental concerns with using weather as an instrument. First, when there are no clear answers on how an instrument should be (ahem!) instrumented, the first stage of the IV is ripe for specification search. In such cases, people probably pick the formulation that gives the largest F-stat. Weather falls firmly in this camp. For instance, there is a measurement issue about how to measure rain: should it be the amount of rain, the duration of rain, or something else? And then there is a crudeness issue: ideally, we would like to measure rain over every small geographic unit (of time and space). To create a summary measure from crude observations, we often need to make judgment calls, and it is plausible that the judgments that lead to a larger F-stat are seen as ‘better.’
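To see how picking the formulation with the largest F-stat inflates apparent instrument strength, here is a toy simulation. For illustration it assumes the worst case: none of the candidate "rain" measures actually predicts the endogenous variable, so any strength is pure selection:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, sims = 1_000, 10, 500

# A researcher tries k candidate "rain" measures (amount, duration,
# a dummy for any rain, ...), all pure noise here, and reports the
# first stage with the largest F-statistic.
picked_F, honest_F = [], []
for _ in range(sims):
    x = rng.normal(0, 1, n)           # endogenous regressor
    rains = rng.normal(0, 1, (k, n))  # candidate instruments, no real first stage
    Fs = []
    for z in rains:
        # First-stage F for a single instrument is t^2 from x ~ z.
        b = np.dot(z, x) / np.dot(z, z)
        resid = x - b * z
        se = np.sqrt(np.dot(resid, resid) / (n - 2) / np.dot(z, z))
        Fs.append((b / se) ** 2)
    picked_F.append(max(Fs))   # specification-searched F
    honest_F.append(Fs[0])     # one pre-specified measure

print(round(np.mean(picked_F), 1), round(np.mean(honest_F), 1))
```

With no true first stage, the pre-specified F averages about 1, while the searched-over F is several times larger.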

Second, for instruments that are correlated in time, we often need to make judgments about how to regress out longer-term correlations. For instance, as Jon points out, studies that estimate the effect of rain on election-day voting may control for long-term weather but not the ‘medium term’: “However, even short-term studies will be vulnerable to other mechanisms acting at time periods not controlled for. For instance, many turnout IV studies control for the average weather on that day of the year over the previous decade. However, this does not account for the fact that the weather on election day will be correlated with the weather over the past week or month in that area. This means that medium-term weather effects will still potentially confound short-term studies.”

The concern is wider and includes some of the RD designs that measure the effect of ad exposure on voting, etc.

Rent-seeking: Why It Is Better to Rent than Buy Books

4 Oct

It has taken me a long time to realize that renting books is the way to go for most books. The frequency with which I go back to a book is so low that I don’t really see any returns on permanent possession that accrue from the ability to go back.

Renting also has the virtue of disciplining me: I rent when I am ready to read, and it incentivizes me to finish the book (or graze and assess whether the book is worth finishing) before the rental period expires.

For e-books, my format of choice, buying a book is even less attractive. One reason why people buy a book is for the social returns from displaying the book on a bookshelf. E-books don’t provide that, though in time people may devise mechanisms to do just that. Another reason why people prefer buying books is that they want something ‘new.’ Once again, the concern doesn’t apply to e-books.

From a seller’s perspective, renting has the advantage of expanding the market. Sellers get money from people who would otherwise not buy the book. These people may instead substitute by copying the book, borrowing it from a friend or a library, getting similar content elsewhere (e.g., YouTube or other, cheaper books), or simply forgoing the book.

STEMing the Rot: Does Relative Deprivation Explain Low STEM Graduation Rates at Top Schools?

26 Sep

The following few paragraphs are from Sociation Today:


Using the work of Elliot (et al. 1996), Gladwell compares the proportion of each class which gets a STEM degree compared to the math SAT at Hartwick College and Harvard University.  Here is what he presents for Hartwick:

Students at Hartwick College

STEM Majors     Top Third    Middle Third    Bottom Third
Math SAT        569          472             407
STEM degrees    55.0%        27.1%           17.8%

So the top third of students with the Math SAT as the measure earn over half the science degrees. 

What about Harvard? It would be expected that Harvard students would have much higher Math SAT scores and thus the distribution would be quite different. Here are the data for Harvard:

Students at Harvard University

STEM Majors     Top Third    Middle Third    Bottom Third
Math SAT        753          674             581
STEM degrees    53.4%        31.2%           15.4%

Gladwell states the obvious, in italics, “Harvard has the same distribution of science degrees as Hartwick,” p. 83.

Using his reference theory of being a big fish in a small pond, Gladwell asked Ms. Sacks what would have happened if she had gone to the University of Maryland and not Brown. She replied, “I’d still be in science,” p. 94.


Gladwell focuses on the fact that the bottom third at Harvard has the same Math SAT scores as the top third at Hartwick, and he points to the fact that the two groups graduate with STEM degrees at very different rates. It is a fine point. But there is more to the data. The top third at Harvard has much higher SAT scores than the top third at Hartwick. Why, then, do they graduate with a STEM degree at the same rate as the top third at Hartwick? One answer is that STEM coursework at Harvard is harder. Harder coursework at Harvard (vis-à-vis Hartwick) is thus another explanation for the pattern in the data, and, in fact, it fits the data better, as it also explains the performance of the top third at Harvard.

Here’s another way to put the point: if the propensity to graduate in STEM were solely and almost deterministically explained by Math SAT scores, as Gladwell implicitly assumes, and the major headwind were relative standing, then we should see a much higher STEM graduation rate for the top third at Harvard than for the top third at Hartwick. That is, we should see an intercept shift across schools, along with a common differential between the top and bottom thirds. We don’t see that intercept shift.

Amartya Sen on Keynes, Robinson, Smith, and the Bengal Famine

17 Aug

Sen in conversation with Angus Deaton and Tim Besley (pdf and video).

Excerpts:

On Joan Robinson

“She took a position—which has actually become very popular in India now, not coming from the left these days, but from the right—that what you have to concentrate on is simply maximizing economic growth. Once you have grown and become rich, then you can do health care, education, and all this other stuff. Which I think is one of the more profound errors that you can make in development planning. Somehow Joan had a lot of sympathy for that position. In fact, she strongly criticized Sri Lanka for offering highly subsidized food to everyone on nutritional grounds. I remember the phrase she used: “Sri Lanka is trying to taste the fruit of the tree without growing it.””

Amartya Sen

On Keynes:

“On the unemployment issue I may well be, but if I compare an economist like Keynes, who never took a serious interest in inequality, in poverty, in the environment, with Pigou, who took an interest in all of them, I don’t think I would be able to say exactly what you are asking me to say.”

Amartya Sen

On the 1943 Bengal Famine, the last big famine in India, in which ~3M people perished:

“Basically I had figured out on the basis of the little information I had (that indeed everyone had) that the problem was not that the British had the wrong data, but that their theory of famine was completely wrong. The government was claiming that there was so much food in Bengal that there couldn’t be a famine. Bengal, as a whole, did indeed have a lot of food—that’s true. But that’s supply; there’s also demand, which was going up and up rapidly, pushing prices sky-high. Those left behind in a boom economy—a boom generated by the war—lost out in the competition for buying food.”

“I learned also—which I knew as a child—that you could have a famine with a lot of food around. And how the country is governed made a difference. The British did not want rebellion in Calcutta. I believe no one of Calcutta died in the famine. People died in Calcutta, but they were not of Calcutta. They came from elsewhere, because what little charity there was came from Indian businessmen based in Calcutta. The starving people kept coming into Calcutta in search of free food, but there was really not much of that. The Calcutta people were entirely protected by the Raj to prevent discontent of established people during the war. Three million people in Calcutta had ration cards, which entailed that at least six million people were being fed at a very subsidized price of food. What the government did was to buy rice at whatever price necessary to purchase it in the rural areas, making the rural prices shoot up. The price of rationed food in Calcutta for established residents was very low and highly subsidized, though the market price in Calcutta—outside the rationing network—rose with the rural price increase.”

Amartya Sen

On Adam Smith

“He discussed why you have to think pragmatically about the different institutions to be combined together, paying close attention to how they respectively work. There’s a passage where he’s asking himself the question, Why do we strongly want a good political economy? Why is it important? One answer—not the only one—is that it will lead to high economic growth (this is my language, not Smith’s). I’m not quoting his words, but he talks about the importance of high growth, high rate of progress. But why is that important? He says it’s important for two distinct reasons. First, it gives the individual more income, which in turn helps people to do what they would value doing. Smith is talking here about people having more capability. He doesn’t use the word capability, but that’s what he is talking about here. More income helps you to choose the kind of life that you’d like to lead. Second, it gives the state (which he greatly valued as an institution when properly used) more revenue, allowing it to do those things which only the state can do well. As an example, he talks about the state being able to provide free school education.”

Amartya Sen

Trading On Overconfidence

2 May

In Thinking, Fast and Slow, Kahneman recounts a time when Thaler, Amos, and he met a senior investment manager in 1984. Kahneman asked, “When you sell a stock, who buys it?”

“[The investor] answered with a wave in the vague direction of the window, indicating that he expected the buyer to be someone else very much like him. That was odd: What made one person buy, and the other person sell? What did the sellers think they knew that the buyers did not? [gs: and vice versa.]”

“… It is not unusual for more than 100M shares of a single stock to change hands in one day. Most of the buyers and sellers know that they have the same information; they exchange the stocks primarily because they have different opinions. The buyers think the price is too low and likely to rise, while the sellers think the price is high and likely to drop. The puzzle is why buyers and sellers alike think that the current price is wrong. What makes them believe they know more about what the price should be than the market does? For most of them, that belief is an illusion.”

Thinking, Fast and Slow, Daniel Kahneman

Note: Kahneman is not just saying that buyers and sellers have the same information but that they also know they have the same information.

There is a 1982 counterpart to Kahneman’s observation in the form of Paul Milgrom and Nancy Stokey’s paper on the No-Trade Theorem. “[If] [a]ll the traders in the market are rational, and thus they know that all the prices are rational/efficient; therefore, anyone who makes an offer to them must have special knowledge, else why would they be making the offer? Accepting the offer would make them a loser. All the traders will reason the same way, and thus will not accept any offers.”

The Puzzle of Price Dispersion on Amazon

29 Mar

Price dispersion is an excellent indicator of transactional frictions. It isn’t that, absent price dispersion, we can confidently say that frictions are negligible. Frictions can be substantial even when price dispersion is zero. For instance, if search costs are high enough to make searching irrational, all the sellers will price the good at the buyer’s Willingness To Pay (WTP). Third-world tourist markets, which are full of hawkers selling the same thing at the same price, are good examples of that. But when price dispersion exists, we can be reasonably sure that there are frictions in transacting. This is what makes the existence of substantial price dispersion on Amazon compelling.

Amazon makes price discovery easy, controls some aspects of quality by kicking out sellers who don’t adhere to its policies and provides reasonable indicators of quality of service with its user ratings. But still, on nearly all items that I looked at, there was substantial price dispersion. Take, for instance, the market for a bottle of Nature Made B12 vitamins. Prices go from $8.40 to nearly $30! With taxes, the dispersion is yet greater. If the listing costs are non-zero, it is not immediately clear why sellers selling the product at $30 are in the market. It could be that the expected service quality for the $30 seller is higher except that between the highest price seller and the next highest price seller, the ratings of the highest price seller are lower (take a look at shipping speed as well). And I would imagine that the ratings (and the quality) of Amazon, which comes in with the lowest price, are the highest. More generally, I have a tough time thinking about aspects of service and quality that are worth so much that the range of prices goes from 1x to 4x for a branded bottle of vitamin pills.

One plausible explanation is that the lowest price seller has a non-zero probability of being out of stock. And the more expensive and worse-quality sellers are there to catch these low probability events. They set a price that is profitable for them. One way to think about it is that the marginal cost of additional supply rises in the way the listed prices show. If true, then there seems to be an opportunity to make money. And it is possible that Amazon is leaving money on the table.
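A back-of-the-envelope sketch of that story, with entirely hypothetical prices, stockout probabilities, and unit cost:

```python
# Sketch of the stockout story: a high-priced listing can be profitable
# if it sells only when cheaper sellers are out of stock.
# All numbers are hypothetical.
sellers = [  # (price, probability the seller is out of stock)
    (8.40, 0.10),
    (12.00, 0.20),
    (30.00, 0.50),
]
unit_cost = 6.00  # assumed, same for every seller

# The buyer purchases from the cheapest in-stock seller, so a seller
# makes the sale only when every cheaper seller is out of stock.
expected_profits = []
p_cheaper_all_out = 1.0
for price, p_out in sellers:
    p_sale = p_cheaper_all_out * (1 - p_out)
    expected_profits.append(p_sale * (price - unit_cost))
    p_cheaper_all_out *= p_out

print([round(p, 2) for p in expected_profits])  # [2.16, 0.48, 0.24]
```

Even the $30 listing has positive expected profit under these assumptions; it just earns it from rare stockout events.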

p.s. Sales of the boxed set of Harry Potter show a similar pattern.

It Pays to Search

28 Mar

In Reinventing the Bazaar, John McMillan discusses how search costs affect the price the buyer pays. John writes:

“Imagine that all the merchants are quoting $1[5]. Could one of them do better by undercutting this price? There is a downside to price-cutting: a reduction in revenue from any customers who would have bought from this merchant even at the higher price. If information were freely available, the price-cutter would get a compensating boost in sales as additional customers flocked in. When search costs exist, however, such extra sales may be negligible. If you incur a search cost of 10 cents or more for each merchant you sample, and there are fifty sellers offering the urn, then even if you know there is someone out there who is willing to sell it at cost, so you would save $5, it does not pay you to look for him. You would be looking for a needle in a haystack. If you visited one more seller, you would have a chance of one in fifty of that seller being the price-cutter, so the return on average from that extra price quote would be 10 cents (or $5 multiplied by 1/50), which is the same as your cost of getting one more quote. It does not pay to search.”

Reinventing the Bazaar, John McMillan

John got it wrong. It pays to search. The cost of and the expected payoff from the first quote are both 10 cents. But if the first quote is $15, the expected payoff from the second quote—(1/49)*$5, or about 10.2 cents—is greater than 10 cents. And so on.

Another way to solve it is to compute the expected number of quotes you need before you reach the seller selling at $10. It is 25.5. Given that you need to spend on average about $2.55 to get a benefit of $5, you will gladly search.

Yet another way to think about it: the worst case is that you break even, when the $10 seller is the last one you get a quote from and you spend $5 in search costs to save $5. In every other case, you come out ahead.
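A quick Monte Carlo check of the arithmetic above (the price-cutter's position in your search order is uniform over the 50 sellers, so the expected number of quotes is 25.5):

```python
import random

random.seed(0)

# 50 sellers, one sells at $10, the other 49 at $15; each quote costs
# $0.10, and finding the price-cutter saves $5.
N_SELLERS, QUOTE_COST, SAVING = 50, 0.10, 5.00

def net_benefit_of_searching():
    # Quote sellers in random order until you hit the $10 seller.
    position = random.randint(1, N_SELLERS)  # where the price-cutter sits
    return SAVING - position * QUOTE_COST

trials = [net_benefit_of_searching() for _ in range(100_000)]
avg = sum(trials) / len(trials)
print(round(avg, 2))  # close to 5 - 25.5 * 0.10 = 2.45
```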

For the equilibrium price, you need to make assumptions. But if buyers know that there is a price cutter, they will all buy from him. This means that the price cutter will be the only seller remaining.

There are two related fun points. First, one reason markets are competitive on price even when true search costs are high is likely that people price their time remarkably low. Second, when some people spend a bunch of time looking for the cheapest deal, they incentivize all the sellers selling at a high price to lower their prices, making things better for everyone else.

Faites Attention! Dealing with Inattentive and Insincere Respondents in Experiments

11 Jul

Respondents who don’t pay attention or who respond insincerely are in vogue (see the second half of the note). But how do you deal with such respondents in an experiment?

To set the context, consider a toy example. Say that you are running an experiment. And say that 10% of the respondents, in a rush to complete the survey and get the payout, don’t read the survey question that measures the dependent variable and respond to it randomly. In such cases, the treatment effect among the 10% will be centered around 0, and including them will attenuate the Average Treatment Effect (ATE).

More formally, in the subject pool, there is an ATE that is E[Y(1)] – E[Y(0)].  You randomly assign folks, and under usual conditions, they render a random sample of Y(1) or Y(0), which in expectation retrieves the ATE.  But when there is pure guessing, the guess by subject i is not centered around Y_i(1) in the treatment group or Y_i(0) in the control group.  Instead, it is centered on some other value that is altogether unresponsive to treatment. 
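
A small simulation illustrates the attenuation. Assuming, hypothetically, a true ATE of 1, unit-variance noise, and 10% of respondents guessing at random (centered on zero regardless of treatment), the difference in means recovers roughly 0.9 of the true effect:

```python
import numpy as np

def simulate_ate(n=100_000, true_ate=1.0, p_inattentive=0.10, seed=0):
    """Simulate a two-arm experiment in which a fraction of respondents
    answer the DV at random, ignoring treatment."""
    rng = np.random.default_rng(seed)
    treat = rng.integers(0, 2, n)
    attentive = rng.random(n) > p_inattentive
    real = treat * true_ate + rng.normal(0, 1, n)   # sincere response
    guess = rng.normal(0, 1, n)                     # pure guessing, unresponsive to treatment
    y = np.where(attentive, real, guess)
    return y[treat == 1].mean() - y[treat == 0].mean()

print(round(simulate_ate(), 2))  # close to 0.9, not 1.0
```

The estimate shrinks by roughly the share of guessers because their responses contribute zero treatment effect on average.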

Now that we understand the consequences of inattention, how do we deal with it?

We could deal with inattentive responding under compliance. But it is useful to separate compliance with the treatment protocol, which can be as little as picking up the phone, from the attention or sincerity with which the respondent answers the dependent variables. In a survey experiment, compliance plausibly covers both adequately, but in cases where treatment and measurement are de-coupled, e.g., happen at different times, it is vital to separate the two.

On survey experiments, I think it is reasonable to assume that:

  1. the proportion of people paying attention is the same across the control and treatment groups, and
  2. attention is uncorrelated with assignment to the control or treatment group, i.e., it is not the case that, say, men are inattentive in the treatment group while women are inattentive in the control group.

If the assumptions hold, then the worst we get is an estimate on the attentive subset (principal stratification). To get at ATE with the same research design (and if you measure attention pre-treatment), we can post-stratify after estimating the treatment effect on the attentive subset and then re-weight to account for the inattentive group. (One potential issue with the scheme is that variables used to stratify may have a fair bit of measurement error among inattentive respondents.)
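
A minimal sketch of the re-weighting scheme (function and variable names are mine): estimate the treatment effect among attentive respondents within each pre-treatment stratum, then weight the stratum-level estimates by the full-sample stratum shares:

```python
import numpy as np

def poststratified_ate(y, treat, attentive, strata):
    """Estimate the ATE on attentive respondents within each
    pre-treatment stratum, then re-weight by full-sample stratum shares."""
    y, treat, attentive, strata = map(np.asarray, (y, treat, attentive, strata))
    total, n = 0.0, len(y)
    for s in np.unique(strata):
        in_s = strata == s
        sub = in_s & attentive.astype(bool)
        ate_s = y[sub & (treat == 1)].mean() - y[sub & (treat == 0)].mean()
        total += (in_s.sum() / n) * ate_s
    return total

# Toy check: two equally sized strata with true effects 1 and 3
y = [0, 1, 0, 1, 0, 3, 0, 3]
treat = [0, 1, 0, 1, 0, 1, 0, 1]
attentive = [True] * 8
strata = [0, 0, 0, 0, 1, 1, 1, 1]
print(poststratified_ate(y, treat, attentive, strata))  # 2.0
```

Note that if the stratifying variables are themselves measured with error among inattentive respondents, the weights inherit that error—the caveat flagged above.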

The experimental way to get at attenuation would be to manipulate attention, e.g., via incentives, after the respondents have seen the treatment but before the DV measurement has begun. For instance, see this paper.

Attenuation is one thing, proper standard errors another. People responding randomly will also lead to fatter standard errors, not just because we have fewer informative respondents but because, as Ed Haertel points out (in personal communication):

  1. “The variance of the random responses could be [in fact, very likely is: GS] different [from] the variances in the compliant groups.”
  2. Even “if the variance of the random responses was zero, we’d get noise because although the proportions of random responders in the T and C groups are equal in expectation, they will generally not be exactly the same in any given experiment.”

What’s Best? Comparing Model Outputs

10 Mar

Let’s assume that you have a large portfolio of messages: n messages of k types. And say that there are n models, built by different teams, that estimate how relevant each message is to the user on a particular surface at a particular time. How would you rank order the messages by relevance, understood as the probability a person will click on the relevant substance of the message?

Isn’t the answer simply to use the max operator? Just using the max operator can be a problem because of:

a) Miscalibrated probabilities: the probabilities being output from non-linear models are not always calibrated. A probability of .9 doesn’t mean that there is a 90% chance that people will click it.

b) Prediction uncertainty: prediction uncertainty for an observation is a function of the uncertainty in the betas and the distance from the bulk of the points we have observed. If you were to draw 1,000 samples each from the estimated distributions of p, a different ordering may dominate than the one you get by comparing the means.
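
To see why comparing means can mislead, consider a hypothetical pair of messages: A has the higher estimated click probability but a wide sampling distribution; B is slightly lower but tightly estimated. Sampling from (assumed normal) distributions shows how often the mean-based ranking would flip; all numbers are made up for illustration:

```python
import numpy as np

def rank_flip_rate(mean_a=0.30, se_a=0.10, mean_b=0.28, se_b=0.02,
                   draws=10_000, seed=1):
    """Draw from the (assumed normal) sampling distributions of two
    click probabilities and count how often B beats A despite A's
    higher point estimate."""
    rng = np.random.default_rng(seed)
    a = rng.normal(mean_a, se_a, draws)
    b = rng.normal(mean_b, se_b, draws)
    return (b > a).mean()

print(rank_flip_rate())
```

In this setup B beats A in roughly four draws out of ten, so a mean-based ranking overstates the case for A.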

This isn’t the end of the problems. It could be that the models are built on data that doesn’t match the data in the real world. (To discover that, you would need to compare the expected error rate to the actual error rate.) And the only way to fix the issue is to collect new data and build new models on it.

Comparing messages based on their propensity to be clicked is unsatisfactory. A smarter comparison would optimize for profit, ideally over the long term. Moving from clicks to profits requires reframing. Profits need not come only from clicks. People don’t always need to click on a message to be influenced by it. They may choose to follow up at a later time. And a message may influence more than the person clicking on it. To estimate profits, thus, you cannot rely on observational data. To estimate the payoff of showing a message, which is equal to the estimated winnings minus the estimated cost, you need to learn it via an experiment. And to compare the payoffs of different messages, e.g., encouraging people to use a product more, encouraging people to share the product with another person, etc., you need to distill the payoffs into the same currency—ideally, cash.

Expertise as a Service

3 Mar

The best thing you can say about Prediction Machines, a new book by a trio of economists, is that it is not barren. Most of the growth you see is about the obvious: the big gain from ML is our ability to predict better, and better predictions will change some businesses. For instance, Amazon will be able to move from shopping-and-then-shipping to shipping-and-then-shopping—you return what you don’t want—if it can forecast what its customers want well enough. Or, airport lounges will see reduced business if we can more accurately predict the time it takes to reach the airport.

Aside from the obvious, the book has some untended shrubs. The most promising of them is that supervised algorithms can have human judgment as a label. The point is not new. Self-driving cars, for instance, use human decisions as labels—we learn braking, steering, and speed as a function of road conditions. But what if we could use expert human judgment as a label for other complex cognitive tasks? Some software already exploits this. Grammarly, for instance, uses editorial judgments to give advice about grammar and style. But there are so many other places where we could do the same. You could build educational tools that give real-time guidance on better ways of doing something. You could also use it to reduce the need for experts.

p.s. The point about exploiting the intellectual property of experts deserves more attention.

Experiments Without Control

4 Jan

Say that you are in the search engine business. And say that you have built a model that estimates how relevant an ad is based on the ‘context’: search query, previous few queries, kind of device, location, and such. Now let’s assume that for context X, the rank-ordered list of ads based on expected profit is: product A, product B, and product C. Now say that you want to estimate how effective an ad for product A is in driving the sales of product A. One conventional way to estimate this is to randomize at serve time: for context X, serve half the people an ad for product A and serve the other half no ad. But if it is true (and you can verify this) that an ad for product B doesn’t cause people to buy product A, then you can replace the ‘no ad’ control, where you are not making any money, with an ad for product B. With this, you can estimate the effectiveness of the ad for product A while sacrificing the least amount of revenue. Better yet, if it is also true that the ad for product A doesn’t cause people to buy product B, you can at the same time get an estimate of the efficacy of the ad for product B.
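
A sketch of the design, under the stated assumption that ad B does not move product-A sales (all rates and names here are hypothetical): randomize context-X impressions between ad A and ad B, and read ad A’s lift on product-A purchases off the difference in purchase rates:

```python
import numpy as np

def estimate_lift_with_ad_b_control(n=200_000, base_a=0.010,
                                    lift_a=0.005, seed=3):
    """Simulate context-X impressions: half see ad A, half see ad B.
    Since ad B is assumed not to affect product-A sales, the B arm
    serves as the control for ad A's lift on product-A purchases."""
    rng = np.random.default_rng(seed)
    arm_a = rng.random(n) < 0.5
    buy_a = np.where(arm_a,
                     rng.random(n) < base_a + lift_a,  # shown ad A
                     rng.random(n) < base_a)           # shown ad B (no effect on A)
    return buy_a[arm_a].mean() - buy_a[~arm_a].mean()

print(estimate_lift_with_ad_b_control())
```

The estimate lands close to the true lift, and the B arm earns ad revenue that a pure ‘no ad’ holdout would forgo.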

The Benefit of Targeting

16 Dec

What is the benefit of targeting? Why (and when) do we need experiments to estimate the benefits of targeting? And what is the right baseline to compare against?

I start with a business casual explanation, using examples to illustrate some of the issues at hand. Later in the note, I present a formal explanation to precisely describe the assumptions to clarify under what conditions targeting may be a reasonable thing to do.

Business Casual

Say that you have some TVs to sell. And say that you could show an ad about the TVs to everyone in the city for free. Your goal is to sell as many TVs as possible. Does it make sense for you to build a model to pick out the people who would be especially likely to buy a TV and only show the ad to them? No, it doesn’t. Unless ads make people less likely to purchase TVs, you are always better off reaching out to everyone.

You are wise. You use common sense to sell more TVs than the guy who spent a bunch of money building a model and sold less. You make tons of money. And you use the money to buy Honda and Mercedes dealerships. You still retain the magical power of being able to show ads to everyone for free. Your goal is to maximize profits. And selling a Mercedes nets you more profit than a Honda. Should you use a model to show some people ads about Mercedes and other people ads about Honda? The answer is still no. Under assumptions likely to hold, the optimal strategy is to show an ad for Mercedes first and then an ad for Honda. (You can show the Honda ad first if people who want to buy a Mercedes won’t buy a cheaper car just because they see an ad for one first.)

But what if you are limited to only one ad? What would you do? In that case, a model may make sense. Let’s see how things may look with some fake data. Let’s compare the outcomes of four strategies: two model-based targeting strategies and two target-everyone-with-one-ad strategies. To make things easier, let’s assume that selling a Mercedes nets ten units of profit and selling a Honda nets five units of profit. Let’s also assume that people will only buy something if they see an ad for their preferred product.
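
Under those assumptions (ten units of profit per Mercedes, five per Honda, purchase only if shown the preferred ad, one ad per person), a toy calculation with a hypothetical population shows why a model can pay off when you are limited to one ad:

```python
def profits(prefs, strategy):
    """prefs: each person's preferred car ('mercedes', 'honda', or None).
    strategy: maps (person index, preference) to the single ad they see.
    A person buys only if shown an ad for their preferred product."""
    payoff = {"mercedes": 10, "honda": 5}
    total = 0
    for i, pref in enumerate(prefs):
        if pref is not None and strategy(i, pref) == pref:
            total += payoff[pref]
    return total

# Hypothetical population: 20 Mercedes fans, 60 Honda fans, 20 non-buyers.
prefs = ["mercedes"] * 20 + ["honda"] * 60 + [None] * 20

everyone_mercedes = profits(prefs, lambda i, p: "mercedes")  # 20 * 10 = 200
everyone_honda = profits(prefs, lambda i, p: "honda")        # 60 * 5 = 300
oracle_model = profits(prefs, lambda i, p: p)                # 200 + 300 = 500
print(everyone_mercedes, everyone_honda, oracle_model)
```

With this split, the better target-everyone strategy nets 300, while a perfect model (used here only as an upper bound; real models predict preferences with error) nets 500.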

Continue reading here (pdf).

The Value of Predicting Bad Things

30 Oct

Foreknowledge of bad things is useful because it gives us an opportunity to a. prevent them, and b. plan for them.

Let’s refine our intuitions with a couple of concrete examples.

Many companies work super hard to predict customer ‘churn’: which customers will stop using a product over a specific period (which can be the entire lifetime). If you know in advance who is going to churn, you can: a. work to prevent it, b. make better investment decisions based on expected cash flow, and c. make better resource allocation decisions.

Users “churn” because they don’t think the product is worth the price, which may be because a) they haven’t figured out a way to use the product optimally, b) a better product has come along, or c) their circumstances have changed. You can deal with this by sweetening the deal. You can prevent users from abandoning your product by offering them discounts. (It is useful to experiment to learn the precise demand elasticity at various predicted levels of churn.) You can also give discounts in the form of offering some premium features for free. And among people who don’t use the product much, you can run campaigns to help them use the product more effectively.

If you can predict cash flow, you can optimally trade off risk so that you always have cash at hand to meet your obligations. Churn predictions can also help you with resource allocation. They can mean that you need to temporarily hire more customer success managers. Or they can mean that you need to lay off some people.

The second example is from patient care. If you can predict reasonably well that someone will be seriously sick in a year’s time (and you can in many cases), you can use that to prioritize patient care, and again to plan investments (if you are an insurance company) and resources (if you are a health services company).

Lastly, as is obvious, the earlier you learn, the better you can plan. But generally, you need to trade off noise in prediction against headstart—things further away are harder to predict. The noise-headstart trade-off should be made thoughtfully and amended based on data.

Optimal Sequence in Which to Service Orders

27 Jul

What is the optimal order in which to service orders assuming a fixed budget?

Let’s assume that we have to service orders o_1, …, o_n, with the n orders indexed by i. Let’s also assume that for each service order, we know how the costs change over time. For simplicity, let’s assume that time is discrete and partitioned into units of days. If we service order o_i at time t, we expect the cost to be c_it. Each service order also has an expiration time, j, after which the order cannot be serviced. The cost at the expiration time, j, is the cost of failure and is denoted by c_ij.

The optimal sequence in which to service orders is determined by expected losses: service first the order for which the expected loss from deferral is greatest. This leaves us with the question of how to estimate the expected loss at time t. To come up with an expectation, we need to sum over a probability distribution. For each later day t' from t+1 to j, we need the probability, p_it', that we would service o_i on day t'. And then we need to multiply each p_it' by the cost on that day, c_it'. So framed, the expected loss from deferring order i at time t =
\Sigma_{t'=t+1}^{j} p_it' * c_it' – c_it

However, determining p_it is not straightforward. New items are added to the queue at t+1. On the flip side, we also get to re-prioritize at t+1. The question is whether we will get to item o_i at t+1 (which makes p_it either 0 or 1). For that, we need to forecast the kinds of items in the queue tomorrow. One simplification is to assume that the items in the queue today are the same ones that will be in the queue tomorrow. Then the problem reduces to estimating the cost of punting each item again tomorrow, sorting based on the costs at t+1, and checking whether we will get to clear the item. (We can forgo the simplification by forecasting our queue tomorrow, and each day after that till j for each item, and calculating the costs.)
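
Under that simplification (tomorrow’s queue matches today’s), the greedy rule can be sketched as follows; the data layout, the per-day capacity standing in for the fixed budget, and the function name are my own assumptions:

```python
def schedule(orders, capacity):
    """orders: dict name -> list of daily costs c_t for t = 0..j, where the
    last entry is the failure cost at expiry. capacity: orders we can
    service per day. Greedy rule: each day, service the orders whose cost
    of being punted to tomorrow is highest."""
    remaining = dict(orders)
    plan, day, total_cost = [], 0, 0
    while remaining:
        def punt_cost(name):
            # cost tomorrow, or the failure cost if the order expires
            c = remaining[name]
            return c[min(day + 1, len(c) - 1)]
        todays = sorted(remaining, key=punt_cost, reverse=True)[:capacity]
        for name in todays:
            c = remaining.pop(name)
            total_cost += c[min(day, len(c) - 1)]
            plan.append((day, name))
        day += 1
    return plan, total_cost

# Toy check: order "a" gets expensive tomorrow, "b" stays cheap,
# and we can only service one order per day, so "a" goes first.
plan, total_cost = schedule({"a": [1, 5], "b": [1, 1]}, capacity=1)
print(plan, total_cost)
```

A fuller version would re-forecast the queue each day rather than reuse today’s, as the parenthetical above suggests.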

If the data are available, we can tack on the clearing time per order and get a better answer to whether we will clear o_i at time t or not.