## Preference for Sons in the US: Evidence from Business Names

24 Nov

I estimate the preference for passing on businesses to sons by examining how common the words son and sons are compared to daughter and daughters in the names of businesses.

In the US, all businesses have to register with a state. And all states provide a way to search business names, in part so that new companies can pick names that haven’t been used before.

I begin by searching for son(s) and daughter(s) in states’ databases of business names. But the results of searching for son are inflated for three reasons:

• son is part of many English words, from names such as Jason and Robinson to ordinary English words like mason (which can also be a name).
• son is a Korean name.
• some businesses use the word son playfully. For instance, son is a homophone of sun, and some people use that to create names like son of a beach.

I address the first concern by using a regex that only looks at words that exactly match son or sons. But not all states allow for regex searches or allow people to download a full set of results. Where possible, I try to draw a lower bound. But still some care is needed in interpreting the results.
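A minimal sketch of the whole-word match described above, assuming the names are already available as a list of strings. The business names below are made up for illustration, and the pattern is one plausible choice rather than the exact regex used in the linked repository.

```python
import re

# \b is a word boundary, so "sons?" matches only the standalone words
# "son" and "sons", not "Jason", "Robinson", or "mason".
SON = re.compile(r"\bsons?\b", re.IGNORECASE)
DAUGHTER = re.compile(r"\bdaughters?\b", re.IGNORECASE)


def count_matches(names, pattern):
    """Count names that contain at least one whole-word match."""
    return sum(1 for name in names if pattern.search(name))


# Hypothetical business names, for illustration only.
names = [
    "Smith & Sons Plumbing",
    "Jason's Auto Repair",
    "Robinson Masonry",
    "Miller and Daughters Bakery",
    "Son of a Beach Surf Shop",
]

n_son = count_matches(names, SON)
n_daughter = count_matches(names, DAUGHTER)
```

Note that the playful "Son of a Beach" still matches. The regex only addresses the first of the three concerns.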

Data and Scripts: https://github.com/soodoku/sonny_side

In all, I find that a conservative estimate of the son to daughter ratio ranges from 4 to 1 to 26 to 1 across states.

## Learning From the Future with Fixed Effects

6 Nov

Say that you want to predict wait times at restaurants using data with four columns: wait times (wait), the restaurant name (restaurant), time and date of observation. Using the time and date of the observation, you create two additional columns: time of the day (tod) and day of the week (dow). And say that you estimate the following model:

$\text{wait} \sim \text{restaurant} + \text{tod} + \text{dow} + \epsilon$

Assume that the number of rows is about 100 times the number of columns. There is little chance of overfitting. But you still do an 80/20 train/test split and pick the model that works the best OOS.

You have every right to expect the model’s performance in production to be close to its OOS performance. But when you deploy the model, it performs much worse than that. What could be going on?

In the model, we estimate a restaurant-level intercept. But in estimating the intercept, we use data from all wait times, including those that happened after the observation being predicted. A random 80/20 split does not catch this because the test rows are interleaved in time with the training rows. One fix is to use rolling averages or the last X wait times in the regression. Another is to more formally construct the data in such a way that you are always predicting the next wait time.
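One way to sketch the first fix: replace the restaurant intercept with a feature computed only from strictly earlier observations. The function name and the `(timestamp, restaurant, wait)` row layout are assumptions made for this illustration, and an expanding mean stands in for whatever rolling window suits the data.

```python
from collections import defaultdict


def past_only_restaurant_means(rows):
    """For each row (in time order), return the mean wait at that
    restaurant over strictly earlier rows, or None with no history.

    rows: iterable of (timestamp, restaurant, wait) tuples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    features = []
    for timestamp, restaurant, wait in sorted(rows):
        if counts[restaurant]:
            features.append(totals[restaurant] / counts[restaurant])
        else:
            features.append(None)
        # Update the running sums only after the feature is computed,
        # so the current wait never leaks into its own predictor.
        totals[restaurant] += wait
        counts[restaurant] += 1
    return features


# Hypothetical observations, for illustration only.
rows = [
    (1, "A", 10.0),
    (2, "B", 30.0),
    (3, "A", 20.0),
    (4, "A", 30.0),
    (5, "B", 10.0),
]
features = past_only_restaurant_means(rows)
```

The same discipline applies to evaluation: hold out the most recent observations rather than a random 20%.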

## Rehabilitating Forward Stepwise Regression

6 Nov

Forward Stepwise Regression (FSR) is hardly used today. That is mostly because regularization is a better way to think about variable selection. But part of the reason for its disuse is that FSR is a greedy optimization strategy with unstable paths. Jigger the data a little and the search paths, the variables in the final set, and the performance of the final model can all change dramatically. The same issues, however, affect another greedy optimization strategy: CART. The insight that rehabilitated CART was bagging: build multiple trees on randomly sampled rows (and, in random forests, on random subspaces of the features) and average the results. What works for CART should, in principle, also work for FSR. If you are using FSR for prediction, you can build multiple FSR models using random subspaces and random samples of rows and then average the results. If you are using it for variable selection, you can pick variables with the highest batting average (n_selected/n_tried). (LASSO will beat it on speed, but there is little reason to expect that it will beat it on results.)
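A rough sketch of the variable-selection variant, in plain Python and under simplifying assumptions: each model sees a bootstrap sample of rows and a random subset of columns, forward selection stops after a fixed `k` steps, and the intercept is handled by centering. All function names and the simulated data are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def forward_stepwise(columns, y, k):
    """Greedily add up to k columns, each time taking the one that most
    reduces the residual sum of squares. Inputs are assumed centered."""
    resid = list(y)
    cols = {name: list(col) for name, col in columns.items()}
    selected = []
    for _ in range(k):
        best, best_gain = None, 0.0
        for name, col in cols.items():
            norm = dot(col, col)
            if norm > 1e-12:
                gain = dot(resid, col) ** 2 / norm  # drop in RSS if added
                if gain > best_gain:
                    best, best_gain = name, gain
        if best is None:
            break
        q = cols.pop(best)
        qq = dot(q, q)
        # Partial the chosen column out of the residual and the remaining
        # candidates so that later gains are conditional on it.
        b = dot(resid, q) / qq
        resid = [r - b * v for r, v in zip(resid, q)]
        for name in list(cols):
            c = dot(cols[name], q) / qq
            cols[name] = [a - c * v for a, v in zip(cols[name], q)]
        selected.append(best)
    return selected


def batting_averages(columns, y, k, n_models, subspace_size, seed=0):
    """Return n_selected / n_tried for each column across bagged FSR runs."""
    rng = random.Random(seed)
    names = list(columns)
    n = len(y)
    tried = {name: 0 for name in names}
    picked = {name: 0 for name in names}
    for _ in range(n_models):
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap rows
        subset = rng.sample(names, subspace_size)     # random subspace
        y_b = center([y[i] for i in rows])
        x_b = {name: center([columns[name][i] for i in rows]) for name in subset}
        for name in subset:
            tried[name] += 1
        for name in forward_stepwise(x_b, y_b, k):
            picked[name] += 1
    return {name: picked[name] / tried[name] for name in names if tried[name]}


# Hypothetical data: only x0 drives y; x1..x4 are pure noise.
rng = random.Random(1)
n = 200
data = {f"x{j}": [rng.gauss(0, 1) for _ in range(n)] for j in range(5)}
y = [3 * x + rng.gauss(0, 0.5) for x in data["x0"]]
averages = batting_averages(data, y, k=1, n_models=50, subspace_size=3)
```

In this toy setup, `x0` should be picked nearly every time it is offered, while the noise columns are picked only when `x0` is left out of the subspace.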

## Faites Attention! Dealing with Inattentive and Insincere Respondents in Experiments

11 Jul

Respondents who don’t pay attention or respond insincerely are in vogue (see the second half of the note). But how do you deal with such respondents in an experiment?

To set the context, a toy example. Say that you are running an experiment. And say that 10% of the respondents, in a rush to complete the survey and get the payout, don’t read the survey question that measures the dependent variable and respond randomly to it. In such cases, the treatment effect among the 10% will be centered around 0. And including the 10% would attenuate the Average Treatment Effect (ATE).

More formally, in the subject pool, there is an ATE that is E[Y(1)] – E[Y(0)].  You randomly assign folks, and under usual conditions, they render a random sample of Y(1) or Y(0), which in expectation retrieves the ATE.  But when there is pure guessing, the guess by subject i is not centered around Y_i(1) in the treatment group or Y_i(0) in the control group.  Instead, it is centered on some other value that is altogether unresponsive to treatment.
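A back-of-the-envelope version of the attenuation, under two assumptions that are not in the text above: a share $\pi$ of respondents guess in both arms, and the guesses have the same mean in treatment and control. Writing $\tau_a$ for the ATE among the attentive and $\bar{Y}_T$, $\bar{Y}_C$ for the observed group means (all notation introduced here for illustration):

$E[\bar{Y}_T - \bar{Y}_C] = (1 - \pi)\,\tau_a + \pi \cdot 0 = (1 - \pi)\,\tau_a$

With $\pi = 0.1$, as in the toy example, the difference in means recovers $0.9\,\tau_a$ in expectation, so dividing the estimate by $1 - \pi$ would undo the attenuation if $\pi$ were known.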

Now that we understand the consequences of inattention, how do we deal with it?

We could deal with inattentive responding under compliance, but it is useful to separate compliance with the treatment protocol, which can be as simple as picking up the phone, from the attention or sincerity with which the respondent answers the questions that measure the dependent variables. In a survey experiment, compliance plausibly covers both. But in cases where treatment and measurement are de-coupled, e.g., happen at different times, it is vital to separate the two.

On survey experiments, I think it is reasonable to assume that:

1. the proportion of people paying attention is the same across the control and treatment groups, and
2. there is no correlation between who pays attention and assignment to the control or treatment group, e.g., it is not the case that men are inattentive in the treatment group and women in the control group.

If the assumptions hold, then the worst we get is an estimate on the attentive subset (principal stratification). To get at ATE with the same research design (and if you measure attention pre-treatment), we can post-stratify after estimating the treatment effect on the attentive subset and then re-weight to account for the inattentive group.

The experimental way to get at attenuation would be to manipulate attention, e.g., via incentives, after the respondents have seen the treatment but before the DV measurement has begun. For instance, see this paper.

Attenuation is one thing, proper standard errors another. People responding randomly will also lead to fatter standard errors, not just because we have fewer respondents but because, as Ed Haertel points out (in personal communication):

1. “The variance of the random responses could be [in fact, very likely is: GS] different [from] the variances in the compliant groups.”
2. Even “if the variance of the random responses was zero, we’d get noise because although the proportions of random responders in the T and C groups are equal in expectation, they will generally not be exactly the same in any given experiment.”

## The Declining Value of Personal Advice

27 Jun

There used to be a time when, before buying something, you asked your friends and peers for advice, and it was the optimal thing to do. These days, it is often not a great use of time. It is generally better to go online. Today, the Internet abounds with comprehensive, detailed, and trustworthy information, and picking the best product, judging by its quality, price, appearance, or what have you, in a slew of categories is easy to do.

As goes for advice about products, so goes for much other advice. For instance, if a coding error stumps you, your first move should be to search StackOverflow rather than Slack a peer. If you don’t understand a technical concept, look for a YouTube video, a helpful blog, or a book rather than “leverage” a peer.

The fundamental point is that it is easier to get high-quality data and expert advice today than it has ever been. If your network includes the expert, bless you! But if it doesn’t, your network no longer damns you to sub-optimal information and advice. And that likely has welcome consequences for equality.

The declining value of interpersonal advice has one significant negative externality. It takes away a big way in which we have provided value to our loved ones. We need to think harder about how we can fill that gap.

## Maximal Persuasion

21 Jun

Say that you want to persuade a group of people to go out and vote. You can reach people by phone, mail, f2f, or email. And the cost of reaching out f2f > phone > mail > email. Your objective is to convert as many people as possible. How would you do it?

Thompson sampling provides one answer. Thompson sampling “randomly allocates subjects to treatment arms according to their probability of returning the highest reward under a Bayesian posterior.”

To exploit it, start by predicting persuasion (or persuasion/\$) based on whatever you know about the person, and assignment to treatment or control. Conventionally, this means using a random forest model to estimate heterogeneous treatment effects but really use whatever gets you the best fit after including interactions in the inputs. (Make sure you get calibrated probabilities back.) Use the forecasted probabilities to find the treatment arm with the highest reward and probabilistically assign people to that.

Here’s the fun part: the strategy also accounts for compliance. The kinds of people who don’t ‘comply’ with one method, e.g., don’t pick up the phone, will be likelier to be assigned to another method.
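A stripped-down sketch of the assignment step, assuming a Beta-Bernoulli model with one posterior per arm and no covariates. That is a much simpler setup than the covariate-based forecasts described above, and it ignores cost. The persuasion rates are made up for illustration.

```python
import random


def thompson_assign(successes, failures, rng):
    """Draw once from each arm's Beta posterior (uniform Beta(1, 1) prior
    assumed) and assign the person to the arm with the largest draw."""
    draws = {
        arm: rng.betavariate(1 + successes[arm], 1 + failures[arm])
        for arm in successes
    }
    return max(draws, key=draws.get)


def run(true_rates, n_people, seed=0):
    """Simulate sequential assignment and outcome collection."""
    rng = random.Random(seed)
    successes = {arm: 0 for arm in true_rates}
    failures = {arm: 0 for arm in true_rates}
    for _ in range(n_people):
        arm = thompson_assign(successes, failures, rng)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


# Hypothetical persuasion rates per arm, for illustration only.
true_rates = {"f2f": 0.30, "phone": 0.10, "mail": 0.05, "email": 0.02}
successes, failures = run(true_rates, n_people=2000)
assigned = {arm: successes[arm] + failures[arm] for arm in true_rates}
```

To optimize persuasion per dollar instead, one option is to divide each posterior draw by the arm's cost before taking the maximum.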

## Deliberation as Tautology

18 Jun

We take deliberation to be elevated discussion, meaning at minimum, discussion that is (1) substantive, (2) inclusive, (3) responsive, and (4) open-minded. That is, (1) the participants exchange relevant arguments and information. (2) The arguments and information are wide-ranging in nature and policy implications—not all of one kind, not all on one side. (3) The participants react to each other’s arguments and information. And (4) they seriously (re)consider, in light of the discussion, what their own policy attitudes should be.

Deliberative Distortions?

One way to define deliberation would be: “the extent to which the discussion is substantive, inclusive, responsive, and open-minded.” But here, we state the top-end of each as the minimum criteria. So defined, deliberation runs into two issues:

1. Its posited beneficent effects become a near tautology. If the discussion meets that high bar, how could it not refine preferences?

2. The bar for what counts as deliberation is high enough that I doubt that most deliberative mini-publics come anywhere close to meeting the ideal.

## The Value of Bad Models

18 Jun

This is not a note about George Box’s quote about models. Neither is it about explainability. The first is trite. And the second is a mug’s game.

Imagine the following: you get hundreds of emails a day, and someone must manually sort which emails are urgent and which are not. The process is time-consuming. So you want to build a model. You estimate that a model with an error rate of 5% or less will save time—the additional work from addressing the erroneous five will be outweighed by the “free” correct classification of the other 95.

Say that you build a model. And if you dichotomize at p = .5, the model accurately classifies 70% of all emails. Even though the accuracy is less than 95%, should we put the model in production?

Often, the answer is yes. When you put such a model in production, it generally saves effort right away. Here’s how. If you get people to (continue to) manually classify the emails that the model is uncertain about, say those with predicted probabilities between .3 and .7, the accuracy of the model on the rest of the rows is generally vastly higher. More generally, you can choose the cut-offs for which humans need to code in a way that reduces the error to an acceptable level. And then use a hybrid approach to capitalize on the savings and, like Matthew 22:21, render to the model the region where the model does well, and to humans the rest.
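A small sketch of the hybrid routing, assuming the model returns a predicted probability that an email is urgent. The function name, cut-offs, and the ten toy emails are all made up for illustration.

```python
def hybrid_summary(probs, labels, low=0.3, high=0.7):
    """Summarize a model-plus-human split.

    Emails with p <= low or p >= high are classified by the model;
    those with low < p < high are routed to humans.
    """
    auto = [(p, y) for p, y in zip(probs, labels) if p <= low or p >= high]
    n_auto = len(auto)
    # The model calls an email urgent when p >= high, not urgent when p <= low.
    correct = sum(1 for p, y in auto if (p >= high) == bool(y))
    return {
        "share_automated": n_auto / len(probs),
        "accuracy_automated": correct / n_auto if n_auto else None,
    }


# Hypothetical predicted probabilities and true labels (1 = urgent).
probs = [0.05, 0.1, 0.2, 0.35, 0.45, 0.55, 0.6, 0.8, 0.9, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 0]

# Accuracy when every email is dichotomized at p = .5.
overall_accuracy = sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels)) / len(probs)
summary = hybrid_summary(probs, labels)
```

In this toy data, overall accuracy is 70%, but the model is right on 5 of the 6 emails it is confident about, and humans handle the remaining 4. Widening or narrowing the band trades human effort against the error rate in the automated region.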

3 Jun

Marketers love engagement ladders. To increase engagement with a product, many companies segment their users based on usage, for instance, into heavy (super), medium (average), and light, and prod their users to climb the ladder by suggesting they do things that people in the segment above them are doing and which they aren’t doing (as frequently).

At first blush, it sounds reasonable, even obvious. The trouble with the seemingly obvious, however, is that a) it gives the illusion of understanding, which prevents us from thinking carefully (because there is nothing more to understand!), and b) it doesn’t always make sense.

Let’s start by assuming that the ladder metaphor makes sense. The only thing that we need to do is to implement it correctly.

The ladder metaphor is built on the idea of stable rungs. If the classification into “light”, “medium”, and “heavy” is not durable (for instance, if someone classified as “heavy” can move to “light” next month of their own accord), what we learn by comparing “heavy” users to “medium” users may prove deleterious for the “medium” users.

Thus, it is useful to have stable rungs. To build stable rungs, start by assessing the stability of rungs by building transition matrices over time. If the rungs are not durable over time frames over which you want to see an effect, bolster them by extending the observation time over which usage is measured or using multiple measures. For instance, if usage over the last month does not produce durable rungs, it may be because usage is heavily seasonal. To fix that, switch to usage over multiple months or a seasonally adjusted number.
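A minimal sketch of a one-period transition matrix, assuming each user has a rung label in two consecutive periods. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def transition_matrix(before, after, rungs=("light", "medium", "heavy")):
    """Return row-normalized transition shares.

    matrix[a][b] is the share of users on rung a in the first period
    who are on rung b in the second period.
    """
    counts = Counter(zip(before, after))
    matrix = {}
    for a in rungs:
        row_total = sum(counts[(a, b)] for b in rungs)
        matrix[a] = {
            b: (counts[(a, b)] / row_total if row_total else 0.0)
            for b in rungs
        }
    return matrix


# Hypothetical rung labels for eight users in two consecutive months.
before = ["light", "light", "medium", "medium", "heavy", "heavy", "heavy", "heavy"]
after = ["light", "medium", "medium", "medium", "heavy", "heavy", "light", "medium"]
matrix = transition_matrix(before, after)
```

Diagonal entries well below 1, such as the 0.5 for "heavy" here, suggest that the rung is not durable over the chosen window.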

Once you have stable rungs, the next task is to come up with a set of actions that marketers can encourage users to take. The popular method to arbitrate between potential actions is to regress adjacent rungs on the set of potential actions and find the ones that are most highly correlated or have the highest beta. The popular method may seem reasonable but it isn’t. Assume away causality and you still care about how useful, actionable, and easy a recommended action is. The highest beta doesn’t mean the lowest cost per incremental improvement (again, assuming away causal concerns and taking betas at face value). And there is no way to address such concerns without experimenting and finding out what works best. (The message that works the best is a sum of the action being recommended and how that action is being encouraged.)

There is one minor nuance to the above. It pays to have ‘no action’ as an action if ‘no action’ isn’t your control group. Usage-based sorting merely sorts the users by kinds of people—by people who don’t need to use the product more often than thrice a month versus those who do. Who are we to say that they need to use the product more? Fact is that often enough the correlation between usage and retention is small. And doing nothing may prove better than annoying people with unwanted emails.

Lastly, the ladder metaphor leads some to believe that we need to stand up the same ladder for everyone. Using the highest beta or the most effective treatment means recommending the same (best) action to everyone. This is what I call the ‘mail merge’ heuristic. Mail merge is plausibly very highly correlated with the usage of MS-Word. But it would be an utter disaster if MSFT recommended it to me—I plan to quit the MSFT ecosystem if it comes to pass. Ideally, we want to encourage people to cross rungs by using more things in the software that are useful for them. (In fact, it isn’t clear how else we can induce a user to use the software more.) You can learn different ladders by modeling heterogeneity in treatment effects and then use simple algebra to find the best one for each person.

## Why do We Fail? And What to do About It?

28 May

I recently read Gawande’s The Checklist Manifesto. (You can read my review of the book here and my notes on the book here.) The book made me think harder about failure and how to prevent it. Here’s a result of that thinking.

We fail because we don’t know or because we don’t execute on what we know (Gorovitz and MacIntyre). Among the things that we don’t know are things that no one else knows either; they are beyond humanity’s reach for now. Ignore those for now. This leaves us with things that “we” know but the practitioner doesn’t.

Practitioners do not know because the education system has failed them, because they don’t care to learn, or because the production of new knowledge outpaces their capacity to learn. Given that, you can reduce ignorance by a) increasing the length of training, b) improving the quality of training, c) setting up continued education, d) incentivizing knowledge acquisition, e) reducing the burden of how much to know by creating specializations, etc. On creating specialties, Gawande has a great example: “there are pediatric anesthesiologists, cardiac anesthesiologists, obstetric anesthesiologists, neurosurgical anesthesiologists, …”

Ignorance, however, ought not to damn the practitioner to error. If you know that you don’t know, you can learn. Ignorance, thus, is not a sufficient condition for failure. But ignorance of ignorance is. To fix overconfidence, leading people through provocative, personalized examples may prove useful.

Ignorance and ignorance about ignorance are but two of the three reasons why we fail. We also fail because we don’t execute on what we know. Practitioners fail to apply what they know because they are distracted, lazy, have limited attention and memory, etc. To solve these issues, we can a) reduce distractions, b) provide memory aids, c) automate tasks, d) train people on the importance of thoroughness, e) incentivize thoroughness, etc.

Checklists are one way to work toward two inter-related aims: educating people about the necessary steps needed to make a decision and aiding memory. But awareness of steps is not enough. To incentivize people to follow the steps, you need to develop processes to hold people accountable. Audits are one way to do that. Meetings set up at appropriate times, during which people go through the list, are another way.