The paper (pdf) makes the case that the primary reason for electoral cycles in dissents is priming. The paper notes three competing explanations: 1) caseload composition, 2) panel composition, and 3) volume of caseloads. And it “rules them out” by regressing case type, panel composition, and caseload on quarters from the election (see Appendix Table D). The coefficients are uniformly small and insignificant. But is that enough to rule out alternative explanations? No. Small coefficients don’t imply that there is no path from proximity to the election via competing mediators to dissent (if you were to use causal language): a small effect of proximity on a mediator can still matter if the mediator has a large effect on dissent. We can only conclude that the pathway doesn’t exist if there is a sharp null. The best you can do is bound the estimated effect.
I estimate preference for passing on businesses to sons by examining how common the words son and sons are compared to daughter and daughters in the names of businesses.
In the US, all businesses have to register with a state. And all states provide a way to search business names, in part so that new companies can pick names that haven’t been used before.
I begin by searching for son and daughter in states’ databases of business names. But the results of searching son are inflated for three reasons:
- son is part of many English words, from names such as Robinson to ordinary English words like mason (which can also be a name).
- son is a Korean name.
- some businesses use the word son playfully. For instance, son is a homophone of sun and some people use that to create names like son of a beach.
I address the first concern by using a regex that only looks at whole words that exactly match son or sons. But not all states allow for regex searches or allow people to download a full set of results. Where possible, I try to draw a lower bound. But still some care is needed in interpreting the results.
Data and Scripts: https://github.com/soodoku/sonny_side
In all, I find that a conservative estimate of the son to daughter ratio ranges from 4 to 1 to 26 to 1 across states.
Say that you want to predict wait times at restaurants using data with four columns: wait times (wait), the restaurant name (restaurant), time and date of observation. Using the time and date of the observation, you create two additional columns: time of the day (tod) and day of the week (dow). And say that you estimate the following model:
Assume that the number of rows is about 100 times the number of columns. There is little chance of overfitting. But you still do an 80/20 train/test split and pick the model that works the best OOS.
You have every right to expect the model’s performance in production to be close to its OOS performance. But when you deploy the model, it performs much worse than that. What could be going on?
In the model, we estimate a restaurant-level intercept. But in estimating the intercept, we use data from all wait times, including those that happened after the observation we are predicting. In production, those future wait times are not available. One fix is to use rolling averages or the last X wait times in the regression. Another is to more formally construct the data in such a way that you are always predicting the next wait time.
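Here is a minimal sketch of the first fix, assuming the data are (restaurant, timestamp, wait) tuples. The function name `prior_mean_wait` and the toy rows are made up for illustration. For each observation, the feature is the mean of that restaurant's earlier waits, so nothing from the future leaks in.

```python
from collections import defaultdict


def prior_mean_wait(rows):
    """For each (restaurant, timestamp, wait) row, return the mean of that
    restaurant's waits seen earlier in time (ties broken by input order).

    Rows with no earlier observation get None. Output order matches input.
    """
    # Process rows in time order so that we only ever look backwards.
    order = sorted(range(len(rows)), key=lambda i: rows[i][1])
    total = defaultdict(float)  # running sum of past waits per restaurant
    count = defaultdict(int)    # running count of past waits per restaurant
    feature = [None] * len(rows)
    for i in order:
        restaurant, _, wait = rows[i]
        if count[restaurant] > 0:
            feature[i] = total[restaurant] / count[restaurant]
        # Update the running totals only after the feature is computed.
        total[restaurant] += wait
        count[restaurant] += 1
    return feature


# Hypothetical toy data: (restaurant, timestamp, wait in minutes)
rows = [
    ("A", 1, 10.0),
    ("A", 2, 20.0),
    ("B", 2, 5.0),
    ("A", 3, 30.0),
    ("B", 4, 15.0),
]
print(prior_mean_wait(rows))  # [None, 10.0, None, 15.0, 5.0]
```

The same idea applies to the train/test split: split on time rather than at random so that the test rows always come after the training rows.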
Forward Stepwise Regression (FSR) is hardly used today. That is mostly because regularization is a better way to think about variable selection. But part of the reason for its disuse is that FSR is a greedy optimization strategy with unstable paths. Jigger the data a little and the search path, the variables in the final set, and the performance of the final model can all change dramatically. The same issues, however, affect another greedy optimization strategy: CART. The insight that rehabilitated CART was bagging: build multiple trees on randomly sampled rows (and, in random forests, on random subspaces of the columns) and average the results. What works for CART should, in principle, also work for FSR. If you are using FSR for prediction, you can build multiple FSR models using random subspaces and random samples of rows and then average the results. If you are using it for variable selection, you can pick variables with the highest batting average (n_selected/n_tried). (LASSO will beat it on speed but there is little reason to expect that it will beat it on results.)
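A rough sketch of the variable selection variant, assuming plain OLS with an intercept and residual sum of squares as the selection criterion. The function names, the defaults, and the simulated data are mine and only illustrate the idea; they are not a tuned implementation.

```python
import numpy as np


def forward_stepwise(X, y, max_vars):
    """Greedy forward selection: at each step, add the column that most
    reduces the residual sum of squares of an OLS fit with an intercept."""
    n, p = X.shape
    selected = []
    for _ in range(min(max_vars, p)):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ beta) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected


def bagged_fsr(X, y, n_models=200, n_cols=3, max_vars=2, seed=0):
    """Run FSR on bootstrapped rows and random column subspaces and
    return each column's batting average (n_selected / n_tried)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    tried = np.zeros(p)
    picked = np.zeros(p)
    for _ in range(n_models):
        rows = rng.integers(0, n, size=n)                 # bootstrap rows
        cols = rng.choice(p, size=n_cols, replace=False)  # random subspace
        chosen = forward_stepwise(X[np.ix_(rows, cols)], y[rows], max_vars)
        tried[cols] += 1
        picked[cols[chosen]] += 1  # map subspace positions back to columns
    return picked / np.maximum(tried, 1)


# Simulated data (an assumption): only columns 0 and 1 drive y.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=300)
batting = bagged_fsr(X, y)
print(np.round(batting, 2))
```

Because each run is forced to pick `max_vars` variables, noise columns still get picked when too few real predictors land in the subspace. What matters is the gap in batting averages between the real predictors and the rest.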
Respondents who don’t pay attention or respond insincerely are in vogue (see the second half of the note). But how do you deal with such respondents in an experiment?
To set the context, a toy example. Say that you are running an experiment. And say that 10% of the respondents, in a rush to complete the survey and get the payout, don’t read the survey question that measures the dependent variable and respond randomly to it. In such cases, the treatment effect among the 10% will be centered around 0. And including the 10% would attenuate the Average Treatment Effect (ATE).
More formally, in the subject pool, there is an ATE that is E[Y(1)] – E[Y(0)]. You randomly assign folks, and under usual conditions, they render a random sample of Y(1) or Y(0), which in expectation retrieves the ATE. But when there is pure guessing, the guess by subject i is not centered around Y_i(1) in the treatment group or Y_i(0) in the control group. Instead, it is centered on some other value that is altogether unresponsive to treatment.
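To make the attenuation concrete, here is a small simulation. Everything about the data-generating process is an assumption for illustration: a constant treatment effect of 1, standard normal noise for attentive respondents, and guesses drawn from a normal distribution that ignores treatment. With 10% guessing, the difference in means should land near 0.9 rather than 1.

```python
import random
from statistics import mean


def simulate_ate(n=200_000, true_ate=1.0, share_inattentive=0.1, seed=0):
    """Difference in means when a share of respondents answer the
    dependent variable at random, regardless of treatment."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        is_treated = rng.random() < 0.5
        is_inattentive = rng.random() < share_inattentive
        if is_inattentive:
            # Assumption: the guess is centered on a value unrelated to treatment.
            y = rng.gauss(0.5, 2.0)
        else:
            y = rng.gauss(0.0, 1.0) + (true_ate if is_treated else 0.0)
        (treated if is_treated else control).append(y)
    return mean(treated) - mean(control)


clean = simulate_ate(share_inattentive=0.0)
attenuated = simulate_ate(share_inattentive=0.1)
print(round(clean, 2), round(attenuated, 2))
```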
Now that we understand the consequences of inattention, how do we deal with it?
We could deal with inattentive responding under compliance, but it is useful to separate compliance with the treatment protocol, which can be just picking up the phone, from the attention or sincerity with which the respondent responds to the dependent variables. In a survey experiment, compliance plausibly covers both, but in cases where treatment and measurement are decoupled, e.g., happen at different times, it is vital to separate the two.
On survey experiments, I think it is reasonable to assume that:
- the proportion of people paying attention is the same across the control and treatment groups, and
- there is no correlation between who pays attention and assignment to the control or treatment group, e.g., it is not the case that men are inattentive in the treatment group and women in the control group.
If the assumptions hold, then the worst we get is an estimate on the attentive subset (principal stratification). To get at ATE with the same research design (and if you measure attention pre-treatment), we can post-stratify after estimating the treatment effect on the attentive subset and then re-weight to account for the inattentive group.
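One way to sketch the re-weighting step, under a strong assumption that I am adding: within a stratum, the effect among the attentive stands in for the effect among the inattentive. The record layout, the function name, and the toy numbers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def poststratified_ate(records):
    """records are (stratum, attentive, treated, y) tuples.

    Estimate the treatment effect within each stratum using attentive
    respondents only, then weight each stratum by its share of ALL
    respondents (attentive or not). Every stratum needs at least one
    attentive treated and one attentive control respondent.
    """
    n_all = defaultdict(int)
    y_treated = defaultdict(list)
    y_control = defaultdict(list)
    for stratum, attentive, treated, y in records:
        n_all[stratum] += 1
        if attentive:
            (y_treated if treated else y_control)[stratum].append(y)
    total = sum(n_all.values())
    ate = 0.0
    for stratum, n in n_all.items():
        effect = mean(y_treated[stratum]) - mean(y_control[stratum])
        ate += (n / total) * effect
    return ate


# Toy data: women are 60% of the sample but most of them are inattentive.
records = [
    ("men", True, True, 3.0), ("men", True, True, 5.0),
    ("men", True, False, 1.0), ("men", True, False, 1.0),
    ("women", True, True, 2.0), ("women", True, False, 1.0),
    ("women", False, True, 9.0), ("women", False, False, 0.0),
    ("women", False, True, 4.0), ("women", False, False, 7.0),
]
# Stratum effects are 3 (men) and 1 (women); weights are 0.4 and 0.6.
print(round(poststratified_ate(records), 2))  # 1.8
```

In the toy data, the unweighted estimate on the attentive subset would lean toward men, who make up most of the attentive respondents. The re-weighted estimate uses the full-sample shares instead.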
The experimental way to get at attenuation would be to manipulate attention, e.g., via incentives, after the respondents have seen the treatment but before the DV measurement has begun. For instance, see this paper.
Attenuation is one thing, proper standard errors another. People responding randomly will also lead to fatter standard errors, not just because we have fewer respondents but also because, as Ed Haertel points out (in personal communication):
- “The variance of the random responses could be [in fact, very likely is: GS] different [from] the variances in the compliant groups.”
- Even “if the variance of the random responses was zero, we’d get noise because although the proportions of random responders in the T and C groups are equal in expectation, they will generally not be exactly the same in any given experiment.”
There used to be a time when, before buying something, you asked your friends and peers for advice, and it was the optimal thing to do. These days, it is often not a great use of time. It is generally better to go online. Today, the Internet abounds with comprehensive, detailed, and trustworthy information, and picking the best product, judging by its quality, price, appearance, or what have you, in a slew of categories is easy to do.
As goes for advice about products, so goes for much other advice. For instance, if a coding error stumps you, your first move should be to search StackOverflow rather than Slack a peer. If you don’t understand a technical concept, look for a YouTube video or a helpful blog or a book rather than “leverage” a peer.
The fundamental point is that it is easier to get high-quality data and expert advice today than it has ever been. If your network includes the expert, bless you! But if it doesn’t, your network no longer damns you to sub-optimal information and advice. And that likely has welcome consequences for equality.
The only cases where advice from people near you may edge ahead of readily available help online are those where the advisor has access to private information about your case or where the advisor is willing to expend greater elbow grease to get to the facts and think of advice that aptly takes account of your special circumstances. For instance, you may be able to get good advice on how to deal with alcoholic parents from an expert online but probably not about alcoholic parents with the specific set of deficiencies that your parents have. Short of such cases, the value of advice from people around you is lower today than before, and probably lower than what you can get online.
The declining value of interpersonal advice has one significant negative externality. It takes out a big way we have provided value to our loved ones. We need to think harder about how we can fill that gap.
Say that you want to persuade a group of people to go out and vote. You can reach people by phone, mail, f2f, or email. And the cost of reaching out f2f > phone > mail > email. Your objective is to convert as many people as possible. How would you do it?
Thompson sampling provides one answer. Thompson sampling “randomly allocates subjects to treatment arms according to their probability of returning the highest reward under a Bayesian posterior.”
To exploit it, start by predicting persuasion (or persuasion/$) based on whatever you know about the person, and assignment to treatment or control. Conventionally, this means using a random forest model to estimate heterogeneous treatment effects but really use whatever gets you the best fit after including interactions in the inputs. (Make sure you get calibrated probabilities back.) Use the forecasted probabilities to find the treatment arm with the highest reward and probabilistically assign people to that.
Here’s the fun part: the strategy also accounts for compliance. The kinds of people who don’t ‘comply’ with one method, e.g., don’t pick up the phone, will be likelier to be assigned to another method.
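A stripped-down sketch of Thompson sampling, assuming a Beta-Bernoulli setup with one success rate per arm. It ignores covariates and cost, so it is simpler than the model-based version described above. The conversion rates are made up.

```python
import random


def thompson_sampling(true_rates, n_rounds=5000, seed=0):
    """Beta-Bernoulli Thompson sampling. Returns how often each arm was used."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(n_rounds):
        # Draw one sample per arm from its Beta(1 + s, 1 + f) posterior.
        draws = [
            rng.betavariate(1 + successes[a], 1 + failures[a]) for a in range(k)
        ]
        # Assign the next person to the arm with the highest draw.
        arm = max(range(k), key=lambda a: draws[a])
        pulls[arm] += 1
        # Simulate the outcome (in practice, this is observed).
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


# Hypothetical conversion rates for each way of reaching people.
rates = {"email": 0.02, "mail": 0.04, "phone": 0.08, "f2f": 0.15}
pulls = thompson_sampling(list(rates.values()))
print(dict(zip(rates, pulls)))
```

To optimize persuasion per dollar, one option is to define the reward as conversions divided by the cost of the arm. A Bernoulli posterior no longer fits as is in that case, so treat the above as a starting point.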
We take deliberation to be elevated discussion, meaning at minimum, discussion that is (1) substantive, (2) inclusive, (3) responsive, and (4) open-minded. That is, (1) the participants exchange relevant arguments and information. (2) The arguments and information are wide-ranging in nature and policy implications—not all of one kind, not all on one side. (3) The participants react to each other’s arguments and information. And (4) they seriously (re)consider, in light of the discussion, what their own policy attitudes should be. (Deliberative Distortions?)
One way to define deliberation would be: “the extent to which the discussion is substantive, inclusive, responsive, and open-minded.” But here, we state the top-end of each as the minimum criteria. So defined, deliberation runs into two issues:
1. Its posited beneficent effects become a near tautology. If the discussion meets that high bar, how could it not refine preferences?
2. The bar for what counts as deliberation is high enough that I doubt that most deliberative mini-publics come anywhere close to meeting the ideal.
This is not a note about George Box’s quote about models. Neither is it about explainability. The first is trite. And the second is a mug’s game.
Imagine the following: you get hundreds of emails a day, and someone must manually sort which emails are urgent and which are not. The process is time-consuming. So you want to build a model. You estimate that a model with an error rate of 5% or less will save time—the additional work from addressing the erroneous five will be outweighed by the “free” correct classification of the other 95.
Say that you build a model. And if you dichotomize at p = .5, the model accurately classifies 70% of all emails. Even though the accuracy is less than 95%, should we put the model in production?
Often, the answer is yes. When you put such a model in production, it generally saves effort right away. Here’s how. If you get people to (continue to) manually classify the emails that the model is uncertain about, say with predicted probabilities between .3 and .7, the accuracy of the model on the rest of the rows is generally vastly higher. More generally, you can choose the cut-offs for which humans need to code in a way that reduces the error to an acceptable level. And then use a hybrid approach to capitalize on the savings and, like Matthew 22:21, render to the model the region where the model does well, and to humans the rest.
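A small sketch of the accounting, assuming you have predicted probabilities and true labels for a validation set. The function name and the ten toy rows are invented. The rows are chosen so that accuracy at p = .5 is 70%.

```python
def hybrid_report(probs, labels, low=0.3, high=0.7):
    """Compare accuracy on all rows with accuracy on the rows the model
    keeps when humans handle everything with low <= p <= high."""
    def accuracy(pairs):
        return sum((p > 0.5) == bool(y) for p, y in pairs) / len(pairs)

    pairs = list(zip(probs, labels))
    # Rows the model is confident about; the rest go to humans.
    auto = [(p, y) for p, y in pairs if p < low or p > high]
    return {
        "overall_accuracy": accuracy(pairs),
        "coverage": len(auto) / len(pairs),
        "auto_accuracy": accuracy(auto) if auto else None,
    }


# Toy validation data (an assumption): predicted P(urgent) and true labels.
probs = [0.05, 0.2, 0.35, 0.45, 0.55, 0.65, 0.8, 0.9, 0.95, 0.1]
labels = [0, 0, 1, 0, 0, 0, 1, 1, 1, 0]
report = hybrid_report(probs, labels)
print(report)
# {'overall_accuracy': 0.7, 'coverage': 0.6, 'auto_accuracy': 1.0}
```

In practice, you would sweep `low` and `high` over a grid and pick the widest automated region whose accuracy still clears the bar (here, 95%).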
Marketers love engagement ladders. To increase engagement with a product, many companies segment their users based on usage, for instance, into heavy (super), medium (average), and light, and prod their users to climb the ladder by suggesting they do things that people in the segment above them are doing and which they aren’t doing (as frequently).
At first blush, it sounds reasonable, even obvious. The trouble with the seemingly obvious, however, is that a) it gives the illusion of understanding, which prevents us from thinking carefully (because there is nothing more to understand!), and b) it doesn’t always make sense.
Let’s start by assuming that the ladder metaphor makes sense. The only thing that we need to do is to implement it correctly.
The ladder metaphor is built on the idea of stable rungs. If the classification into “light”, “medium”, and “heavy” is not durable (for instance, if someone classified as “heavy” can move to “light” next month of their own accord), what we learn by comparing “heavy” users to “medium” users may prove deleterious for the “medium” users.
Thus, it is useful to have stable rungs. To build stable rungs, start by assessing the stability of rungs by building transition matrices over time. If the rungs are not durable over time frames over which you want to see an effect, bolster them by extending the observation time over which usage is measured or using multiple measures. For instance, if usage over the last month does not produce durable rungs, it may be because usage is heavily seasonal. To fix that, switch to usage over multiple months or a seasonally adjusted number.
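A minimal sketch of a transition matrix, assuming each user has a rung label in two consecutive periods. The user data are made up. The diagonal tells you how durable each rung is.

```python
from collections import Counter


def transition_matrix(before, after, rungs=("light", "medium", "heavy")):
    """Row-normalized transition shares: matrix[a][b] is the share of users
    on rung a in the first period who are on rung b in the second."""
    counts = Counter(zip(before, after))
    matrix = {}
    for a in rungs:
        row_total = sum(counts[(a, b)] for b in rungs)
        matrix[a] = {
            b: (counts[(a, b)] / row_total if row_total else 0.0) for b in rungs
        }
    return matrix


# Hypothetical rungs for eight users in two consecutive months.
before = ["light", "light", "medium", "medium", "heavy", "heavy", "heavy", "heavy"]
after = ["light", "medium", "medium", "light", "heavy", "heavy", "light", "medium"]
matrix = transition_matrix(before, after)
for rung, row in matrix.items():
    print(rung, row)
# In this toy data, only half of the heavy users are still heavy a month later.
```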
Once you have stable rungs, the next task is to come up with a set of actions that marketers can encourage users to take. The popular method to arbitrate between potential actions is to regress adjacent rungs on the set of potential actions and find the ones that are most highly correlated or have the highest beta. The popular method may seem reasonable but it isn’t. Assume away causality and you still care about how useful, actionable, and easy a recommended action is. The highest beta doesn’t mean the lowest cost per incremental improvement (again, assuming away causal concerns and taking betas at face value). And there is no way to address such concerns without experimenting and finding out what works best. (The message that works the best is a sum of the action being recommended and how that action is being encouraged.)
There is one minor nuance to the above. It pays to have ‘no action’ as an action if ‘no action’ isn’t your control group. Usage-based sorting merely sorts the users by kinds of people: those who don’t need to use the product more often than thrice a month versus those who do. Who are we to say that they need to use the product more? The fact is that often enough the correlation between usage and retention is small. And doing nothing may prove better than annoying people with unwanted emails.
Lastly, the ladder metaphor leads some to believe that we need to stand up the same ladder for everyone. Using the highest beta or the most effective treatment means recommending the same (best) action to everyone. This is what I call the ‘mail merge’ heuristic. Mail merge is plausibly very highly correlated with the usage of MS-Word. But it would be an utter disaster if MSFT recommended it to me—I plan to quit the MSFT ecosystem if it comes to pass. Ideally, we want to encourage people to cross rungs by using more things in the software that are useful for them. (In fact, it isn’t clear how else we can induce a user to use the software more.) You can learn different ladders by modeling heterogeneity in treatment effects and then use simple algebra to find the best one for each person.
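The last step can be sketched as follows, assuming you already have a model that predicts the effect of each recommendation for each user. The users, actions, and numbers are invented. ‘No action’ is included with an effect of zero so that it wins when every recommendation is predicted to hurt.

```python
def best_action_per_user(predicted_uplift):
    """Pick, for each user, the action with the highest predicted effect,
    falling back to 'no action' when nothing beats doing nothing."""
    best = {}
    for user, uplifts in predicted_uplift.items():
        options = {"no action": 0.0, **uplifts}
        best[user] = max(options, key=options.get)
    return best


# Hypothetical predicted effects on usage from a heterogeneous effects model.
predicted_uplift = {
    "u1": {"mail merge": 0.30, "track changes": 0.10},
    "u2": {"mail merge": -0.20, "track changes": 0.05},
    "u3": {"mail merge": -0.10, "track changes": -0.02},
}
print(best_action_per_user(predicted_uplift))
# {'u1': 'mail merge', 'u2': 'track changes', 'u3': 'no action'}
```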