Faites Attention! Dealing with Inattentive and Insincere Respondents in Experiments

11 Jul

Respondents who don’t pay attention or respond insincerely are in vogue (see the second half of the note). But how do you deal with such respondents in an experiment?

To set the context, a toy example. Say that you are running an experiment. And say that 10% of the respondents, in a rush to complete the survey and get the payout, don’t read the survey question that measures the dependent variable and respond randomly to it. In such cases, the treatment effect among the 10% will be centered around 0. And including the 10% would attenuate the Average Treatment Effect (ATE).

More formally, in the subject pool, there is an ATE that is E[Y(1)] – E[Y(0)].  You randomly assign folks, and under usual conditions, they render a random sample of Y(1) or Y(0), which in expectation retrieves the ATE.  But when there is pure guessing, the guess by subject i is not centered around Y_i(1) in the treatment group or Y_i(0) in the control group.  Instead, it is centered on some other value that is altogether unresponsive to treatment. 
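A quick simulation makes the attenuation concrete. This is a sketch with made-up numbers (a true effect of 0.5 and 10% random responders), not an analysis of any particular study:

```python
# Illustrative simulation: 10% of respondents answer the DV at random,
# which attenuates the estimated ATE toward zero.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
true_ate = 0.5

treat = rng.integers(0, 2, n)                # random assignment
attentive = rng.random(n) < 0.9              # 90% attentive
y = rng.normal(0, 1, n) + true_ate * treat   # attentive responses
y_random = rng.normal(0, 1, n)               # guesses, unresponsive to treatment
y_obs = np.where(attentive, y, y_random)

ate_all = y_obs[treat == 1].mean() - y_obs[treat == 0].mean()
ate_attentive = (y_obs[(treat == 1) & attentive].mean()
                 - y_obs[(treat == 0) & attentive].mean())
print(f"ATE using everyone:      {ate_all:.3f}")        # ~ 0.45
print(f"ATE among the attentive: {ate_attentive:.3f}")  # ~ 0.50
```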

Now that we understand the consequences of inattention, how do we deal with it?

We could fold inattentive responding into compliance, but it is useful to separate compliance with the treatment protocol, which can be as minimal as picking up the phone, from the attention or sincerity with which the respondent answers the dependent variables. In a survey experiment, compliance plausibly covers both, but in cases where treatment and measurement are de-coupled, e.g., when they happen at different times, it is vital to separate the two.

On survey experiments, I think it is reasonable to assume that:

  1. the proportion of people paying attention is the same across the control and treatment groups, and
  2. who pays attention is uncorrelated with assignment to the control or treatment group, e.g., it is not the case that men are inattentive in the treatment group and women in the control group.

If the assumptions hold, then the worst we get is an estimate on the attentive subset (principal stratification). To get at the ATE with the same research design (and if you measure attention pre-treatment), we can post-stratify: estimate the treatment effect on the attentive subset and then re-weight to account for the inattentive group.
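One way to read the re-weighting step, as a sketch: estimate the effect within the attentive subset, stratum by stratum, and then weight the stratum effects by their shares in the full sample. This assumes attention is measured pre-treatment and that the attentive effect within a stratum stands in for that stratum's effect; the column names are hypothetical.

```python
import pandas as pd

def post_stratified_ate(df, stratum_col="stratum"):
    # Stratum shares come from the full sample (attentive + inattentive).
    shares = df[stratum_col].value_counts(normalize=True)
    # Stratum-level effects are estimated on the attentive subset only.
    attentive = df[df["attentive"]]
    effects = attentive.groupby(stratum_col).apply(
        lambda g: g.loc[g["treat"] == 1, "y"].mean()
                  - g.loc[g["treat"] == 0, "y"].mean()
    )
    # Weighted average of stratum effects, weighted by full-sample shares.
    return (effects * shares.reindex(effects.index)).sum()
```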

The experimental way to get at attenuation would be to manipulate attention, e.g., via incentives, after the respondents have seen the treatment but before the DV measurement has begun. For instance, see this paper.

Attenuation is one thing, proper standard errors another. People responding randomly will also lead to fatter standard errors, not just because we effectively have fewer informative respondents but because, as Ed Haertel points out (in personal communication):

  1. “The variance of the random responses could be [in fact, very likely is: GS] different [from] the variances in the compliant groups.”
  2. Even “if the variance of the random responses was zero, we’d get noise because although the proportions of random responders in the T and C groups are equal in expectation, they will generally not be exactly the same in any given experiment.”

Maximal Persuasion

21 Jun

Say that you want to persuade a group of people to go out and vote. You can reach people by phone, mail, f2f, or email. And the cost of reaching out is f2f > phone > mail > email. Your objective is to convert as many people as possible. How would you do it?

Thompson sampling provides one answer. Thompson sampling “randomly allocates subjects to treatment arms according to their probability of returning the highest reward under a Bayesian posterior.”

To exploit it, start by predicting persuasion (or persuasion per dollar) based on whatever you know about the person and their assigned treatment arm. Conventionally, this means using a random forest to estimate heterogeneous treatment effects, but really use whatever gets you the best fit after including interactions in the inputs. (Make sure you get calibrated probabilities back.) Use the forecasted probabilities to find the treatment arm with the highest reward and probabilistically assign people to it.
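Here is a bare-bones sketch of the allocation step with Beta posteriors, one arm per outreach channel and reward measured as persuasion per dollar. A production version would condition on covariates (e.g., via the heterogeneous-effects model above); the costs below are illustrative:

```python
# Thompson sampling sketch: sample a plausible reward per arm from its
# posterior and assign the person to the arm with the highest draw.
import numpy as np

rng = np.random.default_rng(0)
arms = ["f2f", "phone", "mail", "email"]
cost = {"f2f": 20.0, "phone": 5.0, "mail": 1.0, "email": 0.05}  # illustrative
alpha = {a: 1.0 for a in arms}   # Beta prior: successes
beta = {a: 1.0 for a in arms}    # Beta prior: failures

def choose_arm():
    # Draw a persuasion rate from each arm's posterior, scale by cost,
    # and pick the arm whose draw is highest.
    draws = {a: rng.beta(alpha[a], beta[a]) / cost[a] for a in arms}
    return max(draws, key=draws.get)

def update(arm, persuaded):
    # Record the observed outcome for the chosen arm.
    if persuaded:
        alpha[arm] += 1
    else:
        beta[arm] += 1

# Usage: arm = choose_arm(); ...contact the person...; update(arm, outcome)
```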

Here’s the fun part: the strategy also accounts for compliance. The kinds of people who don’t ‘comply’ with one method, e.g., don’t pick up the phone, will be likelier to be assigned to another method.

The Value of Bad Models

18 Jun

This is not a note about George Box’s quote about models. Neither is it about explainability. The first is trite. And the second is a mug’s game.

Imagine the following: you get hundreds of emails a day, and someone must manually sort which emails are urgent and which are not. The process is time-consuming. So you want to build a model. You estimate that a model with an error rate of 5% or less will save time—the additional work from addressing the erroneous five will be outweighed by the “free” correct classification of the other 95.

Say that you build a model. And if you dichotomize at p = .5, the model accurately classifies 70% of all emails. Even though the accuracy is less than 95%, should we put the model in production?

Often, the answer is yes. When you put such a model in production, it generally saves effort right away. Here's how. If you get people to (continue to) manually classify the emails that the model is uncertain about, say those with predicted probabilities between .3 and .7, the accuracy of the model on the rest of the rows is generally vastly higher. More generally, you can choose the cut-offs for which humans need to code in a way that reduces the error to an acceptable level. Then use a hybrid approach to capitalize on the savings and, like Matthew 22:21, render to the model the region where the model does well, and to humans the rest.
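A minimal sketch of that hybrid routing, assuming the model emits calibrated probabilities; the .3/.7 cut-offs mirror the example above and should be tuned to hit the target error rate:

```python
# Route each email to the model's label outside the uncertain band,
# and to a human inside it.
import numpy as np

def route(probs, low=0.3, high=0.7):
    """Return 'urgent', 'not_urgent', or 'human' for each predicted probability."""
    probs = np.asarray(probs)
    labels = np.where(probs >= high, "urgent", "not_urgent")
    labels = np.where((probs > low) & (probs < high), "human", labels)
    return labels

# Example: only the middling score goes to a human.
print(route([0.05, 0.45, 0.92]))   # ['not_urgent' 'human' 'urgent']
```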

Snakes on Ladders: Encouraging People to Climb the Engagement Ladder

3 Jun

Marketers love engagement ladders. To increase engagement with a product, many companies segment their users based on usage, for instance, into heavy (super), medium (average), and light, and prod their users to climb the ladder by suggesting they do things that people in the segment above them are doing and which they aren’t doing (as frequently).

At first blush, it sounds reasonable, even obvious. The trouble with the seemingly obvious, however, is that a) it gives the illusion of understanding, which prevents us from thinking carefully (because there is nothing more to understand!), and b) it doesn’t always make sense.

Let’s start by assuming that the ladder metaphor makes sense. The only thing that we need to do is to implement it correctly.

The ladder metaphor is built on the idea of stable rungs. If the classification into "light", "medium", and "heavy" is not durable (for instance, if someone classified as "heavy" can move to "light" next month of their own accord), what we learn by comparing "heavy" users to "medium" users may prove deleterious for the "medium" users.

Thus, it is useful to have stable rungs. Start by assessing the stability of the rungs by building transition matrices over time. If the rungs are not durable over the time frames over which you want to see an effect, bolster them by extending the observation window over which usage is measured or by using multiple measures. For instance, if usage over the last month does not produce durable rungs, it may be because usage is heavily seasonal. To fix that, switch to usage over multiple months or to a seasonally adjusted number.
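A sketch of that stability check, assuming a long table of user-month-segment rows (the column names are hypothetical): stable rungs show up as a heavy diagonal in the row-normalized transition matrix.

```python
# Build a month-over-month transition matrix of segment membership.
import pandas as pd

def transition_matrix(df, user="user_id", period="month", rung="segment"):
    wide = df.pivot(index=user, columns=period, values=rung)
    months = sorted(wide.columns)
    pairs = []
    for prev, nxt in zip(months[:-1], months[1:]):
        pairs.append(wide[[prev, nxt]].dropna().set_axis(["from", "to"], axis=1))
    pairs = pd.concat(pairs)
    # Row-normalized: P(rung next period | rung this period)
    return pd.crosstab(pairs["from"], pairs["to"], normalize="index")
```

Lots of off-diagonal mass says the segmentation window needs to be longer or seasonally adjusted.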

Once you have stable rungs, the next task is to come up with a set of actions that marketers can encourage users to take. The popular method to arbitrate between potential actions is to regress adjacent rungs on the set of potential actions and find the ones that are most highly correlated or have the highest beta. The popular method may seem reasonable, but it isn't. Assume away causality and you still care about how useful, actionable, and easy a recommended action is. The highest beta doesn't mean the lowest cost per incremental improvement (again, assuming away causal concerns and taking betas at face value). And there is no way to address such concerns without experimenting and finding out what works best. (The message that works best is a sum of the action being recommended and how that action is being encouraged.)

There is one minor nuance to the above. It pays to have 'no action' as an action if 'no action' isn't your control group. Usage-based sorting merely sorts the users into kinds of people: those who don't need to use the product more often than thrice a month and those who do. Who are we to say that they need to use the product more? The fact is that, often enough, the correlation between usage and retention is small. And doing nothing may prove better than annoying people with unwanted emails.

Lastly, the ladder metaphor leads some to believe that we need to stand up the same ladder for everyone. Using the highest beta or the most effective treatment means recommending the same (best) action to everyone. This is what I call the ‘mail merge’ heuristic. Mail merge is plausibly very highly correlated with the usage of MS-Word. But it would be an utter disaster if MSFT recommended it to me—I plan to quit the MSFT ecosystem if it comes to pass. Ideally, we want to encourage people to cross rungs by using more things in the software that are useful for them. (In fact, it isn’t clear how else we can induce a user to use the software more.) You can learn different ladders by modeling heterogeneity in treatment effects and then use simple algebra to find the best one for each person.

Wanted: Effects That Support My Hypothesis

8 May

Do survey respondents account for the hypothesis that they think people fielding the survey have when they respond? The answer, according to Mummolo and Peterson, is not much.

Their paper also very likely provides the reason why: people don't pay much attention. Figure 3 provides data on manipulation checks, i.e., the proportion correctly guessing the hypothesis being tested. The change in proportion between control and treatment ranges from -.05 to .25, with the bulk of the changes in the Qualtrics sample between 0 and .1. (In one condition, the authors even offer an additional 25 cents to give a result consistent with the hypothesis. And presumably, people need to know the hypothesis before they can answer in line with it.) The faint increase is especially noteworthy given that, on average, the proportion of people in the control group who guess the hypothesis correctly (without the guessing correction) is between .25 and .35 (see Appendix B; pdf).

So, the big thing we may have learned from the data is how little attention survey respondents pay. The numbers obtained here are similar to those in Appendix D of Jonathan Woon’s paper (pdf). The point is humbling and suggests that we need to: a) invest more in measurement, and b) have yet larger samples, which is an expensive way to overcome measurement error—a point Gelman has made before.

There is also the point about the worthiness of including 'manipulation checks.' Experiments tell us the ATE of what we manipulate. The role of manipulation checks is to shed light on 'compliance.' If conveying experimenter demand clearly and loudly was the goal, then the experiments included probably failed. If the purpose was to know whether clear but not very loud cues about 'demand' matter (and for what it's worth, I think that is a very reasonable goal; pushing further, in my mind, would have reduced the experiment to a tautology), the paper provides the answer.

Interview with InfoQ

26 Apr

I recently gave an interview to InfoQ about my paper (and associated open source software) on predicting the race and ethnicity of a person using the sequence of characters in a name.

Here is a relevant excerpt:

InfoQ: Can you discuss how we can learn from names? What ML/DL algorithms can we use?

Gaurav Sood:  Learning more about a person from their name is no different from tackling any other supervised ML problem. It all starts with getting (or creating) a large labeled corpus. For instance, one key innovation in ethnicolr is the training data—we use voting registration files to get a large labeled corpus. In another project on learning from names, I scraped Google Image Search results to build the training data for inferring the gender from a name.

Once you have the data, find ways to exploit patterns in the data to learn a model. Some early ventures exploited the fact that names of different kinds of people began/ended differently. For instance, female names in India often end with an ‘a,’ and you can exploit that pattern to infer gender from Indian names. In ethnicolr, we generalize this intuition and use patterns in sequences of characters. (I am also working on exploiting sequences of sounds.) Like Ye et al., you could also rely on the fact that we correspond more frequently with co-ethnics and exploit email networks for building your models.

To exploit the patterns in the data, the full-range of DL/ML tools is available to you. Use what works best.
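To make the "patterns in sequences of characters" point concrete, here is a toy sketch using character n-gram features and a linear classifier. This is not ethnicolr's actual pipeline (which trains sequence models on a large voter-file corpus); the names and labels below are made up for illustration:

```python
# Toy example: infer gender from a name using character 1-3 grams.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set for illustration only.
names = ["priya", "anita", "rahul", "vikram", "sita", "arjun"]
labels = ["f", "f", "m", "m", "f", "m"]

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),  # character n-grams
    LogisticRegression(max_iter=1000),
)
model.fit(names, labels)
print(model.predict(["lata"]))   # plausibly 'f', given the 'a' ending
```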

Estimating the Trend at a Point in a Noisy Time Series

17 Apr

Trends in time series are valuable. If the cost of a product rises suddenly, it likely indicates a sudden shortfall in supply or a sudden rise in demand. If the cost of claims filed by a patient rises sharply, it plausibly suggests rapidly worsening health.

But how do we estimate the trend at a particular time in a noisy time series? The answer is simple: smooth the time series using any one of many methods, e.g., local polynomials, GAMs, or similar, and then estimate the derivative(s) of the fitted function at the chosen point in time. Smoothing out the noise is essential. If you don't smooth and instead go with a naive estimate of the derivative, it can be heavily negatively correlated with the derivative obtained from the smoothed time series. For instance, in an example we present, the correlation is -.47.

Clarification

Sometimes we want to know what the “trend” was over a particular time window. But what that means is not 100% clear. For a synopsis of the issues, see here.

Python Package

incline provides a couple of ways of approximating the underlying function for the time series:

  • fitting a local higher order polynomial via Savitzky-Golay over a window of choice
  • fitting a smoothing spline

The package provides a way to estimate the first and second derivative at any given time using either of those methods. Beyond these smarter methods, the package also provides a naive estimator of the slope: the average change when you move one step forward (step = observed time units) and one step backward. Users can also calculate the average or maximum slope over a time window (over observed time steps).
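For a sense of the approach outside the package, here is a sketch using scipy's Savitzky-Golay filter directly (not incline's API); the window length and polynomial order are illustrative, and the naive estimate is shown only for comparison:

```python
# Smooth a noisy series with a local polynomial and take its derivative.
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(1)
t = np.arange(200)
y = np.sin(t / 20) + rng.normal(0, 0.3, t.size)   # noisy series

# First derivative of the locally fitted polynomial at every point.
slope = savgol_filter(y, window_length=21, polyorder=3, deriv=1)

# Naive estimate: average of the one-step forward and backward change.
naive = np.gradient(y)
print(f"Trend at t=100: smoothed {slope[100]:.3f}, naive {naive[100]:.3f}")
```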

What Clicks With the Users? Maximizing CTR

17 Apr

Given a pool of messages, how can you maximize CTR?

The problem of maximizing CTR reduces to the problem of estimating the probability that a person in a specific context will click on each of the messages. Once you have the probabilities, all you need to do is apply the max operator and show the message with the highest probability. Technically, you don’t need to get the point estimates right—you just need to get the ranking right.
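The selection step itself is a one-liner once you have calibrated scores. In this sketch, `model` is any calibrated classifier and the feature construction is schematic (both are assumptions, not a description of a specific system):

```python
# Score every candidate message for this person and context; show the argmax.
import numpy as np

def pick_message(model, person_features, context_features, messages):
    X = [person_features + context_features + m["features"] for m in messages]
    p_click = model.predict_proba(np.array(X))[:, 1]   # calibrated P(click)
    return messages[int(np.argmax(p_click))]
```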

Abstracting out, there are four levers for increasing CTR:

  1. Better models and data: Posed as a supervised problem, we are aiming to learn clicks as a function of a) the kind of content, b) the kind of context, and c) the kinds of people. (And, of course, interactions between all three.) To learn preferences well, we need to improve our understanding of the content, the context, and the kinds of people. For instance, to understand content more finely, you may need to code font size, font color, etc.
  2. Modeling externalities (user learning): It sounds funny to say that the CTR of a system that shows no messages to some people some of the time can be better than that of a system that shows at least some message to everyone every time they log in. But it can be true. If you want to increase CTR over longer horizons, you need to be able to model the impact of showing one message on a person opening another message. If you do that, you may realize that the best option is sometimes to not show a message at all. (The other way you could 'improve' CTR is by losing people: the people you bombard with irrelevant messages may leave, and the only people who 'survive' are those who like what you send.)
  3. Experimenting With How to Present a Message: Location on the webpage, the font, etc. all may matter. Experiment to learn.
  4. Portfolio: This lets go of the fixed portfolio. Increase your portfolio of messages so that you have a reasonable set of things for everyone. It is easy enough to mistake people dismissing a message for disinterest in receiving messages. Don't make that mistake. If you want to learn where you are failing, find out which kinds of people you have the lowest (calibrated) probability scores for and think hard about what kinds of messages will appeal to them.

A/B Testing Recommendation Systems

1 Apr

Say that you are building a news recommender that lists relevant news items in each person's news feed. Say that your first version of the news recommender is a rules-based system that uses signals like how many people in your network have seen the news item, how many people in total have read it, the freshness of the news, etc., and sums up the signals in an arbitrary way to rank news items. Your second version uses the same signals but uses a supervised model to decide on the optimal weights.

Say that you find that the recommendations vary a fair bit between the two systems. But which one is better? To suss that out, you conduct an A/B test. But a naive experiment will produce biased estimates of the effect and the s.e. because:

  1. The signals on which your control group's ranking system is based are influenced by the kinds of news articles that people in the treatment group see. And vice versa.
  2. There is an additional source of stochasticity in recommendations that people see: the order in which people arrive matters.

The effect of the first concern is that our estimates are likely attenuated. To resolve the first issue, show people in the control group news articles ranked using views predicted from historical data, or using views pro-rated from the people assigned to the control group alone. (This adds a bit of noise to the control group estimates.) And keep a separate table of input data for the treatment group, applying the ML model to the pro-rated data from that table.
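A sketch of the bookkeeping that makes this work: one table of counts per arm, pro-rated up by the arm's traffic share so each arm's ranker only sees signals generated by its own users, on roughly the scale a full deployment would see. The schema is illustrative:

```python
# Keep separate, pro-rated view counts per experimental arm.
from collections import defaultdict

views = {"control": defaultdict(int), "treatment": defaultdict(int)}
arm_share = {"control": 0.5, "treatment": 0.5}   # share of traffic per arm

def record_view(arm, article_id):
    views[arm][article_id] += 1

def prorated_views(arm, article_id):
    # Scale up to approximate the counts a full-traffic deployment would see.
    return views[arm][article_id] / arm_share[arm]
```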

The consequence of the second issue is that our s.e. is very plausibly much larger than what we would get with split-world testing (each condition gets its own table of counts for views, etc.). The sequence in which people arrive matters because it intersects with social influence. To resolve the second issue, you need to estimate how the sequence of arrival affects outcomes. But given the number of pathways, the best we can probably do is bound the effect, for instance, by estimating the effect of ranking the least downloaded item first.

Siamese Networks for Record Linkage

20 Mar

For the uninitiated:

A siamese neural network consists of twin networks which accept distinct inputs but are joined by an energy function at the top. This function computes some metric between the highest level feature representation on each side. The parameters between the twin networks are tied. Weight tying guarantees that two extremely similar images could not possibly be mapped by their respective networks to very different locations in feature space because each network computes the same function.

One Shot

Replace the word images with two representations of the same record across any two tables, and you have an algorithm for producing good distance functions for efficient record linkage. Triplet loss is a natural extension. I am looking forward to seeing some bottom-line results comparing it to generic supervised approaches, which reminds me that I am unaware of any large benchmark datasets for the fundamental problem of statistical record linkage.
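For concreteness, here is a minimal PyTorch sketch of the idea: one encoder with tied weights embeds both records, the energy is the distance between the embeddings, and a contrastive loss pulls matches together. Feature extraction for the records (say, character n-gram counts) is assumed and not shown:

```python
# Siamese sketch for record linkage: shared encoder + distance "energy".
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    def __init__(self, n_features, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, dim),
        )

    def forward(self, a, b):
        # The same weights encode both records; distance is the energy.
        za, zb = self.net(a), self.net(b)
        return F.pairwise_distance(za, zb)

def contrastive_loss(distance, is_match, margin=1.0):
    # Pull matching pairs together, push non-matches beyond the margin.
    return torch.mean(is_match * distance.pow(2)
                      + (1 - is_match) * F.relu(margin - distance).pow(2))

# Usage: d = model(features_a, features_b); loss = contrastive_loss(d, labels.float())
```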