Learning From the Future with Fixed Effects

6 Nov

Say that you want to predict wait times at restaurants using data with four columns: wait time (wait), restaurant name (restaurant), and the time and date of the observation. Using the time and date of the observation, you create two additional columns: time of day (tod) and day of the week (dow). And say that you estimate the following model:

\text{wait} \sim \text{restaurant} + \text{tod} + \text{dow} + \epsilon

Assume that the number of rows is about 100 times the number of columns. There is little chance of overfitting. But you still do an 80/20 train/test split and pick the model that works the best OOS.

You have every right to expect the model’s performance to be close to its OOS performance. But when you deploy the model, the model performs much worse than that. What could be going on?

In the model, we estimate a restaurant-level intercept. But in estimating that intercept, we use all of a restaurant’s wait times, including those observed after the time we are predicting for. One fix is to use rolling averages or the last X wait times in the regression. Another is to construct the data more formally so that you are always predicting the next wait time.
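A minimal sketch of the first fix, assuming a pandas DataFrame `df` with columns `wait`, `restaurant`, and a timestamp column `ts` (all hypothetical names): compute each restaurant’s running average using only rows that precede the current one.

```python
import pandas as pd

# Assumes df has columns: restaurant, ts (timestamp), wait
df = df.sort_values(["restaurant", "ts"])

# Expanding mean of past wait times only: shift(1) drops the current row,
# so the feature never peeks at the wait time we are trying to predict.
df["past_mean_wait"] = (
    df.groupby("restaurant")["wait"]
      .transform(lambda s: s.shift(1).expanding().mean())
)

# Or: mean of the last 5 previously observed waits at the restaurant
df["last5_mean_wait"] = (
    df.groupby("restaurant")["wait"]
      .transform(lambda s: s.shift(1).rolling(5, min_periods=1).mean())
)
```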

Rehabilitating Forward Stepwise Regression

6 Nov

Forward Stepwise Regression (FSR) is hardly used today. That is mostly because regularization is a better way to think about variable selection. But part of the reason for its disuse is that FSR is a greedy optimization strategy with unstable paths. Jigger the data a little and the search path, the variables in the final set, and the performance of the final model can all change dramatically. The same issues, however, affect another greedy optimization strategy—CART. The insight that rehabilitated CART was bagging—build multiple trees using random subspaces (sometimes on randomly sampled rows) and average the results. What works for CART should, in principle, also work for FSR. If you are using FSR for prediction, you can build multiple FSR models using random subspaces and random samples of rows and then average the results. If you are using it for variable selection, you can pick the variables with the highest batting average (n_selected/n_tried). (LASSO will beat it on speed, but there is little reason to expect that it will beat it on results.)
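A minimal sketch of bagged forward selection, assuming a feature DataFrame `X` and target `y`, and using scikit-learn’s SequentialFeatureSelector as the FSR step; the number of rounds, column subsets, sample fractions, and the final “batting average” tally are illustrative choices, not a fixed recipe.

```python
import numpy as np
from collections import Counter
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_rounds, n_cols_per_round = 50, 10
selected, tried = Counter(), Counter()

for _ in range(n_rounds):
    # Random subspace (columns) and bootstrap sample (rows)
    cols = rng.choice(X.columns, size=n_cols_per_round, replace=False)
    rows = rng.choice(len(X), size=int(0.8 * len(X)), replace=True)
    sfs = SequentialFeatureSelector(
        LinearRegression(), n_features_to_select=5, direction="forward"
    )
    sfs.fit(X.iloc[rows][cols], y.iloc[rows])
    tried.update(cols)
    selected.update(cols[sfs.get_support()])

# "Batting average": how often a variable is picked when it is in the running
batting_avg = {c: selected[c] / tried[c] for c in tried}
```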

Faites Attention! Dealing with Inattentive and Insincere Respondents in Experiments

11 Jul

Respondents who don’t pay attention or respond insincerely are in vogue (see the second half of the note). But how do you deal with such respondents in an experiment?

To set the context, a toy example. Say that you are running an experiment. And say that 10% of the respondents, in a rush to complete the survey and get the payout, don’t read the survey question that measures the dependent variable and respond to it randomly. In such cases, the treatment effect among the 10% will be centered around 0, and including them will attenuate the Average Treatment Effect (ATE).

More formally, in the subject pool, there is an ATE that is E[Y(1)] – E[Y(0)].  You randomly assign folks, and under usual conditions, they render a random sample of Y(1) or Y(0), which in expectation retrieves the ATE.  But when there is pure guessing, the guess by subject i is not centered around Y_i(1) in the treatment group or Y_i(0) in the control group.  Instead, it is centered on some other value that is altogether unresponsive to treatment. 
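A minimal simulation of the toy example (all numbers illustrative): a true treatment effect of 1, with 10% of respondents answering at random, attenuates the estimated ATE by roughly 10%.

```python
import numpy as np

rng = np.random.default_rng(1)
n, true_ate, p_inattentive = 100_000, 1.0, 0.10

treat = rng.integers(0, 2, n)                  # random assignment
y = rng.normal(0, 1, n) + true_ate * treat     # attentive potential outcomes
guess = rng.random(n) < p_inattentive          # 10% respond randomly
y[guess] = rng.normal(0, 1, guess.sum())       # guesses ignore treatment

est_ate = y[treat == 1].mean() - y[treat == 0].mean()
print(est_ate)  # ~0.9, i.e., attenuated toward 0
```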

Now that we understand the consequences of inattention, how do we deal with it?

We could deal with inattentive responding under compliance, but it is useful to separate compliance with the treatment protocol, which can be as little as picking up the phone, from the attention or sincerity with which the respondent answers the dependent variables. In a survey experiment, compliance plausibly covers both, but in cases where treatment and measurement are de-coupled, e.g., happen at different times, it is vital to separate the two.

On survey experiments, I think it is reasonable to assume that:

  1. the proportion of people paying attention is the same across the control and treatment groups, and
  2. there is no correlation between who pays attention and treatment assignment, e.g., it is not the case that men are inattentive in the treatment group and women in the control group.

If the assumptions hold, then the worst we get is an estimate on the attentive subset (principal stratification). To get at the ATE with the same research design (and if you measure attention pre-treatment), we can estimate the treatment effect on the attentive subset and then post-stratify, re-weighting to account for the inattentive group. (One potential issue with the scheme is that the variables used to stratify may have a fair bit of measurement error among inattentive respondents.)

The experimental way to get at attenuation would be to manipulate attention, e.g., via incentives, after the respondents have seen the treatment but before the DV measurement has begun. For instance, see this paper.

Attenuation is one thing, proper standard errors another. People responding randomly will also lead to fatter standard errors, not just because we effectively have fewer respondents but also because, as Ed Haertel points out (in personal communication):

  1. “The variance of the random responses could be [in fact, very likely is: GS] different [from] the variances in the compliant groups.”
  2. Even “if the variance of the random responses was zero, we’d get noise because although the proportions of random responders in the T and C groups are equal in expectation, they will generally not be exactly the same in any given experiment.”

Maximal Persuasion

21 Jun

Say that you want to persuade a group of people to go out and vote. You can reach people by phone, mail, f2f, or email. And the cost of reaching out is f2f > phone > mail > email. Your objective is to convert as many people as possible. How would you do it?

Thompson sampling provides one answer. Thompson sampling “randomly allocates subjects to treatment arms according to their probability of returning the highest reward under a Bayesian posterior.”

To exploit it, start by predicting persuasion (or persuasion per dollar) as a function of whatever you know about the person and the treatment arm they were assigned to. Conventionally, this means using a random forest to estimate heterogeneous treatment effects, but really, use whatever gets you the best fit after including interactions in the inputs. (Make sure you get calibrated probabilities back.) Use the forecasted probabilities to find the treatment arm with the highest reward for each person and probabilistically assign them to it.
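A minimal Beta-Bernoulli sketch of the assignment step, assuming each outreach method is an arm with an unknown conversion rate; the per-arm costs and the reward definition (conversions per dollar) are illustrative assumptions, and the sketch drops person-level covariates to keep the mechanics of sampling from the posterior visible.

```python
import numpy as np

rng = np.random.default_rng(2)
arms = ["f2f", "phone", "mail", "email"]
cost = {"f2f": 20.0, "phone": 5.0, "mail": 1.0, "email": 0.1}  # hypothetical costs
alpha = {a: 1 for a in arms}   # Beta(1, 1) priors on conversion rates
beta = {a: 1 for a in arms}

def assign():
    # Draw one conversion rate per arm from its posterior and pick the
    # arm with the highest sampled reward (conversions per dollar).
    draws = {a: rng.beta(alpha[a], beta[a]) / cost[a] for a in arms}
    return max(draws, key=draws.get)

def update(arm, converted):
    # Posterior update after observing whether the person converted.
    alpha[arm] += converted
    beta[arm] += 1 - converted
```

In the full version, the posterior draws would condition on the person’s covariates, e.g., via a separate model per arm.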

Here’s the fun part: the strategy also accounts for compliance. The kinds of people who don’t ‘comply’ with one method, e.g., don’t pick up the phone, will be likelier to be assigned to another method.

The Value of Bad Models

18 Jun

This is not a note about George Box’s quote about models. Neither is it about explainability. The first is trite. And the second is a mug’s game.

Imagine the following: you get hundreds of emails a day, and someone must manually sort which emails are urgent and which are not. The process is time-consuming. So you want to build a model. You estimate that a model with an error rate of 5% or less will save time—the additional work from addressing the erroneous five will be outweighed by the “free” correct classification of the other 95.

Say that you build a model. And if you dichotomize at p = .5, the model accurately classifies 70% of all emails. Even though the accuracy is less than 95%, should we put the model in production?

Often, the answer is yes. When you put such a model in production, it generally saves effort right away. Here’s how. If you get people to (continue to) manually classify the emails that the model is uncertain about, say with predicted probabilities between .3 and .7, the accuracy of the model on the rest of the rows is generally vastly higher. More generally, you can choose the cut-offs for which emails humans need to code in a way that reduces the error to an acceptable level. Then use a hybrid approach to capitalize on the savings and, like Matthew 22:21, render to the model the region where the model does well, and to humans the rest.
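A minimal sketch of picking the hand-off band, assuming arrays of held-out predicted probabilities `p` and true labels `y` (names hypothetical): widen the band around .5 that gets routed to humans until the model’s accuracy outside the band clears the target.

```python
import numpy as np

def handoff_band(p, y, target_acc=0.95, step=0.05):
    """Find the narrowest symmetric band around 0.5 to route to humans so
    that model accuracy on the remaining (confident) emails >= target_acc."""
    for width in np.arange(0.0, 0.5 + step, step):
        confident = (p <= 0.5 - width) | (p >= 0.5 + width)
        if confident.sum() == 0:
            break
        acc = ((p[confident] >= 0.5) == y[confident].astype(bool)).mean()
        if acc >= target_acc:
            return 0.5 - width, 0.5 + width, confident.mean()
    return None  # no band meets the target; keep humans in the loop for all

# Example: lo, hi, share_automated = handoff_band(p_holdout, y_holdout)
```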

Snakes on Ladders: Encouraging People to Climb the Engagement Ladder

3 Jun

Marketers love engagement ladders. To increase engagement with a product, many companies segment their users based on usage, for instance, into heavy (super), medium (average), and light, and prod their users to climb the ladder by suggesting they do things that people in the segment above them are doing and which they aren’t doing (as frequently).

At first blush, it sounds reasonable, even obvious. The trouble with the seemingly obvious, however, is that a) it gives the illusion of understanding, which prevents us from thinking carefully (because there is nothing more to understand!), and b) it doesn’t always make sense.

Let’s start by assuming that the ladder metaphor makes sense. The only thing that we need to do is to implement it correctly.

The ladder metaphor is built on the idea of stable rungs. If the classification into “light”, “medium”, and “heavy” is not durable—for instance, if someone classified as “heavy” can move to “light” next month of their own accord—what we learn by comparing “heavy” users to “medium” users may prove deleterious for the “medium” users.

Thus, it is useful to have stable rungs. Start by assessing the stability of the rungs by building transition matrices over time. If the rungs are not durable over the time frames over which you want to see an effect, bolster them by extending the observation window over which usage is measured or by using multiple measures. For instance, if usage over the last month does not produce durable rungs, it may be because usage is heavily seasonal. To fix that, switch to usage over multiple months or to a seasonally adjusted number.
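A minimal sketch of the stability check, assuming a DataFrame `usage` with one row per user per month and a `segment` column in {light, medium, heavy} (names hypothetical): the row-normalized cross-tab of this month’s segment against next month’s shows how often users hold their rung.

```python
import pandas as pd

# usage: columns = user_id, month, segment
usage = usage.sort_values(["user_id", "month"])
usage["next_segment"] = usage.groupby("user_id")["segment"].shift(-1)

# Rows: current rung; columns: rung next month; cells: share of users.
transition = pd.crosstab(
    usage["segment"], usage["next_segment"], normalize="index"
)
print(transition)  # large off-diagonal mass = unstable rungs
```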

Once you have stable rungs, the next task is to come up with a set of actions that marketers can encourage users to take. The popular method to arbitrate between potential actions is to regress adjacent rungs on the set of potential actions and find the ones that are most highly correlated or have the highest beta. The popular method may seem reasonable, but it isn’t. Even assuming away causal concerns, you still care about how useful, actionable, and easy a recommended action is. The highest beta doesn’t mean the lowest cost per incremental improvement (again, assuming away causal concerns and taking betas at face value). And there is no way to address such concerns without experimenting and finding out what works best. (The message that works best is a combination of the action being recommended and how that action is being encouraged.)

There is one minor nuance to the above. It pays to have ‘no action’ as an action if ‘no action’ isn’t your control group. Usage-based sorting merely sorts the users by kinds of people—into people who don’t need to use the product more often than thrice a month and those who do. Who are we to say that they need to use the product more? The fact is that, often enough, the correlation between usage and retention is small. And doing nothing may prove better than annoying people with unwanted emails.

Lastly, the ladder metaphor leads some to believe that we need to stand up the same ladder for everyone. Using the highest beta or the most effective treatment means recommending the same (best) action to everyone. This is what I call the ‘mail merge’ heuristic. Mail merge is plausibly very highly correlated with the usage of MS-Word. But it would be an utter disaster if MSFT recommended it to me—I plan to quit the MSFT ecosystem if it comes to pass. Ideally, we want to encourage people to cross rungs by using more of the things in the software that are useful for them. (In fact, it isn’t clear how else we can induce a user to use the software more.) You can learn different ladders by modeling heterogeneity in treatment effects and then using simple algebra to find the best one for each person.

Wanted: Effects That Support My Hypothesis

8 May

Do survey respondents account for the hypothesis that they think people fielding the survey have when they respond? The answer, according to Mummolo and Peterson, is not much.

Their paper also very likely provides the reason why—people don’t pay much attention. Figure 3 provides data on manipulation checks—the proportion guessing the hypothesis being tested correctly. The change in proportion between control and treatment ranges from -.05 to .25, with the bulk of changes in the Qualtrics sample between 0 and .1. (In one condition, the authors even offer an additional 25 cents to give a response consistent with the hypothesis. And presumably, people need to know the hypothesis before they can answer in line with it.) The faint increase is especially noteworthy given that, on average, the proportion of people in the control group who guess the hypothesis correctly—without the guessing correction—is between .25 and .35 (see Appendix B; pdf).

So, the big thing we may have learned from the data is how little attention survey respondents pay. The numbers obtained here are similar to those in Appendix D of Jonathan Woon’s paper (pdf). The point is humbling and suggests that we need to: a) invest more in measurement, and b) have yet larger samples, which is an expensive way to overcome measurement error—a point Gelman has made before.

There is also the point about the worthiness of including ‘manipulation checks.’ Experiments tell us the ATE of what we manipulate. The role of manipulation checks is to shed light on ‘compliance.’ If conveying experimenter demand clearly and loudly is a goal, then the experiments included probably failed. If the purpose was to know whether clear but not very loud cues about ‘demand’ matter—and for what it’s worth, I think that is a very reasonable goal; pushing further, in my mind, would have reduced the experiment to a tautology—the paper provides the answer.

Interview with InfoQ

26 Apr

I recently gave an interview to InfoQ about my paper (and associated open source software) on predicting the race and ethnicity of a person using the sequence of characters in a name.

Here is a relevant excerpt:

InfoQ: Can you discuss how we can learn from names? What ML/DL algorithms can we use?

Gaurav Sood:  Learning more about a person from their name is no different from tackling any other supervised ML problem. It all starts with getting (or creating) a large labeled corpus. For instance, one key innovation in ethnicolr is the training data—we use voting registration files to get a large labeled corpus. In another project on learning from names, I scraped Google Image Search results to build the training data for inferring the gender from a name.

Once you have the data, find ways to exploit patterns in the data to learn a model. Some early ventures exploited the fact that names of different kinds of people began/ended differently. For instance, female names in India often end with an ‘a,’ and you can exploit that pattern to infer gender from Indian names. In ethnicolr, we generalize this intuition and use patterns in sequences of characters. (I am also working on exploiting sequences of sounds.) Like Ye et al., you could also rely on the fact that we correspond more frequently with co-ethnics and exploit email networks for building your models.

To exploit the patterns in the data, the full range of ML/DL tools is available to you. Use what works best.
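A minimal sketch of the character-sequence idea, under stated assumptions: a small, hypothetical labeled list of (name, label) pairs and a simple character n-gram model rather than the LSTM used in ethnicolr; it is meant only to show how patterns in character sequences can be exploited.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled corpus: names and a binary label (e.g., inferred gender)
names = ["priya", "anita", "rahul", "vikram", "sunita", "arjun"]
labels = [1, 1, 0, 0, 1, 0]

# Character 2-4 grams capture beginnings/endings (e.g., a trailing 'a').
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(names, labels)
print(model.predict_proba(["kavita"]))  # probability for each class
```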

Estimating the Trend at a Point in a Noisy Time Series

17 Apr

Trends in time series are valuable. If the cost of a product rises suddenly, it likely indicates a sudden shortfall in supply or a sudden rise in demand. If the cost of claims filed by a patient rises sharply, it plausibly suggests rapidly worsening health.

But how do we estimate the trend at a particular time in a noisy time series? The answer is simple: smooth the time series using any one of many methods, e.g., local polynomials or GAMs, and then estimate the derivative(s) of the function at the chosen point in time. Smoothing out the noise is essential. If you don’t smooth and instead go with a naive estimate of the derivative, it can be heavily negatively correlated with derivatives obtained from the smoothed time series. For instance, in an example we present, the correlation is –.47.

Clarification

Sometimes we want to know what the “trend” was over a particular time window. But what that means is not 100% clear. For a synopsis of the issues, see here.

Python Package

incline provides a couple of ways of approximating the underlying function for the time series:

  • fitting a local higher order polynomial via Savitzky-Golay over a window of choice
  • fitting a smoothing spline

The package provides a way to estimate the first and second derivative at any given time using either of those methods. Beyond these smarter methods, the package also provides a naive estimator of the slope—the average change when you move one step forward (step = observed time units) and one step backward. Users can also calculate the average or maximum slope over a time window (over observed time steps).
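A minimal illustration of the Savitzky-Golay route using scipy directly (not the incline API), assuming an evenly spaced noisy series with spacing 0.1; the same filter that smooths the series can return its first derivative.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(3)
t = np.arange(0, 10, 0.1)
y = np.sin(t) + rng.normal(0, 0.3, t.size)   # noisy series

# Smoothed series and its first derivative from a local cubic fit
window, poly = 21, 3
smooth = savgol_filter(y, window, poly)
deriv = savgol_filter(y, window, poly, deriv=1, delta=0.1)

# Naive slope (central difference on the raw series) for comparison
naive = np.gradient(y, 0.1)
print(np.corrcoef(deriv[window:-window], naive[window:-window])[0, 1])
```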

What Clicks With the Users? Maximizing CTR

17 Apr

Given a pool of messages, how can you maximize CTR?

The problem of maximizing CTR reduces to the problem of estimating the probability that a person in a specific context will click on each of the messages. Once you have the probabilities, all you need to do is apply the max operator and show the message with the highest probability. Technically, you don’t need to get the point estimates right—you just need to get the ranking right.
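A minimal sketch of that selection step, assuming a fitted, calibrated classifier `clf` with a `predict_proba` method and a hypothetical `featurize(user, context, message)` helper that builds one feature row per candidate message.

```python
import numpy as np

def pick_message(clf, user, context, messages, featurize):
    # One feature row per (user, context, message) candidate
    X = np.vstack([featurize(user, context, m) for m in messages])
    p_click = clf.predict_proba(X)[:, 1]       # calibrated click probabilities
    return messages[int(np.argmax(p_click))]   # only the ranking matters
```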

Abstracting out, there are four levers for increasing CTR:

  1. Better models and data: Posed as a supervised problem, we are aiming to learn clicks as a function of a) the kind of content, b) the kind of context, and c) the kinds of people. (And, of course, interactions between all three are included.) To learn preferences well, we need to improve our understanding of the content, context, and kinds of people. For instance, to understand content more finely, you may need to code font size, font color, etc.
  2. Modeling externalities (user learning): It sounds funny when you say that CTR of a system that shows no messages to some people some of the time can be better than a system that shows at least some message to everyone every time they log in. But it can be true. If you need to increase CTR over longer horizons, you need to be able to model the impact of showing one message on a person opening another message. If you do that, you may realize that the best option is to not even show a message this time. (The other way you could ‘improve’ CTR is by losing people—you may lose people you bombard with irrelevant messages and the only people who ‘survive’ are those who like what you send.)
  3. Experimenting With How to Present a Message: Location on the webpage, the font, etc. all may matter. Experiment to learn.
  4. Portfolio: This lets go of the assumption of a fixed portfolio. Increase your portfolio of messages so that you have a reasonable set of things for everyone. It is easy enough to mistake people dismissing a message for disinterest in receiving messages. Don’t make the mistake. If you want to learn where you are failing, find the kinds of people for whom you have the lowest (calibrated) probability scores and think hard about what kinds of messages will appeal to them.

A/B Testing Recommendation Systems

1 Apr

Say that you are building a news recommender that lists relevant news items in each person’s news feed. Say that your first version of the news recommender is a rules-based system that uses signals like how many people in your network have seen the news, how many people in total have read the news, the freshness of the news, etc., and sums up the signals in an arbitrary way to rank news items. Your second version uses the same signals but uses a supervised model to decide on the optimal weights.

Say that you find that the recommendations vary a fair bit between the two systems. But which one is better? To suss that out, you conduct an A/B test. But a naive experiment will produce biased estimates of the effect and its standard error (s.e.) because:

  1. The signals on which your control group’s ranking system is based are influenced by the kinds of news articles that people in the treatment group see. And vice versa.
  2. There is an additional source of stochasticity in recommendations that people see: the order in which people arrive matters.

The effect of the first concern is that our estimates are likely attenuated. To resolve it, show people in the control group news articles ranked on predicted views based on historical data, or on the pro-rated views of people assigned to the control group alone. (This adds a bit of noise to the control group estimates.) And keep a separate table of input data for the treatment group and apply the ML model to the pro-rated data from that table.

The consequence of the second issue is that our s.e. is very plausibly much larger than what we will get with split-world testing (each condition gets its own table of counts for views, etc.). The sequence in which people arrive matters because it interacts with social influence. To resolve the second issue, you need to estimate how the sequence of arrival affects outcomes. But given the number of pathways, the best we can probably do is bound the effect, for instance, by estimating the effect of ranking the least downloaded item first.

p.s. The social influence world paper doesn’t report s.e., but this paper, based on the Salganik/Watts paper, reports incorrect ones as it implicitly assumes that the sequence of arrival doesn’t matter.

Siamese Networks for Record Linkage

20 Mar

For the uninitiated:

A siamese neural network consists of twin networks which accept distinct inputs but are joined by an energy function at the top. This function computes some metric between the highest level feature representation on each side. The parameters between the twin networks are tied. Weight tying guarantees that two extremely similar images could not possibly be mapped by their respective networks to very different locations in feature space because each network computes the same function.

One Shot

Replace the word ‘images’ with two representations of the same record across any two tables, and you have an algorithm for producing good distance functions for efficient record linkage. Triplet loss is a natural extension. I am looking forward to seeing bottom-line results comparing it to generic supervised approaches, which reminds me that I am unaware of any large benchmark datasets for the fundamental problem of statistical record linkage.
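A minimal Keras sketch of the idea, under stated assumptions: each record is already encoded as a fixed-length numeric feature vector (the input dimension of 100 and the layer sizes are placeholders), and the label is 1 when two records refer to the same entity.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_encoder(input_dim, embed_dim=32):
    inp = layers.Input(shape=(input_dim,))
    x = layers.Dense(64, activation="relu")(inp)
    out = layers.Dense(embed_dim, activation="relu")(x)
    return Model(inp, out)

input_dim = 100                       # hypothetical record feature length
encoder = build_encoder(input_dim)    # shared weights = the "twin" networks

rec_a = layers.Input(shape=(input_dim,))
rec_b = layers.Input(shape=(input_dim,))
emb_a, emb_b = encoder(rec_a), encoder(rec_b)

# L1 distance between the two embeddings (the "energy function at the top")
l1 = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([emb_a, emb_b])
match_prob = layers.Dense(1, activation="sigmoid")(l1)

siamese = Model([rec_a, rec_b], match_prob)
siamese.compile(optimizer="adam", loss="binary_crossentropy")
# siamese.fit([A, B], y, ...)  # y = 1 if the two records are the same entity
```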

The Risk of Misunderstanding Risk

20 Mar

Women who participate in breast cancer screening from ages 50 to 69 live, on average, 12 more days. This is the best-case scenario. Gerd Gigerenzer has more such compelling numbers in his book, Calculated Risks. Gerd shares such numbers to launch a frontal assault on the misunderstanding of risk. His key point is:

“Overcoming innumeracy is like completing a three-step program to statistical literacy. The first step is to defeat the illusion of certainty. The second step is to learn about the actual risks of relevant events and actions. The third step is to communicate the risks in an understandable way and to draw inferences without falling prey to clouded thinking.”

Gerd’s key contributions are on the third point. Gerd identifies three problems with risk communication:

  1. using relative risk rather than Number Needed to Treat (NNT) or absolute risk,
  2. using single-event probabilities, and
  3. using conditional probabilities rather than ‘natural frequencies.’

Gerd doesn’t explain what he means by natural frequencies in the book but some of his other work does. Here’s a clarifying example that illustrates how the same information can be given in two different ways, the second of which is in the form of natural frequencies:

“The probability that a woman of age 40 has breast cancer is about 1 percent. If she has breast cancer, the probability that she tests positive on a screening mammogram is 90 percent. If she does not have breast cancer, the probability that she nevertheless tests positive is 9 percent. What are the chances that a woman who tests positive actually has breast cancer?”

vs.

“Think of 100 women. One has breast cancer, and she will probably test positive. Of the 99 who do not have breast cancer, 9 will also test positive. Thus, a total of 10 women will test positive. How many of those who test positive actually have breast cancer?”
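For completeness, the arithmetic behind both framings (my worked check, not from the book):

P(\text{cancer} \mid \text{positive}) = \frac{0.01 \times 0.90}{0.01 \times 0.90 + 0.99 \times 0.09} \approx 0.09

which matches the natural-frequency count: about 1 of the 10 women who test positive actually has breast cancer.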

For those in a hurry, here are my notes on the book.

What’s Best? Comparing Model Outputs

10 Mar

Let’s assume that you have a large portfolio of messages: n messages of k types. And say that there are several models, built by different teams, that estimate how relevant each message is to the user on a particular surface at a particular time. How would you rank-order the messages by relevance, understood as the probability that a person will click on the relevant substance of the message?

Isn’t the answer to use the max operator as a service? Just using the max operator can be a problem because of:

a) Miscalibrated probabilities: the probabilities being output from non-linear models are not always calibrated. A probability of .9 doesn’t mean that there is a 90% chance that people will click it.

b) Prediction uncertainty: prediction uncertainty for an observation is a function of the uncertainty in the betas and the distance from the bulk of the points we have observed. If you were to randomly draw 1,000 samples each from the estimated distributions of p, a different ordering may dominate than the one we get when we compare the means.
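A minimal sketch of the second point, assuming each model returns a mean and a standard error for its click probability (all numbers hypothetical): drawing from the estimated distributions and counting how often each message wins can give a different picture than simply comparing the means.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical estimates from three models: (mean p, s.e. of p)
estimates = {"msg_a": (0.30, 0.02), "msg_b": (0.32, 0.10), "msg_c": (0.28, 0.01)}

draws = {m: np.clip(rng.normal(mu, se, 1000), 0, 1)
         for m, (mu, se) in estimates.items()}
samples = np.column_stack(list(draws.values()))
win_share = np.bincount(samples.argmax(axis=1), minlength=3) / 1000

for msg, share in zip(draws, win_share):
    print(msg, round(share, 2))  # msg_b has the highest mean but does not always win
```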

This isn’t the end of the problems. It could be that the models are built on data that doesn’t match the data in the real world. (To discover that, you would need to compare the expected error rate to the actual error rate.) And the only way to fix the issue is to collect new data and build new models on it.

Comparing messages based on propensity to be clicked is unsatisfactory. A smarter comparison would optimize for profit, ideally over the long term. Moving from clicks to profits requires reframing. Profits need not only come from clicks. People don’t always need to click on a message to be influenced by it. They may choose to follow up at a later time. And the message may influence more than the person clicking on it. To estimate profits, thus, you cannot rely on observational data. To estimate the payoff for showing a message, which is equal to the estimated winnings minus the estimated cost, you need to learn it via an experiment. And to compare payoffs of different messages, e.g., encourage people to use a product more, encourage people to share the product with another person, etc., you need to distill the payoffs to the same currency—ideally, cash.

5 is smaller than 1.9!

10 Feb

“In the late 1990s, the leading methods caught about 80 percent of fraudulent transactions. These rates improved to 90–95 percent in 2000 and to 98–99.9 percent today. That last jump is a result of machine learning; the change from 98 percent to 99.9 percent has been transformational.

An improvement from 85 percent to 90 percent accuracy means that mistakes fall by one-third. An improvement from 98 percent to 99.9 percent means mistakes fall by a factor of twenty. An improvement of twenty no longer seems incremental.”


From Prediction Machines by Agrawal, Gans, and Goldfarb.

One way to compare the improvements is to compare differences in percentage points—5 and 1.9. That is what I would have done. That is because, conditional on the same difference in percentage points, the lower the base, the greater the multiplicative factor, which makes the multiplicative factor a cheap way of making small improvements look better. Even then, for consistency, the comparison would have been between percentage increases in accuracy, between (90 – 85)/85 and (99.9 – 98)/98. But AGG had to flip the estimand to percentage errors to make the latter relative change look better.
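To make the three candidate comparisons explicit (my arithmetic, not the book’s):

\text{percentage-point gains: } 90 - 85 = 5 \quad \text{vs.} \quad 99.9 - 98 = 1.9

\text{relative accuracy gains: } \tfrac{90 - 85}{85} \approx 5.9\% \quad \text{vs.} \quad \tfrac{99.9 - 98}{98} \approx 1.9\%

\text{relative error reductions: } \tfrac{15 - 10}{15} = \tfrac{1}{3} \quad \text{vs.} \quad \tfrac{2 - 0.1}{2} = 0.95 \ (\text{a factor of } 20)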

AutoSum Plus

23 Nov

Nearly four years ago, I released autosum. Autosum exploits work by other scientists to harvest key points from (and key concerns with) a paper. The software grabs the sentence before or after the citation to build that knowledge. The output is pretty useful. See for yourself. But you could do one better by using its output as labels for supervised text summarization tasks. You could learn BERT embeddings and then use them to predict key phrases (or more).

Making an Impression: Learning from Google Ads

31 Oct

Broadly, Google Ads works as follows:

  1. Advertisers create an ad, choose keywords, and make a bid on cost-per-click (CPC). (You can also bid on cost-per-view and cost-per-impression, but we limit our discussion to CPC.)
  2. The Google Ads account team vets whether the keywords are related to the product being advertised.
  3. People see the ad from the winning bid when they search for a term that includes the keyword or when they browse content related to the keyword (some Google Ads are shown on sites that use Google AdSense).

There is a further nuance to the last step. Generally, on popular keywords, Google has thousands of candidate ads to choose from. And Google doesn’t simply choose the ad from the winning bid. Instead, it uses data to choose an ad (or a few ads) that yield the most profit (Click Through Rate (CTR)*bid). (Google probably has a more complex user utility function and doesn’t show ads below a low predicted CTR*bid.) In all, whom Google shows an ad to depends on the predicted CTR and the money it will make per click.

Given this setup, we can reason about the audience for an ad. First, the higher the bid, the broader the audience. Second, it is not clear how well Google can predict CTR per ad conditional on keyword bid, especially when the ad run is small. And if that is so, we expect Google to show the ad with the highest bid to a random subset of people searching for the keyword or browsing content related to the keyword. Under such conditions, you can use the total number of impressions per demographic group as an indicator of interest in the keyword. For instance, if you make the highest bid on the keyword ‘election’ and you find that the total number of impressions your ad makes among people 65+ is 10x that among people between ages 18–24, then, under some assumptions, e.g., similar use of ad blockers, similar rates of clicking ads conditional on relevance (which would become the same as predicted relevance), similar utility functions (that is, younger people are not more sensitive to irritation from irrelevant ads than older people), etc., you can infer the relative interest of 18–24-year-olds versus those 65+ in elections.

The other case where you can infer relative interest in a keyword (topic) from impressions is when ad markets are thin. For common keywords like ‘elections,’ Google generally has thousands of candidate ads for national campaigns. But if you only want to show your ad in a small geographic area or on an infrequently searched term, the candidate set can be pretty small. If your ad is the only one, then your ad will be shown wherever it exceeds some minimum threshold of predicted CTR*bid. Assuming a high enough bid, you can take the total number of impressions of an ad as a proxy for total searches for the term and how often people browsed related content.

With all of this in mind, I discuss results from a Google Ads campaign. More here.

Canonical Insights

20 Oct

If the canonical insight of computer science is automating repetition, the canonical insight of data science is optimization. It isn’t that computer scientists haven’t thought about optimization. They have. But computer scientists weren’t the first to think about automation, just like economists weren’t the first to think that incentives matter. Automation is just the canonical, foundational purpose of computer science.

Similarly, optimization is the canonical, foundational purpose of data science. Data science aims to provide the “optimal” action at time t conditional on what you know. And it aims to do that by learning from data optimally. For instance, if the aim is to separate apples from oranges, the aim of supervised learning is to give the best estimate of whether the fruit is an apple or an orange given data.

For certain kinds of problems, the optimal way to learn from data is not to exploit found data but to learn from new data collected in an optimal way. For instance, randomized inference allows us to compare two arbitrary regimes. And if you want to optimize persuasiveness, you need to continuously experiment with different pitches (the number of dimensions on which pitches can be generated can be large), some of which exploit human frailties (which vary across people) and some of which exploit the fact that people need to be pitched the relevant value, and that relevant value differs across people.

Once you know the canonical insight of a discipline, it opens up all the problems that can be “solved” by it. It also tells you what kind of platform you need to build to make optimal decisions for that problem. For some tasks, the “platform” may be supervised learning. For other tasks, like ad persuasiveness, it may be a platform that combines supervised learning (for targeting) and experimentation (for optimizing the pitch).

Computing Optimal Cut-Offs

7 Oct

Probabilities from classification models can have two problems:

  1. Miscalibration: A p of .9 often doesn’t mean a 90% chance of 1 (assuming a dichotomous y). (You can calibrate it using isotonic regression.)

  2. Optimal cut-offs: For multi-class classifiers, we do not know what probability value will maximize the accuracy or F1 score. Or any metric for which you need to trade off between false positives (FP) and false negatives (FN).

One way to solve #2 is to run the true labels (out of sample, otherwise there is a concern about bias) and the probabilities through a brute-force optimizer that gives you the optimal cut-off for the metric. Here’s the script for doing the same, along with an illustration.
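Not the linked script, but a minimal sketch of the idea for the binary case, assuming held-out labels `y_true` and probabilities `p` (names hypothetical):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_cutoff(y_true, p, metric=f1_score, grid=np.linspace(0.01, 0.99, 99)):
    """Brute-force search over cut-offs; returns the cut-off maximizing the metric."""
    scores = [metric(y_true, (p >= c).astype(int)) for c in grid]
    return grid[int(np.argmax(scores))], max(scores)

# Example: cutoff, score = best_cutoff(y_holdout, p_holdout)
```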

Online Learning With Biased Sampling

3 Oct

Say that you train a model to predict who will click on an ad. Say that you deploy the model to only show ads to people who are likely to click on them. (For a discussion about the optimal strategy for who to show ads to, see here.) And say you use the clicks from the people who see the ad to continue to tune the parameters. (This is a close approximation of a standard implementation of online learning in online advertising.)

In effect, once you launch the model, you only get data from a biased set of users. Such a sampling bias can be a problem when the data generating process (how the 1s and the 0s are generated) changes in such a way that changes above the threshold (among the kinds of people we get data from) are uncorrelated with changes below the threshold (among the people we do not get data from). The concerning aspect is that if this happens, the model continues to “work,” in that accuracy can stay high even as recall (the proportion of people for whom the ad is relevant) falls over time. There is only one surefire way to diagnose the issue and address it: continue to collect some data from people below the threshold and learn whether the data generating process is changing.
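A minimal sketch of that fix, under stated assumptions: serve a small random fraction of below-threshold users anyway (an epsilon floor) and monitor the model on that slice; the threshold, epsilon, and function names are illustrative.

```python
import random

THRESHOLD, EPSILON = 0.6, 0.05  # illustrative values

def should_show_ad(p_click):
    """Show the ad above the threshold; below it, show to a small random
    fraction so we keep getting unbiased data from that region."""
    if p_click >= THRESHOLD:
        return True, "exploit"
    if random.random() < EPSILON:
        return True, "explore"
    return False, "skip"

# Periodically: compare click rates / calibration on the "explore" slice
# against the model's predictions to detect drift below the threshold.
```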