## The Nonscience of Machine Learning

29 Aug

In 2013, Girshick et al. released a paper that described a technique to solve an impossible-sounding problem—classifying each pixel of an image (or semantic segmentation). The technique that they proposed, R-CNN, combines deep learning, selective search, and SVM. It also has all sorts of ad hoc choices, from the size of the feature vector to the number of regions, that are justified by how well they work in practice. R-CNN is not unusual. Many machine learning papers are recipes that ‘work.’ There is a reason for that. Machine learning is an engineering discipline. It isn’t a scientific one.

You may think that engineering must follow science, but often it is the other way round. For instance, we learned how to build things before we learned the science behind them: we trialed-and-errored and overengineered our way to many still-standing buildings while the scientific understanding slowly accumulated. Similarly, we were able to predict the seasons and the phases of the moon before learning how our solar system worked. Our ability to solve problems with machine learning is similarly ahead of our ability to put it on a firm scientific basis.

Often, we build something based on some vague intuition, find that it ‘works,’ and only over time deepen our intuition about why (and when) it works. Take, for instance, Dropout. The original paper (released in 2012, published in 2014) had the following as motivation:

A motivation for Dropout comes from a theory of the role of sex in evolution (Livnat et al., 2010). Sexual reproduction involves taking half the genes of one parent and half of the other, adding a very small amount of random mutation, and combining them to produce an offspring. The asexual alternative is to create an offspring with a slightly mutated copy of the parent’s genes. It seems plausible that asexual reproduction should be a better way to optimize individual fitness because a good set of genes that have come to work well together can be passed on directly to the offspring. On the other hand, sexual reproduction is likely to break up these co-adapted sets of genes, especially if these sets are large and, intuitively, this should decrease the fitness of organisms that have already evolved complicated coadaptations. However, sexual reproduction is the way most advanced organisms evolved. …

Srivastava et al. 2014, JMLR

Moreover, the paper provided no proof and only some empirical results. It took until Gal and Ghahramani’s 2016 paper (released in 2015) to put the method on a firmer scientific footing.

Then there are cases where we have made ad hoc choices that ‘work’ and where no one will ever come up with a convincing theory. Instead, progress will mean replacing bad advice with good. Take, for instance, the recommended step of ‘normalizing’ variables before doing k-means clustering or before doing regularized regression. The idea of normalization is simple enough: put each variable on the same scale. But it is also completely weird. Why should we put each variable on the same scale? Some variables are plausibly more substantively important than others and we ideally want to prorate by that.

### What Can We Learn?

The first point is about teaching machine learning. Bricklaying is thought to be best taught via apprenticeship. And core scientific principles are thought to be best taught via books and lecturing. Machine learning is closer to the bricklaying end of the spectrum. First, there is a lot in machine learning that is ad hoc and beyond scientific or even good intuitive explanation and hence taught as something you do. Second, there is plausibly much to be learned in seeing how others trial-and-error and come up with kludges to fix the issues for which there is no guidance.

The second point is about the maturity of machine learning. Over the last few decades, we have been able to accomplish really cool things with machine learning. And these accomplishments distract us from how early we are. The fact is that we have been able to achieve cool things with very crude tools. For instance, OOS validation is a crude but very commonly used tool for preventing overfitting: we stop optimization when the OOS error starts increasing. As our scientific understanding deepens, we will likely invent better tools. The best of machine learning is a long way off. And that is exciting.

## Fairly Certain: Using Uncertainty in Predictions to Diagnose Roots of Unfairness

8 Jul

One conventional definition of group fairness is that the ML algorithms produce predictions where the FPR (or FNR or both) is the same across groups. Fixating on equating FPR etc. can harm the very groups we are trying to help. So it may be useful to rethink how to solve the problem of reducing unfairness.

One big reason why the FPR may vary across groups is that, given the data, some groups’ outcomes are less predictable than others. This may be because of the limitations of the data itself or because of the limitations of algorithms. For instance, Kearns and Roth in their book bring up the example of college admissions. The training data for college admissions is the decisions made by college counselors. College counselors may well be worse at predicting the success of minority students because they are less familiar with their schools, groups, etc., and this, in turn, may lead to algorithms performing worse on minority students. (Assume the algorithm to be human decision-makers and the point becomes immediately clear.)

One way to address worse performance may be to estimate the uncertainty of the prediction. This allows us to deal with people with wider confidence bounds separately from people with narrower confidence bounds. The optimal strategy for people with wider confidence bounds may be to collect additional data to become more confident in those predictions. For instance, Komiyama and Noda propose something similar (pdf) to help overcome a lack of information during hiring. Or we may need to figure out a way to compensate people based on their uncertainty interval.

The average width of the uncertainty interval across groups may also serve as a reasonable way to diagnose this particular problem.
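As a rough illustration of the diagnostic, here is a minimal Python sketch that averages interval widths by group. The function name, the record layout `(group, lower, upper)`, and the numbers are all made up for illustration; how the intervals are produced (conformal methods, bootstrapping, etc.) is left open.

```python
from collections import defaultdict


def mean_interval_width_by_group(records):
    """Average width of prediction intervals per group.

    records: iterable of (group, lower, upper) tuples (hypothetical layout).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, lower, upper in records:
        totals[group] += upper - lower
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up intervals for two groups, purely for illustration.
records = [
    ("a", 0.40, 0.50), ("a", 0.60, 0.70),
    ("b", 0.20, 0.60), ("b", 0.30, 0.90),
]
widths = mean_interval_width_by_group(records)
print(widths)
```

A much larger average width for one group (here, group b) would suggest that the group's outcomes are less predictable given the data, which is the situation described above.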

## Optimal Data Collection When Strata and Strata Variances Are Known

8 Jul

With Ken Cor.

What’s the least amount of data you need to collect to estimate the population mean with a particular standard error? For the simplest case, where we estimate the mean of a binomial variable using simple random sampling, a conservative estimate of the variance (p=.5), and a ±3 percentage point margin of error at 95% confidence, the answer (n∼1,000) is well known. The simplest case, however, assumes little to no information. Often, we know more. In opinion polling, we generally know sociodemographic strata in the population. And we have historical data on the variability in strata. Take, for instance, measuring support for Mr. Obama. A polling company like YouGov will usually have a long time series, including information about respondent characteristics. Using this data, the company could derive how variable the support for Mr. Obama is among different sociodemographic groups. With information about strata and strata variances, we can often poll fewer people (vis-a-vis random sampling) to estimate the population mean with a particular s.e. In a note (pdf), we show how.

### Why bother?

In a realistic example, we find the benefit of using optimal allocation over simple random sampling is 6.5% (see the code block below).

Assume two groups, a and b. Using the notation in the note (see the pdf), wa denotes the proportion of group a in the population, vara and varb denote the variances of groups a and b, respectively, and p denotes the sample mean. If you use the simple random sampling formula, you will estimate that you need to sample 1095 people. If you optimally exploit the information about strata and strata variances, you need to sample just 1024 people.

    ## The Benefit of Using Optimal Allocation Rules
    ## wa = .8
    ## vara = .25; pa = .5
    ## varb = .16; pb = .8
    ## SRS: pop_mean of .8*.5 + .2*.8 = .56

    # sqrt(p(1 -p)/n) = .015
    # n = p*(1- p)/.015^2 = 1095

    # optimal_n_plus_allocation(.8, .25, .16, .015)
    #   n   na   nb
    #1024  853  171

Github Repo.: https://github.com/soodoku/optimal_data_collection/
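
The numbers above can be reproduced with the textbook Neyman allocation formulas (ignoring the finite population correction). The Python sketch below assumes that is what the note derives; `srs_n` and `neyman_allocation` are hypothetical names used here for illustration, not the API of the linked repo.

```python
import math


def srs_n(p, se):
    """Sample size under simple random sampling for a proportion p."""
    return p * (1 - p) / se ** 2


def neyman_allocation(wa, var_a, var_b, se):
    """Total n and per-stratum n for two strata with known variances.

    Assumes n = (sum_h W_h * S_h)^2 / se^2 and n_h proportional to W_h * S_h.
    """
    wb = 1 - wa
    sd_a, sd_b = math.sqrt(var_a), math.sqrt(var_b)
    weighted_sd = wa * sd_a + wb * sd_b
    n = weighted_sd ** 2 / se ** 2
    na = n * wa * sd_a / weighted_sd
    nb = n * wb * sd_b / weighted_sd
    return round(n), round(na), round(nb)


pop_mean = 0.8 * 0.5 + 0.2 * 0.8          # .56, as in the example above
n_srs = round(srs_n(pop_mean, 0.015))
n_opt, na, nb = neyman_allocation(0.8, 0.25, 0.16, 0.015)
print(n_srs, n_opt, na, nb)
```

Under these assumptions, the saving is (1095 - 1024)/1095, or about 6.5%, and the more variable stratum is oversampled relative to its population share.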

## Beyond yhat: Developing ML Products

7 Jul

Making useful products is hard. Making useful ML products is harder still, in part because there are a larger number of moving parts in an ML system. To understand the issues at stake, let’s go over the **basics** of developing an ML product.

Often, product development starts with a business problem. And your first job is to understand the business problem as well as you can, familiarizing yourself with as much detail as possible.

Let’s say the problem is as follows: A company gets a lot of customer emails. All the emails go to a common inbox from which specialist customer agents fish out emails that are relevant to them. For instance, finance specialists fish out billing emails. And technical specialists fish out emails about technical errors. Fishing is time-consuming and chaotic.

Once you understand the precise problem (the time taken to discover and assign emails), work on developing solutions for the problem. When developing solutions, the bias should be toward solving the problem the best way possible rather than injecting custom ML into whatever solution you propose. For instance, you could propose a solution that makes it easier to search (using no ML or off-the-shelf ML) and bulk assign to a new queue. But let’s say that after careful consideration of costs and benefits, a particularly appealing solution is a system that uses machine learning to automatically direct relevant emails to specialist inboxes, obviating the need to fish. That’s a start to a solution, not the end. You need to spend enough time thinking about the solution so that you have thought about how to handle edge cases, e.g., when there is a technical issue about billing, a misclassified email, etc., and any spillover issues, like the latency of such a system, how implementing such a system may break existing data pipelines that measure the total number of emails, etc.

Next, you need to define the KPIs. How much time will be saved? What is the total cost of the saved time? How many mistakes is the system making? What is the cost of handling mistakes?

Next, you need to turn the business problem into a precise machine learning problem. What labels would you predict? How would you collect the initial labels?

Once the outline of the solution has been agreed upon, you need to don your architect’s hat and outline a system diagram. Wearing the data engineer’s hat, figure out where the data needed for training and for live classification is stored, and how you would build a pipeline for training and serving the model. This is also the time to understand what guarantees, if any, exist on the data, and how you can test those guarantees.

Last, you must wear an operator’s hat. Wearing it, you work out the operational nitty-gritty of how to introduce a new product. This is the time when you work with stakeholders to stand up dashboards to monitor the system and the A/B tests, and to develop rollout and rollback strategies.

The key to wearing an architect’s hat is not only to design a system but also to make sure that enough logging is in place for different parts of the system for you to triage failures. So part of the dashboard would display logs from different parts of the system.

## Equilibrium Fairness: How “Fair” Algorithms Can Hurt Those They Purport to Help

7 Jul

One definition of a fair algorithm is an algorithm that yields the same FPR across groups (an example of classification parity). To achieve that, we often have to trade in some accuracy. The final model is thus less accurate but fair. There are two concerns with such models:

1. Net Harm Over Relative Harm: Because of lower accuracy, the number of people from a minority group that are unfairly rejected (say for a loan application) may be a lot higher. (This is ignoring the harm done to other groups.)
2. Mismeasuring Harm? Consider an algorithm used to approve or deny loans. Say we get the same FPR across groups but lower accuracy for loans with a fair algorithm. Using this algorithm, however, means that credit is more expensive for everyone. This, in turn, may cause fewer people of the vulnerable group to get loans as the bank factors in the cost of mistakes. Another way to think about the point is that using such an algorithm causes net interest paid per borrowed dollar to increase by some number. It seems this common scenario is not discussed in many of the papers on fair ML. One reason for that may be that people are fixated on who gets approved and not the interest rate or total approvals.

## No Stopping: Impact of the Stopping Rule on the Sex Ratio

20 Jun

For social scientists brought up to worry about bias stemming from stopping data collection when results look significant, the fact that a gender-based stopping rule has no impact on the sex ratio seems suspect. So let’s dig deeper.

Let there be n families and let the stopping rule be that after the birth of a male child, the family stops procreating. Let p be the probability a male child is born and q=1−p. Each expression below is the proportion of male children among all children born so far.

After 1 round:

$\frac{pn}{n} = p$

After 2 rounds:

$\frac{(pn + qpn)}{(n + qn)} = \frac{(p + pq)}{(1 + q)} = \frac{p(1 + q)}{(1 + q)} = p$

After 3 rounds:

$\frac{(pn + qpn + q^2pn)}{(n + qn + q^2n)}\\ = \frac{(p + pq + q^2p)}{(1 + q + q^2)}$

After k rounds:

$\frac{(pn + qpn + q^2pn + \ldots + q^{k-1}pn)}{(n + qn + q^2n + \ldots + q^{k-1}n)}$

After infinite rounds:

Total male children:

$= pn + qpn + q^2pn + \ldots\\ = pn (1 + q + q^2 + \ldots)\\ = \frac{np}{(1 - q)}$

Total children:

$= n + qn + q^2n + \ldots\\ = n (1 + q + q^2 + \ldots)\\ = \frac{n}{(1 - q)}$

Prop. Male:

$= \frac{np}{(1 - q)} * \frac{(1 - q)}{n}\\ = p$

If it still seems like a counterintuitive result, here’s one way to think: in round k + 1, we get npq^k male children, and the total number of kids increases by nq^k, so the share of boys among new births is always p. Yet another way to think is that for any child that is born, the data generating process is unchanged.

The male-child stopping rule may not affect the aggregate sex ratio. But it does cause changes in families. For instance, it causes a negative correlation between family size and the proportion of male children. For instance, if your first child is male, you stop. (For more results in this vein, see here.)
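
A quick simulation is one way to check both claims. The Python sketch below is illustrative only: the function name, the seed, and the choice of 100,000 families with p = .5 are assumptions, and families are allowed to keep going until a boy is born (the infinite-rounds case).

```python
import random


def simulate_families(n_families, p_male, seed=0):
    """Each family has children until the first boy, then stops."""
    rng = random.Random(seed)
    families = []
    for _ in range(n_families):
        boys = girls = 0
        while boys == 0:
            if rng.random() < p_male:
                boys += 1
            else:
                girls += 1
        families.append((boys, girls))
    return families


families = simulate_families(100_000, 0.5)

# Aggregate sex ratio: boys as a share of all children born.
total_boys = sum(b for b, _ in families)
total_kids = sum(b + g for b, g in families)
aggregate_share = total_boys / total_kids

# Family-level view: the share of boys within each family, averaged over families.
mean_family_share = sum(b / (b + g) for b, g in families) / len(families)

print(round(aggregate_share, 3), round(mean_family_share, 3))
```

The aggregate share should land close to p. The average within-family share is well above p because every family has exactly one boy, so a family's share of boys is 1 divided by its size, which is also why family size and the proportion of male children are negatively correlated.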

But why does this differ from our intuition that comes from early stopping in experiments? Easy. We define early stopping as when we stop data collection as soon as the results are significant. This causes a positive bias in the number of false-positive results (w.r.t. the canonical sample-fixed-in-advance rule). But early stopping leads to both kinds of false positives: mistakenly thinking that the proportion of females is greater than .5 and mistakenly thinking that the proportion of males is greater than .5. The rule is unbiased w.r.t. the expected value of the proportion.

## ML (O)Ops: What Data To Collect? (part 3)

16 Jun

The first part of the series, “Improving and Deploying On-Device Models With Confidence,” is posted here. The second part, “Keeping Track of Changes,” is posted here.

With Atul Dhingra

For a broad class of machine learning problems, nitpicking over the neural net architecture is over (see, for instance, here). Instead, the focus has shifted to data. In the note below, we articulate some ways of thinking about what data to collect. In our discussion, we focus on supervised learning.

The answer to “What data to collect?” varies by where you are in the product life cycle. If you are building a new ML product and the aim is to deploy something (basic) that delivers value and then iterate on it, one answer to the question is to label easy-to-predict cases—cases that allow you to build models where the precision is high but the recall is low. The bar is whether the model can do as well as business as usual for a small set of cases. The good thing is that you can hurdle that bar another way—by coding a random sample, building a model, and choosing a threshold where the precision is greater than business as usual (read more here). For producing POCs, models built on cheap data, e.g., open-source data, which plausibly do not produce value, can also “work” though they need to be managed against the threat of poor performance reducing faith in the system.
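
The threshold-picking step can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the function name, the scores and labels, and the business-as-usual precision of .75 are made up, and in practice the scores should come from held-out data.

```python
def lowest_threshold_beating_baseline(scores, labels, baseline_precision):
    """Lowest score threshold whose precision exceeds the baseline.

    Returns (threshold, precision, recall), or None if no threshold qualifies.
    Among qualifying thresholds, the lowest one gives the highest recall.
    """
    total_pos = sum(labels)
    best = None
    for t in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= t]
        tp = sum(flagged)
        precision = tp / len(flagged)
        recall = tp / total_pos if total_pos else 0.0
        if precision > baseline_precision:
            best = (t, precision, recall)
    return best


# Made-up model scores and true labels from a coded random sample.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 1, 0, 1, 0]
result = lowest_threshold_beating_baseline(scores, labels, baseline_precision=0.75)
print(result)
```

Cases scoring below the chosen threshold would continue to be handled the business-as-usual way.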

The more conventional case is where you have a deployed model, and you want to improve its performance. There the answer to what data to collect is data that yields the highest ROI. (The answer to what data provides the highest ROI will vary over time, so we need a system that continuously answers it.) If we assume that the labeling costs for points are the same, the prioritization function reduces to ranking data by returns. To begin with, let’s assume that returns are measured by the cost function. So, for instance, if we are looking for a model that lowers the RMSE, we would like to rank by how much reduction in RMSE we get from labeling an additional point. And naturally, we care about the test set RMSE. (You can generalize this intuition to any loss function.) So far, so good. The rub comes from the fact that there is no trivial answer to the problem.

One way to answer the question is to run experiments, sampling across Xs, or plausibly use bandits and navigate the explore-exploit tradeoff smartly. Rather than do experiments, you can also exploit the data you have to figure out the kinds of points that make the most impact on RMSE. One way to get at that is using influence functions. There are, however, a couple of challenges in using these methods. The first is that the covariate space is large and the marginal impact is small, and that means inference is noisy. The second is a more general problem. Say you find that X_1, X_2, X_3, … are the points that lead to the largest reduction in RMSE. But how do you convert that knowledge into a data collection problem? Is it that we should collect replicas of X_1? Probably not. We need to generalize from these examples and come up with a statement about the “type of data” that needs to be collected, e.g., more images where the traffic sign is covered by trees. To come up with the ‘type’, we need to specify what the example is not: how does it differ from the rest of the data we have? There are a couple of ways to answer the question. The first is to use clustering (using embeddings) and then assign someone to label the clusters. Another is to use supervised learning to distinguish X_1, X_2, X_3 from the rest of the data and figure out the “important predictors.”

There are other answers to the question, “What data to collect?” For instance, we could look to label points where we are least certain or where we make the largest error. The intuition in the classification setting is that these points are closest to the hyperplane that separates the classes, and if you can learn to classify near the boundary, you are set. In using this method, you can also sometimes discover mislabeling. (The RMSE method we talk about above doesn’t interrogate the Y, taking the labels as given.)
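
For a binary classifier, "least certain" can be operationalized as the predicted probability closest to .5. The Python sketch below is one simple version of that idea; the function name and the probabilities are hypothetical.

```python
def rank_by_uncertainty(probabilities):
    """Indices of points ordered from least to most certain.

    probabilities: predicted P(y = 1) for unlabeled points.
    """
    return sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )


# Made-up predicted probabilities for five unlabeled points.
probs = [0.97, 0.52, 0.10, 0.45, 0.80]
to_label_first = rank_by_uncertainty(probs)[:2]
print(to_label_first)
```

Ranking by the size of the error instead would need labels for the candidate points, so it applies to data that is already labeled.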

Another way to answer the question is to use model interpretation tools to figure out “why” the models are making errors. For instance, you could find that the model is making errors because of confounding. Famously, for instance, a cat vs. dog classifier can merely be an outdoor vs. indoor classifier. And if we see the model taking confounding features like the background into consideration, we could a) better label the data to segment out dogs and cats from the background, b) introduce paired examples such that the only thing different between any two images is strictly presence or absence of a dog/cat.

## The True Ones: Best Guess of True Proportion of 1s

30 May

ML models are generally used to make predictions about individual observations. Sometimes, however, the business decision is based on aggregate data. For example, say a company sells pants and wants to know how many will be returned over a certain period. Say the company has an ML model that predicts the chance a customer will return a pair of pants. A natural thing to do would be to use the individual predictions to get an expected return count.

One way to get an expected return count, if the model produces calibrated probabilities, is to simply sum the predicted probabilities (or take their mean to get the expected return rate). But say that you built an ML model to predict a dichotomous variable and you only have access to categorized outputs (1s and 0s). Say for model X, for cat == 1, the OOS recall is r and precision = p. Let’s say we use the model to predict labels for another dataset. Let’s say we observe 100 1s and 200 0s. What is the best estimate of the true proportion of 1s in the new dataset?

The quantity of interest = TP + FN

TP + FN = TP/r

TP = (TP + FP)*p

TP + FN = ((TP + FP)*p)/r = 100*p/r

(TP + FN)/n = 100p/(300r) = p/(3r)
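
The derivation translates directly into a short Python function. The function name and the example precision and recall (.9 and .6) are made up for illustration, and the estimate leans on the OOS precision and recall carrying over to the new dataset.

```python
def estimated_share_of_ones(n_predicted_ones, n_total, precision, recall):
    """Estimate the true proportion of 1s from hard predictions.

    TP = predicted 1s * precision; actual 1s = TP / recall.
    """
    true_positives = n_predicted_ones * precision
    actual_ones = true_positives / recall
    return actual_ones / n_total


# 100 predicted 1s out of 300, with hypothetical precision .9 and recall .6.
estimate = estimated_share_of_ones(100, 300, precision=0.9, recall=0.6)
print(round(estimate, 3))
```

With these made-up values, the estimate is p/(3r) = .9/1.8 = .5, well above the naive 100/300.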

## ML (O)Ops! Keeping Track of Changes (Part 2)

22 Mar

The first part of the series, “Improving and Deploying On-Device Models With Confidence”, is posted here.

With Atul Dhingra

One way to automate classification is to compare new instances to a known list and plug in the majority class of the exact match. For such instance-based learning, you often don’t need to version data; you just need a hash table. When you are not relying on an exact match—most machine learning—you often need to version data to reproduce the behavior.
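
To make the hash-table point concrete, here is a minimal Python sketch of exact-match classification. The function name and the example emails are hypothetical.

```python
from collections import Counter, defaultdict


def build_lookup(labeled_instances):
    """Map each known instance to the majority class among its labels.

    Ties are broken by whichever label was seen first.
    """
    votes = defaultdict(Counter)
    for instance, label in labeled_instances:
        votes[instance][label] += 1
    return {inst: counts.most_common(1)[0][0] for inst, counts in votes.items()}


# Made-up known list of (instance, label) pairs.
known = [
    ("refund please", "billing"),
    ("refund please", "billing"),
    ("refund please", "technical"),
    ("app crashes", "technical"),
]
lookup = build_lookup(known)

print(lookup.get("refund please"))  # exact match: majority class
print(lookup.get("never seen"))     # no exact match: None
```

The behavior of such a system is fully described by the current contents of the table, which is why versioning matters less here than for a model whose predictions depend on everything it was trained on.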

Reproducibility is the bedrock of mature software engineering. It is fundamental because it allows you to diagnose issues. You can reproduce the behavior of a ‘version.’ With that power, you can correlate changes in inputs with changes in outputs. Systems that enable reproducibility, like version control, have another vital purpose: reducing the risk stemming from changes and allowing regression testing in systems that depend on data, such as ML. They reduce the risk by allowing changes to be rolled back.

To reproduce outputs from machine learning models, we need to do more than store data. We also need to store hyper-parameters, details about the OS, programming language, and packages, among other things. But given that the primary value of reproducibility is instrumental (diagnosis), we want not just the ability to reproduce but also the ability to understand changes and correlate them. Current solutions miss the mark.

### Current Solutions and Problems

One way to version data is to treat it as a binary blob. Upload all the data you trained a model on to a server and store a reference to the data in your repository. If the data changes, store the new version and create a new pointer. One downside of using a <code>git lfs</code>-like mechanism is that your storage blows up. Another is that build times can be large if the local cache is small or, more generally, if access costs are large. Yet another problem is the lack of a neat interface that allows you to track more than source data.

DVC purports to solve all three problems. It solves the first by providing a way to not treat the data as a blob. For instance, in a computer vision workflow, the source data is image files with some elementary tags: labels, assignments to train and test, etc. The differences between data versions are 1) changes in images (additions mostly) and 2) changes in mapping to labels and assignments. DVC allows you to store the differences in corpora of images as a list of additional hashes to files. DVC is silent on the second point, the efficient storage of changes in mappings. We come to it later. DVC purports to solve the second problem by allowing you to save to local cloud storage. But it can still be time-consuming to download data from the cloud storage buckets. The reason is as follows. Each time you want to work on an experiment, you need to clone the entire cache to check out the appropriate files. And if not handled properly, the cloning time often significantly exceeds typical training times. Worse, it locks you into a cloud provider for any optimizations you may want to make to speed up these slow cache downloads. DVC purports to solve the last problem by using yaml, tags, etc. But anarchy prevails.

### Future Solutions

#### Interpretable Changes

One of the big problems with data versioning is that the diffs are not human-readable, much less comprehensible. The diffs are usually very long, and the changes in the diff are hashes, which means that to review an MR/PR/Diff, the reviewer has to check out the change and pull the data with the updated hashes. The process can be easily improved by adding an extra layer that auto-summarizes the changes into a human-readable form. We can, of course, easily do more. We can provide ways to understand how changes to inputs correlate with changes in outputs.
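
A summarizing layer of this kind could start as simply as the Python sketch below. It is a hypothetical illustration, not a feature of any existing tool: it assumes each data version can be represented as a mapping from a file hash to a label, and the function names are made up.

```python
def summarize_changes(old, new):
    """Compare two versions of a {file_hash: label} mapping."""
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    relabeled = sorted(k for k in old if k in new and old[k] != new[k])
    return {"added": added, "removed": removed, "relabeled": relabeled}


def describe(summary):
    """One-line, human-readable description for a PR comment."""
    return ", ".join(f"{len(items)} {kind}" for kind, items in summary.items())


# Made-up versions of an image-to-label mapping.
v1 = {"a1": "cat", "b2": "dog", "c3": "dog"}
v2 = {"a1": "cat", "b2": "cat", "d4": "dog"}

summary = summarize_changes(v1, v2)
print(describe(summary))
```

A reviewer would then see a line like "1 added, 1 removed, 1 relabeled" instead of a list of hashes, and could drill into the individual entries where needed.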

#### Diff. Tables

The standard method of understanding data as a blob seems uniquely bad. For conventional rectangular databases, changes can be understood as changes in functional transformations of core data lake tables. For instance, say we store the label assignments of images in a table. And say we revise the labels of 100 images. (The core data lake tables are immutable, so the changes are executed in the downstream tables.) One conventional way of storing the changes is to use a separate table for recording changes. Another is to write an update statement that is run whenever “the v2” table is generated. This means the differences across data are now tied to a data transformation computation graph. When data transformation is inexpensive, we can delay running the transformations till the table is requested. In other cases, we can cache the tables.

## ML (O)Ops! Improving and Deploying On-Device Models With Confidence (Part 1)

21 Feb

With Atul Dhingra.

Part 1 of a multi-part series.

It is well known that ML Engineers today spend most of their time doing things that do not have a lot to do with machine learning. They spend time working on technically unsophisticated but important things like deployment of models, keeping track of experiments, etc.—operations. Atul and I dive into the reasons behind the status quo and propose solutions, starting with issues to do with on-device deployments.

### Performance on Device

The deployment of on-device models is complicated by the fact that the infrastructure used for training is different from what is used for production. This leads to many tedious rollbacks.

The underlying problem is missing data. We are missing data on the latency in prediction, which is a function of i/o latency and time taken to compute. One way to impute the missing data is to build a model that predicts latency based on various features of the deployed model. Given many companies have gone through thousands of deployments and rollbacks, there is rich data to learn from. Another is to directly measure the time with ‘shadow deployments’: measuring performance on redundant chips colocated with the production chip and getting exactly the same data at about the same time (a small lag in passing on the data to the redundant chips is just fine as we can start the clock at a different time).

Predicting latency given a model and deployment architecture solves the problem of deploying reliably. It doesn’t solve the problem of how to improve the performance of the system given a model. To improve the production performance of ML systems, companies need to analyze the data, e.g., compute the correlation between load on the edge server and latency, and generate additional data by experimenting with various easily modifiable parts of the system, e.g., increasing capacity of the edge server, etc. (If you are a cloud service provider like AWS, you can learn from all the combinations of infrastructure that exist to predict latency for various architectures given a model and propose solutions to the customer.)

There is plausibly also a need for a service that helps companies decide which chip is optimal for deployment. One solution to the problem is MLPerf.org as a service: a service that provides data on the latency of a model on different chips.

## Build Software for the Lay User

14 Feb

Most word processing software helpfully points out grammatical errors and spelling mistakes. Some even autocorrect. And some, like Grammarly, even give style advice.

Now consider software used for business statistics. Say you want to compute the correlation between two vectors: [100, 200, 300, 400, 500, 600] and [1, 2, 3, 4, 5, 17000]. Most (all?) software will output .65. (The software assumes you want Pearson’s correlation.) Experts know that the relatively large value in the second vector has a large influence on the correlation. For instance, switching it to -17000 will reverse the correlation coefficient to -.65. And if you remove the last observation, the correlation is 1. But a lay user would be none the wiser. Common software packages, e.g., Excel, R, Stata, Google Sheets, etc., do not warn the user about the outlier and its potential impact on the result. They should.
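
The Python sketch below checks the arithmetic and shows one form such a warning could take. It takes the first vector to be 100, 200, ..., 600, the values that reproduce the .65 figure. The leave-one-out check, its tolerance of .25, and the function names are illustrative assumptions, not how any of the packages named above work.

```python
import math


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_with_outlier_check(x, y, tol=0.25):
    """Return r and a warning if dropping one observation moves r by more than tol."""
    r = pearson(x, y)
    shifts = []
    for i in range(len(x)):
        r_without = pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        shifts.append((abs(r_without - r), i))
    biggest_shift, index = max(shifts)
    warning = None
    if biggest_shift > tol:
        warning = f"dropping observation {index} changes r by {biggest_shift:.2f}"
    return r, warning


x = [100, 200, 300, 400, 500, 600]
y = [1, 2, 3, 4, 5, 17000]
y_flipped = [1, 2, 3, 4, 5, -17000]

r, warning = correlation_with_outlier_check(x, y)
print(round(r, 2), warning)
print(round(pearson(x, y_flipped), 2))
print(round(pearson(x[:-1], y[:-1]), 2))
```

A real implementation would need to think harder about what counts as influential, but even a crude check like this would surface the issue to a lay user.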

Take another example: the interpretation of AUC is fickle when you have binary predictors (see here), as much depends on how you treat ties. It is an obvious but subtle point. Commonly used statistical software, however, does not warn people about the issue.

Given the rate of increase in the production of knowledge, increasingly everyone is a lay user. For instance, in 2013, Lin showed that estimating ATE using OLS with a full set of interactions improves the precision of ATE. But such analyses are uncommon in economics papers. The analysis could be absent for a variety of reasons: 1. ignorance, 2. difficulty in estimating the model, 3. disbelief in the result, etc. However, only ignorance withstands scrutiny. The model is easy to estimate, so the second explanation is unlikely to explain much. The last explanation also seems unlikely, given the result was published in a prominent statistical journal and experts use it.

If ignorance is the primary explanation, should the onus of being well informed about the latest useful discoveries in methods fall on researchers working in a substantive area? Plausibly. But that is clearly not working very well. One way to accelerate the dissemination of useful discoveries is via software, where you can provide such guidance as ‘warnings.’

The guidance can be put in manually. Or we can use machine learning, exploiting the strategy used by Grammarly, which uses expert editors to edit lay user sentences and uses that as training data.

We can improve science by building software that provides better guidance. The worst case for such software is probably business-as-usual, where some researchers get bad advice and many get no advice.

## Superhuman: Can ML Beat Human-Level Performance in Supervised Models?

20 Dec

A supervised model cannot do better than its labels. (I revisit this point later.) So the trick is to make labels as good as you can. The errors in labels stem from three sources:

1. Lack of Effort: The more effort people spend labeling something, presumably the more accurate it will be.
2. Unclear Directions: Unclear directions can result from a. poorly written directions, b. conceptual issues, c. poor understanding. Let’s tackle conceptual issues first. Say you are labeling the topic of news articles. Say you come across an article about how Hillary Clinton’s hairstyle has evolved over the years. Should it be labeled as politics, or should it be labeled as entertainment (or my preferred label: worthless)? It depends on taste and the use case. Whatever the decision, it needs to be codified (and clarified) in the directions given to labelers. Poor writing is generally a result of inadequate effort.
3. Hardness: Is that a 4 or a 7? We have all suffered enough at the hands of CAPTCHA to know that some tasks are harder than others.

The fix for the first problem is obvious. To increase effort, incentivize. Incentivize by paying for correctness—measured over known-knowns—or by penalizing mistakes. And by providing feedback to people on the money they lost or how much more others with a better record made.

Solutions for unclear directions vary by the underlying problem. To address conceptual issues, incentivize people to flag (and comment on) cases where the directions are unclear and build a system to collect and review prediction errors. To figure out if the directions are unclear, quiz people on comprehension and archetypal cases.

Can ML Performance Be Better Than Humans?

If humans label the dataset, can ML be better than humans? The first sentence of the article suggests not. Of course, we have yet to define what humans are doing. If the benchmark is labels provided by a poorly motivated and trained workforce and the model is trained on labels provided by motivated and trained people, ML can do better. The consensus label provided by a group of people will also generally be less noisy than one provided by a single person.
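The point about consensus labels can be made concrete with a small sketch. It assumes binary labels, labelers who err independently, and a common per-labeler accuracy `p`; these are simplifying assumptions of mine, and real labelers tend to make correlated mistakes on the hard cases. The function name `majority_accuracy` is hypothetical.

```python
from math import comb


def majority_accuracy(p, n):
    """Probability that a strict majority of n labelers is correct.

    Assumes n is odd, labels are binary, and each labeler is
    independently correct with probability p.
    """
    need = n // 2 + 1
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(need, n + 1))


single = majority_accuracy(0.8, 1)  # one labeler
three = majority_accuracy(0.8, 3)   # majority vote of three
five = majority_accuracy(0.8, 5)    # majority vote of five
```

Under these assumptions, the accuracy of the consensus label rises with the number of labelers as long as each labeler is right more often than not.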

Andrew Ng brings up another funny way ML can beat humans—by not learning from human labels very well.

When training examples are labeled inconsistently, an A.I. that beats HLP on the test set might not actually perform better than humans in practice. Take speech recognition. If humans transcribing an audio clip were to label the same speech disfluency “um” (a U.S. version) 70 percent of the time and “erm” (a U.K. variation) 30 percent of the time, then HLP would be low. Two randomly chosen labelers would agree only 58 percent of the time (0.7² + 0.3²). An A.I. model could gain a statistical advantage by picking “um” all of the time, which would be consistent with 70 percent of the time with the human-supplied label. Thus, the A.I. would beat HLP without being more accurate in a way that matters.

The scenario that Andrew draws out doesn’t seem very plausible. But the broader point about thinking hard about cases which humans are not able to label consistently is an important one and worth building systems around.
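The arithmetic in the quoted example is easy to check. The sketch below assumes labelers choose independently with the stated probabilities; the function names are mine.

```python
def agreement_rate(label_probs):
    """Chance that two independent labelers pick the same label."""
    return sum(p ** 2 for p in label_probs.values())


def best_constant_accuracy(label_probs):
    """Agreement with a random labeler when always predicting the modal label."""
    return max(label_probs.values())


probs = {"um": 0.7, "erm": 0.3}
human_agreement = agreement_rate(probs)          # 0.7**2 + 0.3**2
constant_model = best_constant_accuracy(probs)   # always predict "um"
```

The gap between the two numbers is the 'statistical advantage' in the quote: the constant model agrees with a human label more often than two humans agree with each other.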

## Too Much Churn: Estimating Customer Churn

18 Nov

A new paper uses financial transaction data to estimate customer churn in consumer-facing companies. The paper defines churn as follows:

There are three concerns with the definition:

1. The definition doesn’t make clear what is the normalizing constant for calculating the share. Given that the value “can vary between zero and one,” presumably the normalizing constant is either a) total revenue in the same year in which customer buys products, b) total revenue in the year in which the firm revenue was greater.
2. If the denominator when calculating s_fit is the total revenue in the same year in which the customer buys products from the company, it can create a problem. Consider a case where there is a customer that spends $10 in both year t and year t-k. And assume that the firm’s revenue in the same years is $10 and $20, respectively. In this case, the customer hasn’t changed his/her behavior but their share has gone from 1 to .5.
3. Beyond this, there is a semantic point. Churn is generally used to refer to attrition. In this case, it covers both customer acquisition and attrition. It also covers both a reduction and an increase in customer spending.

A Fun Aside

“Netflix similarly was not in one of our focused consumer-facing industries according to our SIC classification (it is found with two-digit SIC of 78, which mostly contains movie producers)” — this tracks with my judgment of Netflix.

## 94.5% Certain That Covid Vaccine Will Be Less Than 94.5% Effective

16 Nov

“On Sunday, an independent monitoring board broke the code to examine 95 infections that were recorded starting two weeks after volunteers’ second dose — and discovered all but five illnesses occurred in participants who got the placebo.”

Moderna Says Its COVID-19 Vaccine Is 94.5% Effective In Early Tests

The data = control group is 90 out of 15k and the treatment group is 5 out of 15k. The base rate (control group) is .6%. When the base rate is so low, it is generally hard to be confident about the ratio (1 – (5/90)). But noise is not the same as bias. One reason to think that 94.5% is an overestimate is simply that 94.5% is pretty close to the maximum point on the scale.
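One rough way to gauge the noise is to put an interval around the estimate. The sketch below takes the counts from the quote above (5 cases among the vaccinated, 90 among those who got the placebo) and assumes equal-sized arms. Conditional on the total number of cases, the number of cases in the vaccine arm is then binomial, so an exact (Clopper-Pearson) interval on that share can be mapped to an interval on efficacy. This is my back-of-the-envelope approach, not the trial's pre-specified analysis, and the function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided interval for a binomial proportion, via bisection."""
    def solve(f, target):
        # f must be increasing in p on (0, 1).
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else solve(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else solve(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


def vaccine_efficacy_ci(cases_vaccine, cases_placebo, alpha=0.05):
    """Point estimate and interval for efficacy, assuming equal-sized arms."""
    total = cases_vaccine + cases_placebo
    p_lo, p_hi = clopper_pearson(cases_vaccine, total, alpha)
    # With equal arms, share of cases in the vaccine arm is RR / (1 + RR).
    to_efficacy = lambda p: 1 - p / (1 - p)
    point = 1 - cases_vaccine / cases_placebo
    return point, to_efficacy(p_hi), to_efficacy(p_lo)


ve, ve_low, ve_high = vaccine_efficacy_ci(5, 90)
```

The interval speaks only to sampling noise. It says nothing about the bias concerns raised here.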

The other reason to worry about 94.5% is that the efficacy of a Flu vaccine is dramatically lower. (There is a difference in the time horizons over which effectiveness is measured for Flu and for Covid, with Covid being much shorter, but it is useful to keep that caveat in mind when trying to project the effectiveness of a Covid vaccine.)

## Fat Or Not: Toward ‘Proper Training of DL Models’

16 Nov

A new paper introduces a DL model to enable ‘computer aided diagnosis of obesity.’ Some concerns:

1. Better baselines: BMI is easy to calculate and it would be useful to compare the results to BMI.
2. Incorrect statement: The authors write: “the data partition in all the image sets are balanced with 50 % normal classes and 50 % obese classes for proper training of the deep learning models.” (This ought not to affect the results reported in the paper.)
3. Ignoring Within Person Correlation: The paper uses data from 100 people (50 fat, 50 healthy) and takes 647 images of them (310 obese). It then uses data augmentation to expand the dataset to 2.7k images. But in doing the train/test split, there is no mention of splitting by people, which is the right thing to do.

Start with the fact that you won’t see the people in your training data again when you put the model in production. If you don’t split train/test by people, it means that images of the people in the training set are also in the test set. This means that the test set accuracy is likely higher than it would be if you ran the model on a fresh sample.
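Splitting by people is simple to do. A minimal sketch, assuming each image carries a person identifier (the `person_ids` list and the function name are hypothetical, not from the paper):

```python
import random


def split_by_person(person_ids, test_share=0.2, seed=0):
    """Return (train_idx, test_idx) so that no person appears in both sets.

    person_ids[i] is the identifier of the person shown in image i.
    """
    people = sorted(set(person_ids))
    rng = random.Random(seed)
    rng.shuffle(people)
    n_test = max(1, round(test_share * len(people)))
    test_people = set(people[:n_test])
    train_idx = [i for i, pid in enumerate(person_ids)
                 if pid not in test_people]
    test_idx = [i for i, pid in enumerate(person_ids) if pid in test_people]
    return train_idx, test_idx


# Toy example: 65 images spread over 10 people.
person_ids = [i % 10 for i in range(65)]
train_idx, test_idx = split_by_person(person_ids)
```

The split is done over people rather than images, so the share of images in the test set will only approximately equal `test_share`.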

## Not So Robust: The Limitations of “Doubly Robust” ATE Estimators

16 Nov

Doubly Robust (DR) estimators of ATE are all the rage. One popular DR estimator is Robins’ Augmented IPW (AIPW). The reason why Robins’ AIPW estimator is called doubly robust is that if either your IPW model or your y ~ x model is correctly specified, you get ATE. Great!

Calling something “doubly robust” (DR) makes you think that the estimator is robust to (common) violations of commonly made assumptions. But DR replaces one strong assumption with one marginally less strong assumption. It is common to assume that IPW or Y ~ X are right. But DR replaces either of these with the OR clause. So how common is it to get either of the models right? Basically never. If neither model is right, you multiply the bias terms. And that ought to blow up the bias.
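For concreteness, here is a minimal sketch of the textbook AIPW estimator, taking the fitted propensities and outcome predictions as given. The argument names (`e_hat`, `mu1_hat`, `mu0_hat`) and the toy data are my own. The example illustrates the 'OR clause': the outcome model is exactly right, the propensity scores are deliberately wrong, and the estimate is still correct.

```python
from statistics import fmean


def aipw_ate(y, t, e_hat, mu1_hat, mu0_hat):
    """Augmented IPW estimate of the ATE.

    y: outcomes, t: 0/1 treatment, e_hat: estimated P(T = 1 | X),
    mu1_hat / mu0_hat: predicted outcomes under treatment / control.
    """
    terms = []
    for yi, ti, ei, m1, m0 in zip(y, t, e_hat, mu1_hat, mu0_hat):
        # Outcome-model contrast plus inverse-probability-weighted residuals.
        correction = ti * (yi - m1) / ei - (1 - ti) * (yi - m0) / (1 - ei)
        terms.append(m1 - m0 + correction)
    return fmean(terms)


# Toy data (made up): true effect is 2 for everyone.
x = [0, 1, 2, 3]
t = [1, 0, 1, 0]
y = [xi + 2 if ti == 1 else xi for xi, ti in zip(x, t)]
mu1_hat = [xi + 2 for xi in x]    # correct outcome model
mu0_hat = [xi for xi in x]
e_wrong = [0.9, 0.2, 0.6, 0.3]    # arbitrary, misspecified propensities
tau_hat = aipw_ate(y, t, e_wrong, mu1_hat, mu0_hat)
```

When the outcome model is right, the residuals are zero and the weights do not matter. When both models are off, the correction term reweights nonzero residuals with the wrong weights, which is the case discussed above.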

(There is one more reason to worry about the use of the word ‘robust.’ In statistics, it is used to convey robustness to violations of distributional assumptions.)

Given the small advance in assumptions, it turns out that the results aren’t better either (and can be substantially worse):

1. “None of the DR methods we tried … improved upon the performance of simple regression-based prediction of the missing values.” (see here.)
2. “The methods with by far the worst performance with regard to RSMSE are the Doubly Robust (DR) approaches, whose RSMSE is two or three times as large as the RSMSE for the other estimators.” (see here and the relevant table is included below.)

Some people prefer DR for efficiency. But the claim for efficiency is based on strong assumptions being met: “The local semiparametric efficiency property, which guarantees that the solution to (9) is the best estimator within its class, was derived under the assumption that both models are correct. This estimate is indeed highly efficient when the π-model is true and the y-model is highly predictive.”

p.s. When I went through some of the lecture notes posted online, I was surprised that the lecture notes explain DR as “if A or B hold, we get ATE” but do not discuss the modal case.

## Instrumental Music: When It Rains, It Pours

23 Oct

In a new paper, Jon Mellon reviews 185 papers that use weather as an instrument and finds that researchers have linked 137 variables to weather. You can read it as each paper needing to contend with 136 violations of the exclusion restriction, but the situation is likely less dire. For one, weather as an instrument has many varietals. Some papers use local (both in time and space) fluctuations in the weather for identification. At the other end, some use long-range (both in time and space) variations in weather, e.g., those wrought upon by climate. And the variables affected by each are very different. For instance, we don’t expect long-term ‘dietary diversity’ to be affected by short-term fluctuations in the local weather. A lot of the other variables are like that. For two, the weather’s potential pathways to the dependent variable of interest are often limited. For instance, as Jon notes, it is hard to imagine how rain on election day would affect government spending any other way except its effect on the election outcome.

There are, however, some potential general mechanisms through which the exclusion restriction could be violated. The first that Jon identifies is also among the oldest conjectures in social science research—weather’s effect on mood. Except that studies that purport to show the effect of weather on mood are themselves subject to selective response, e.g., when the weather is bad, more people are likely to be home, etc.

There are some other more fundamental concerns with using weather as an instrument. First, when there are no clear answers on how an instrument should be (ahem!) instrumented, the first stage of IV is ripe for specification search. In such cases, people probably pick the formulation that gives the largest F-stat. Weather falls firmly in this camp. For instance, there is the issue of how to measure rain. Should it be the amount of rain or the duration of rain, or something else? And then there is the crudeness of the instrument: ideally, we would like to measure rain over every small unit of time and space. To create a summary measure from crude observations, we often need to make judgments, and it is plausible that judgments that lead to a larger F-stat. are seen as ‘better.’

Second, for instruments that are correlated in time, we often need to make judgments to regress out longer-term correlations. For instance, as Jon points out, studies that estimate the effect of rain on voting on election day may control for long-term weather but not ‘medium term.’ “However, even short-term studies will be vulnerable to other mechanisms acting at time periods not controlled for. For instance, many turnout IV studies control for the average weather on that day of the year over the previous decade. However, this does not account for the fact that the weather on election day will be correlated with the weather over the past week or month in that area. This means that medium-term weather effects will still potentially confound short-term studies.”

The concern is wider and includes some of the RD designs that measure the effect of ad exposure on voting, etc.

## Distance Function For Matched Randomization

6 Oct

What is the right distance function for creating greater balance before we randomize? One way to do it is not to think too much about the distance function at all. For instance, this paper takes all the variables and treats them as the same (you can normalize them if you want to), and you can then use Mahalanobis distance, or what have you. There are two decisions here: about the subspace and about the weights.

Surprisingly, these ad hoc choices don’t have serious pitfalls except that the balance we finally get in Y_0 (which is the quantity of interest) may not be great. There is also one case where the method will fail. The point is best illustrated with a contrived example. Imagine there is just one observed X and let’s say it is pure noise. If we were to match on noise and then randomize, the matching would increase the imbalance in Y_0 half the time and decrease it the other half of the time. In all, the benefit of spending a lot of energy on improving balance in an ad hoc space, which may or may not help the true objective function, is likely overstated.

If we have a baseline survey and baseline Ys and we assume that Y_lagged predicts Y_0, then the optimal strategy would be to match on Y_lagged. If we have multiple time periods for which we have surveys, we can build a supervised learning model to predict Y in the next time period and match on the Y_hat. The same logic applies when we don’t have Y_lagged for all the rows. We can impute the missing values with supervised learning.
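Matching on a single score is straightforward. The sketch below sorts units by the score (Y_lagged or Y_hat), pairs adjacent units, and flips a coin within each pair. The function name and the handling of a leftover unit when the sample size is odd are my choices, not part of any published procedure.

```python
import random


def matched_pair_assignment(score, seed=0):
    """Pair units adjacent in `score` and randomize treatment within pairs.

    score[i] is the matching variable for unit i (e.g., Y_lagged or Y_hat).
    Returns a list of 0/1 treatment indicators in the original unit order.
    """
    rng = random.Random(seed)
    order = sorted(range(len(score)), key=lambda i: score[i])
    assignment = [None] * len(score)
    for a, b in zip(order[0::2], order[1::2]):
        treated = rng.choice((a, b))
        assignment[a] = int(a == treated)
        assignment[b] = int(b == treated)
    if len(order) % 2:
        # Odd sample size: the unpaired unit is assigned by a coin flip.
        assignment[order[-1]] = rng.randint(0, 1)
    return assignment


y_lagged = [5, 1, 4, 2, 9, 8]
assignment = matched_pair_assignment(y_lagged)
```

In the toy data, the pairs are the units with scores (1, 2), (4, 5), and (8, 9), and exactly one unit in each pair is treated.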

## Unmatched: The Problem With Comparing Matching Methods

5 Oct

In many matching papers, the key claim proceeds as follows: our matching method is better than others because on this set of contrived data, treatment effect estimates are closest to those from the ‘gold standard’ (experimental evidence).

Let’s side-step concerns related to an important point: evidence that a method works better than other methods on some data is hard to interpret as we do not know if the fact generalizes. Ideally, we want to understand the circumstances in which the method works better than other methods. If the claim is that the method always works better, then prove it.

There is a more fundamental concern here. Matching changes the estimand by pruning some of the data as it takes out regions with low support. But the regions that are taken out vary by the matching method. So, technically the estimands that rely on different matching methods are different—treatment effect over different sets of rows. And if the estimate from method X comes closer to the gold standard than the estimate from method Y, it may be because the set of rows method X selects produces a treatment effect that is closer to the gold standard. It doesn’t, however, mean that method X’s inference on the set of rows it selects is the best. (And we do not know how the estimate technically relates to the ATE.)

## Optimal Recruitment For Experiments: Using Pair-Wise Matching Distance to Guide Recruitment

4 Oct

Pairwise matching before randomization reduces s.e. (see here, for instance). Generally, the strategy is used to create balanced control and treatment groups from available observations. But we can use the insight for optimal sample recruitment especially in cases where we have a large panel of respondents with baseline data, like YouGov. The algorithm is similar to what YouGov already uses, except it is tailored to experiments: