ML (O)Ops! Keeping Track of Changes (Part 2)

22 Mar

The first part of the series, “Improving and Deploying On-Device Models With Confidence”, is posted here.

With Atul Dhingra

One way to automate classification is to compare new instances to a known list and plug in the majority class of the exact match. For such instance-based learning, you often don’t need to version data; you just need a hash table. When you are not relying on an exact match—most machine learning—you often need to version data to reproduce the behavior.

Reproducibility is the bedrock of mature software engineering. It is fundamental because it allows you to diagnose and backtrace issues. You can reproduce the behavior of a ‘version.’ With that power, you can correlate changes in inputs with changes in outputs. Systems that enable reproducibility, like version control, have another vital purpose—reducing the risk stemming from changes and allowing regression testing in systems that depend on data, such as ML. They reduce risk by allowing changes to be rolled back.

To reproduce outputs from machine learning models, we need to do more than store data. We also need to store hyper-parameters, details about the OS, programming language, and packages, among other things. But given the primary value of reproducibility is instrumental—diagnosis—we want not just the ability to reproduce but also the ability to understand and correlate changes. Current solutions miss the mark.

Current Solutions and Problems

One way to version data is to treat it as a binary blob. Store all the data you trained a model on to a server and store a reference to the data in your repository. If the data changes, store the new version and create a new pointer. One downside of using a <code>git lfs</code>-like mechanism is that your storage blows up. Another is that build times can be large if the local cache is small or, more generally, if access costs are large. Yet another problem is the lack of a neat interface that allows you to track more than source data.
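A minimal sketch of the blob-plus-pointer mechanism (the in-memory store and function names are hypothetical stand-ins; real systems like <code>git lfs</code> keep the blobs in a remote object store):

```python
import hashlib

BLOB_STORE = {}  # stand-in for a remote object store


def put_blob(data: bytes) -> str:
    """Store the data under its content hash and return a pointer."""
    digest = hashlib.sha256(data).hexdigest()
    BLOB_STORE[digest] = data  # identical data is stored only once
    return digest


def get_blob(pointer: str) -> bytes:
    """Resolve a pointer (the string kept in the repo) back to the data."""
    return BLOB_STORE[pointer]


v1 = put_blob(b"image-corpus-v1")
v2 = put_blob(b"image-corpus-v2")  # any change yields a new pointer
```

The repository itself only ever holds the small pointer strings; the storage blow-up happens in the blob store, since every new version is a whole new object.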

DVC purports to solve all three problems. It solves the first by providing a way to not treat the data as a blob. For instance, in a computer vision workflow, the source data is image files with some elementary tags—labels, assignments to train and test, etc. The differences between data versions are 1) changes in images (additions mostly) and 2) changes in the mapping to labels and assignments. DVC allows you to store the differences in corpora of images as a list of additional hashes to files. DVC is silent on the second point—efficient storage of changes in mappings. We return to it later. DVC purports to solve the second problem by allowing you to save to local or cloud storage. But it can still be time-consuming to download data from cloud storage buckets. The reason is as follows: each time you want to work on an experiment, you need to clone the entire cache to check out the appropriate files. And if not handled properly, the cloning time often significantly exceeds typical training times. Worse, it locks you into a cloud provider for any optimizations you may want to make to alleviate these time-consuming cache downloads. DVC purports to solve the last problem by using yaml, tags, etc. But anarchy prevails.

Future Solutions

Interpretable Changes

One of the big problems with data versioning is that the diffs are not human-readable, much less comprehensible. The diffs are usually very long, and the changes in the diff are hashes, which means that to review an MR/PR/Diff, the reviewer has to check out the change and pull the data with the updated hashes. The process can be easily improved by adding an extra layer that auto-summarizes the changes into a human-readable form. We can, of course, easily do more. We can provide ways to understand how changes to inputs correlate with changes in outputs.
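As a sketch of what that extra layer might look like (the manifest format—filename to content hash—and the function name are hypothetical):

```python
def summarize_diff(old: dict, new: dict) -> str:
    """Summarize a data diff in human-readable form instead of raw hashes."""
    added = new.keys() - old.keys()
    removed = old.keys() - new.keys()
    changed = {f for f in old.keys() & new.keys() if old[f] != new[f]}
    return (f"{len(added)} file(s) added, {len(removed)} removed, "
            f"{len(changed)} changed: {sorted(changed)}")


old = {"img1.png": "aaa", "img2.png": "bbb"}
new = {"img1.png": "aaa", "img2.png": "ccc", "img3.png": "ddd"}
print(summarize_diff(old, new))
# 1 file(s) added, 0 removed, 1 changed: ['img2.png']
```

A reviewer reads one sentence instead of checking out the change and pulling the data behind the updated hashes.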

Diff. Tables

The standard method of understanding data as a blob seems uniquely bad. For conventional rectangular databases, changes can be understood as changes in functional transformations of core data lake tables. For instance, say we store the label assignments of images in a table. And say we revise the labels of 100 images. (The core data lake tables are immutable, so the changes are executed in the downstream tables.) One conventional way of storing the changes is to use a separate table for recording changes. Another is to write an update statement that is run whenever “the v2” table is generated. This means the differences across data are now tied to a data transformation computation graph. When data transformation is inexpensive, we can delay running the transformations till the table is requested. In other cases, we can cache the tables.
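The idea can be sketched as follows (table names and labels are made up; a real system would express this as SQL against the data lake, but the mechanics are the same):

```python
# The core label table is immutable; revisions live in a separate table,
# and the "v2" view is materialized on request by replaying revisions.
labels_v1 = {"img1": "cat", "img2": "dog", "img3": "cat"}
revisions = [("img2", "cat")]  # (image_id, corrected_label)


def materialize(base: dict, revs: list) -> dict:
    """Produce the downstream table by replaying revisions over the base."""
    out = dict(base)           # base table stays untouched
    for image_id, label in revs:
        out[image_id] = label  # each revision is applied downstream
    return out


labels_v2 = materialize(labels_v1, revisions)
```

Because the diff is a small revision table rather than a second full copy, the difference between versions is both cheap to store and trivially human-readable.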

ML (O)Ops! Improving and Deploying On-Device Models With Confidence (Part 1)

21 Feb

With Atul Dhingra.

Part 1 of a multi-part series.

It is well known that ML Engineers today spend most of their time doing things that do not have a lot to do with machine learning. They spend time working on technically unsophisticated but important things like deployment of models, keeping track of experiments, etc.—operations. Atul and I dive into the reasons behind the status quo and propose solutions, starting with issues to do with on-device deployments. 

Performance on Device

The deployment of on-device models is complicated by the fact that the infrastructure used for training is different from what is used for production. This leads to many tedious rollbacks. 

The underlying problem is missing data. We are missing data on the latency in prediction, which is a function of i/o latency and the time taken to compute. One way to impute the missing data is to build a model that predicts latency based on various features of the deployed model. Given many companies have gone through thousands of deployments and rollbacks, there is rich data to learn from. Another is to directly measure the time with ‘shadow deployments’: performance on redundant chips colocated with the production chip and getting exactly the same data at about the same time (a small lag in passing on the data to the redundant chips is just fine as we can start the clock at a different time).
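As an illustrative sketch of the imputation idea (the feature, numbers, and nearest-neighbor rule are all hypothetical; a real model would condition on chip, i/o path, batch size, and much more):

```python
# Predict a new model's latency from the most similar past deployment,
# here "similar" crudely means closest in model size.
past_deployments = [
    # (model_size_mb, observed_latency_ms)
    (10, 12.0),
    (50, 35.0),
    (200, 110.0),
]


def predict_latency(model_size_mb: float) -> float:
    """Impute latency from the nearest past deployment by model size."""
    nearest = min(past_deployments,
                  key=lambda d: abs(d[0] - model_size_mb))
    return nearest[1]


print(predict_latency(45))  # nearest recorded deployment is the 50 MB one
```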

Predicting latency given a model and deployment architecture solves the problem of deploying reliably. It doesn’t solve the problem of how to improve the performance of the system given a model. To improve the production performance of ML systems, companies need to analyze the data, e.g., compute the correlation between load on the edge server and latency, and generate additional data by experimenting with various easily modifiable parts of the system, e.g., increasing capacity of the edge server, etc. (If you are a cloud service provider like AWS, you can learn from all the combinations of infrastructure that exist to predict latency for various architectures given a model and propose solutions to the customer.)

There is plausibly also a need for a service that helps companies decide which chip is optimal for deployment: a service that provides data on the latency of a model on different chips.

Build Software for the Lay User

14 Feb

Most word processing software helpfully points out grammatical errors and spelling mistakes. Some even autocorrect. And some, like Grammarly, even give style advice.

Now consider software used for business statistics. Say you want to compute the correlation between two vectors: [100, 200, 300, 400, 500, 600] and [1, 2, 3, 4, 5, 17000]. Most (all?) software will output .65. (The software assumes you want Pearson’s correlation.) Experts know that the relatively large value in the second vector has a large influence on the correlation. For instance, switching it to -17000 reverses the correlation coefficient to -.65. And if you remove the last observation, the correlation is 1. But a lay user would be none the wiser. Common software, e.g., Excel, R, Stata, Google Sheets, etc., does not warn the user about the outlier and its potential impact on the result. It should.
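The arithmetic can be checked in a few lines (the first vector runs from 100 to 600 in steps of 100):

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation, computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


x = [100, 200, 300, 400, 500, 600]
y = [1, 2, 3, 4, 5, 17000]

print(round(pearson(x, y), 2))                  # 0.65
print(round(pearson(x, y[:-1] + [-17000]), 2))  # -0.65: one sign flip
print(round(pearson(x[:-1], y[:-1]), 2))        # 1.0: outlier removed
```

One observation out of six moves the coefficient from 1 to .65 to -.65.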

Take another example—the fickleness of the interpretation of AUC when you have binary predictors (see here), as much depends on how you treat ties. It is an obvious but subtle point. Commonly used statistical software, however, does not warn people about the issue.

Given the rate of increase in the production of knowledge, increasingly everyone is a lay user. For instance, in 2013, Lin showed that estimating the ATE using OLS with a full set of interactions improves the precision of the estimate. But such analyses are uncommon in economics papers. The analysis could be absent for a variety of reasons: 1. ignorance, 2. difficulty in estimating the model, 3. not believing the result, etc. However, only ignorance withstands scrutiny. The model is easy to estimate, so the second explanation is unlikely to explain much. The last explanation also seems unlikely, given the result was published in a prominent statistical journal and experts use it.

If ignorance is the primary explanation, should the onus of being well informed about the latest useful discoveries in methods fall on researchers working in a substantive area? Plausibly. But that is clearly not working very well. One way to accelerate the dissemination of useful discoveries is via software, where you can provide such guidance as ‘warnings.’ 

The guidance can be put in manually. Or we can use machine learning, exploiting the strategy used by Grammarly, which uses expert editors to edit lay user sentences and uses that as training data.

We can improve science by building software that provides better guidance. The worst-case for such software is probably business-as-usual, where some researchers get bad advice and many get no advice.

Superhuman: Can ML Beat Human-Level Performance in Supervised Models?

20 Dec

A supervised model cannot do better than its labels. (I revisit this point later.) So the trick is to make labels as good as you can. The errors in labels stem from three sources: 

  1. Lack of Effort: The more effort people spend labeling something, presumably the more accurate the labels will be.
  2. Unclear Directions: Unclear directions can result from a. poorly written directions, b. conceptual issues, c. poor understanding. Let’s tackle conceptual issues first. Say you are labeling the topic of news articles. Say you come across an article about how Hillary Clinton’s hairstyle has evolved over the years. Should it be labeled as politics, or should it be labeled as entertainment (or my preferred label: worthless)? It depends on taste and the use case. Whatever the decision, it needs to be codified (and clarified) in the directions given to labelers. Poor writing is generally a result of inadequate effort.
  3. Hardness: Is that a 4 or a 7? We have all suffered at the hands of CAPTCHA to know that some tasks are harder than others.   

The fix for the first problem is obvious. To increase effort, incentivize. Incentivize by paying for correctness—measured over known-knowns—or by penalizing mistakes. And by providing feedback to people on the money they lost or how much more others with a better record made.

Solutions for unclear directions vary by the underlying problem. To address conceptual issues, incentivize people to flag (and comment on) cases where the directions are unclear and build a system to collect and review prediction errors. To figure out if the directions are unclear, quiz people on comprehension and archetypal cases. 

Can ML Performance Be Better Than Humans?

If humans label the dataset, can ML be better than humans? The first sentence of the article suggests not. Of course, we have yet to define what humans are doing. If the benchmark is labels provided by a poorly motivated and trained workforce and the model is trained on labels provided by motivated and trained people, ML can do better. The consensus label provided by a group of people will also generally be less noisy than one provided by a single person.    

Andrew Ng brings up another funny way ML can beat humans—by not learning from human labels very well. 

When training examples are labeled inconsistently, an A.I. that beats HLP on the test set might not actually perform better than humans in practice. Take speech recognition. If humans transcribing an audio clip were to label the same speech disfluency “um” (a U.S. version) 70 percent of the time and “erm” (a U.K. variation) 30 percent of the time, then HLP would be low. Two randomly chosen labelers would agree only 58 percent of the time (0.7² + 0.3²). An A.I. model could gain a statistical advantage by picking “um” all of the time, which would be consistent with 70 percent of the time with the human-supplied label. Thus, the A.I. would beat HLP without being more accurate in a way that matters.

The scenario that Andrew draws out doesn’t seem very plausible. But the broader point about thinking hard about cases which humans are not able to label consistently is an important one and worth building systems around.
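The arithmetic behind the quote is easy to verify:

```python
# With labels split 70/30 between "um" and "erm", two random labelers
# agree 0.7^2 + 0.3^2 = 0.58 of the time, while a model that always
# says "um" agrees with 0.70 of the human-supplied labels.
p_um, p_erm = 0.7, 0.3

human_agreement = p_um ** 2 + p_erm ** 2  # two labelers pick the same label
model_agreement = p_um                    # model always predicts the majority

print(human_agreement, model_agreement)
```

The model "beats" human-level performance (0.70 vs. 0.58) without transcribing the disfluency any better.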

Too Much Churn: Estimating Customer Churn

18 Nov

A new paper uses financial transaction data to estimate customer churn in consumer-facing companies. The paper defines churn as follows:

There are three concerns with the definition:

  1. The definition doesn’t make clear what the normalizing constant for calculating the share is. Given that the value “can vary between zero and one,” presumably the normalizing constant is either a) the total revenue in the same year in which the customer buys products, or b) the total revenue in the year in which the firm’s revenue was greater.
  2. If the denominator when calculating s_fit is the total revenue in the same year in which the customer buys products from the company, it can create a problem. Consider a case where there is a customer that spends $10 in both year t and year t-k. And assume that the firm’s revenue in the same years is $10 and $20 respectively. In this case, the customer hasn’t changed his/her behavior but their share has gone from 1 to .5.
  3. Beyond this, there is a semantic point. Churn is generally used to refer to attrition. In this case, it covers both customer acquisition and attrition. It also covers both a reduction and an increase in customer spending.
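A few lines make the second concern concrete (numbers from the example above):

```python
# A customer spends $10 in both years, but the firm's revenue doubles.
# If the share is computed against same-year firm revenue, the share
# halves even though the customer's behavior is unchanged.
spend_t_minus_k, spend_t = 10, 10
firm_revenue_t_minus_k, firm_revenue_t = 10, 20

share_before = spend_t_minus_k / firm_revenue_t_minus_k
share_after = spend_t / firm_revenue_t

print(share_before, share_after)  # 1.0 0.5 -- behavior unchanged
```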

A Fun Aside

“Netflix similarly was not in one of our focused consumer-facing industries according to our SIC classification (it is found with two-digit SIC of 78, which mostly contains movie producers)” — this tracks with my judgment of Netflix.

94.5% Certain That Covid Vaccine Will Be Less Than 94.5% Effective

16 Nov

“On Sunday, an independent monitoring board broke the code to examine 95 infections that were recorded starting two weeks after volunteers’ second dose — and discovered all but five illnesses occurred in participants who got the placebo.”

Moderna Says Its COVID-19 Vaccine Is 94.5% Effective In Early Tests

The data: the treatment group has 5 infections out of 15k, and the control group has 90 out of 15k. The base rate (control group) is .6%. When the base rate is so low, it is generally hard to be confident about the ratio (1 – (5/90)). But noise is not the same as bias. One reason to think 94.5% is an overestimate is simply that 94.5% is pretty close to the maximum point on the scale.

The other reason to worry about 94.5% is that the efficacy of a Flu vaccine is dramatically lower. (There is a difference in the time horizons over which effectiveness is measured for Flu and for Covid, with Covid’s being much shorter, but it is useful to keep that as a caveat when trying to project the effectiveness of the Covid vaccine.)

Fat Or Not: Toward ‘Proper Training of DL Models’

16 Nov

A new paper introduces a DL model to enable ‘computer aided diagnosis of obesity.’ Some concerns:

  1. Better baselines: BMI is easy to calculate and it would be useful to compare the results to BMI.
  2. Incorrect statement: The authors write: “the data partition in all the image sets are balanced with 50 % normal classes and 50 % obese classes for proper training of the deep learning models.” (This ought not to affect the results reported in the paper.)
  3. Ignoring Within Person Correlation: The paper uses data from 100 people (50 fat, 50 healthy) and takes 647 images of them (310 obese). It then uses data augmentation to expand the dataset to 2.7k images. But in doing the train/test split, there is no mention of splitting by people, which is the right thing to do.

    Start with the fact that you won’t see the people in your training data again when you put the model in production. If you don’t split train/test by people, images of the people in the training set are also in the test set. This means that the test set accuracy is likely higher than what you would see on a fresh sample.
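A minimal sketch of splitting by person rather than by image (person IDs, counts, and the function name are made up):

```python
import random


def split_by_person(image_rows, test_frac=0.2, seed=0):
    """Split (person_id, image_id) rows so no person spans both sets."""
    people = sorted({pid for pid, _ in image_rows})
    rng = random.Random(seed)
    rng.shuffle(people)
    n_test = max(1, int(len(people) * test_frac))
    test_people = set(people[:n_test])
    train = [r for r in image_rows if r[0] not in test_people]
    test = [r for r in image_rows if r[0] in test_people]
    return train, test


# 10 hypothetical people with 5 images each
rows = [(p, f"img_{p}_{i}") for p in range(10) for i in range(5)]
train, test = split_by_person(rows)

# No person appears in both splits:
assert {p for p, _ in train}.isdisjoint({p for p, _ in test})
```

With this split, test accuracy reflects performance on unseen people, which is the production setting.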

Not So Robust: The Limitations of “Doubly Robust” ATE Estimators

16 Nov

Doubly Robust (DR) estimators of ATE are all the rage. One popular DR estimator is Robins’ Augmented IPW (AIPW). The reason why Robins’ AIPW estimator is called doubly robust is that if either your IPW model or your y ~ x model is correctly specified, you get ATE. Great!

Calling something “doubly robust” (DR) makes you think that the estimator is robust to (common) violations of commonly made assumptions. But DR merely replaces one strong assumption with a marginally less strong one. It is common to assume that the IPW model is right or that the Y ~ X model is right; DR replaces either single assumption with their disjunction. So how common is it to get at least one of the models right? Basically never. If neither model is right, the bias terms multiply. And that ought to blow up the bias.
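For concreteness, here is a toy sketch of the AIPW score (the inputs are hand-picked values, not fitted models; in practice mu1/mu0 come from an outcome model and e from a propensity model):

```python
def aipw_ate(y, t, mu1, mu0, e):
    """Robins' AIPW estimate of the ATE.

    y: outcomes, t: treatment indicators (0/1),
    mu1/mu0: outcome-model predictions under treatment/control,
    e: propensity scores.
    """
    n = len(y)
    total = 0.0
    for yi, ti, m1, m0, ei in zip(y, t, mu1, mu0, e):
        total += (m1 - m0
                  + ti * (yi - m1) / ei
                  - (1 - ti) * (yi - m0) / (1 - ei))
    return total / n


# Toy data where both models happen to be exactly right:
ate = aipw_ate(y=[1, 0], t=[1, 0], mu1=[1, 1], mu0=[0, 0], e=[0.5, 0.5])
print(ate)  # 1.0
```

When both models are right, the correction terms vanish; when only one is, the other's error is averaged out. When neither is, the errors interact, which is the modal case the text worries about.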

(There is one more reason to worry about the use of the word ‘robust.’ In statistics, it is used to convey robustness to violations of distributional assumptions.)

Given the small advance in assumptions, it turns out that the results aren’t better either (and can be substantially worse):

  1. “None of the DR methods we tried … improved upon the performance of simple regression-based prediction of the missing values.” (See here.)
  2. “The methods with by far the worst performance with regard to RSMSE are the Doubly Robust (DR) approaches, whose RSMSE is two or three times as large as the RSMSE for the other estimators.” (see here and the relevant table is included below.)
From Kern et al. 2016

Some people prefer DR for efficiency. But the claim for efficiency is based on strong assumptions being met: “The local semiparametric efficiency property, which guarantees that the solution to (9) is the best estimator within its class, was derived under the assumption that both models are correct. This estimate is indeed highly efficient when the π-model is true and the y-model is highly predictive.”

p.s. When I went through some of the lecture notes posted online, I was surprised that the lecture notes explain DR as “if A or B hold, we get ATE” but do not discuss the modal case.

Instrumental Music: When It Rains, It Pours

23 Oct

In a new paper, Jon Mellon reviews 185 papers that use weather as an instrument and finds that researchers have linked 137 variables to weather. You can read it as each paper needing to contend with 136 violations of the exclusion restriction, but the situation is likely less dire. For one, weather as an instrument has many varietals. Some papers use local (both in time and space) fluctuations in the weather for identification. At the other end, some use long-range (both in time and space) variations in weather, e.g., those wrought upon by climate. And the variables affected by each are very different. For instance, we don’t expect long-term ‘dietary diversity’ to be affected by short-term fluctuations in the local weather. A lot of the other variables are like that. For two, the weather’s potential pathways to the dependent variable of interest are often limited. For instance, as Jon notes, it is hard to imagine how rain on election day would affect government spending any other way except its effect on the election outcome. 

There are, however, some potential general mechanisms through which exclusion restriction could be violated. The first that Jon identifies is also among the oldest conjecture in social science research—weather’s effect on mood. Except that studies that purport to show the effect of weather on mood are themselves subject to selective response, e.g., when the weather is bad, more people are likely to be home, etc. 

There are some other more fundamental concerns with using weather as an instrument. First, when there are no clear answers on how an instrument should be (ahem!) instrumented, the first stage of IV is ripe for specification search. In such cases, people probably pick up the formulation that gives the largest F-stat. Weather falls firmly in this camp. For instance, there is a measurement issue about how to measure rain. Should it be the amount of rain or the duration of rain, or something else? And then there is a crudeness issue of the instrument as ideally, we would like to measure rain over every small geographic unit (of time and space). To create a summary measure from crude observations, we often need to make judgments, and it is plausible that judgments that lead to a larger F-stat. are seen as ‘better.’

Second, for instruments that are correlated in time, we need to often make judgments to regress out longer-term correlations. For instance, as Jon points out, studies that estimate the effect of rain on voting on election day may control long-term weather but not ‘medium term.’ “However, even short-term studies will be vulnerable to other mechanisms acting at time periods not controlled for. For instance, many turnout IV studies control for the average weather on that day of the year over the previous decade. However, this does not account for the fact that the weather on election day will be correlated with the weather over the past week or month in that area. This means that medium-term weather effects will still potentially confound short-term studies.”

The concern is wider and includes some of the RD designs that measure the effect of ad exposure on voting, etc.

Distance Function For Matched Randomization

6 Oct

What is the right distance function for creating greater balance before we randomize? One way is to not think too much about the distance function at all. For instance, this paper takes all the variables, treats them the same (you can normalize if you want to), and uses Mahalanobis distance, or what have you. There are two decisions here: about the subspace and about the weights.

Surprisingly, these ad hoc choices don’t have serious pitfalls except that the balance we finally get in the Y_0 (which is the quantity of interest) may not be great. There is also one case where the method will fail. The point is best illustrated with a contrived example. Imagine there is just one observed X and say it is pure noise. If we were to match on noise and then randomize, matching will increase the imbalance in the Y_0 half the time and decrease it the other half. In all, the benefit of spending a lot of energy on improving balance in an ad hoc space, which may or may not help the true objective function, is likely overstated.

If we have a baseline survey and baseline Ys and we assume that Y_lagged predicts Y_0, then the optimal strategy would be to match on lagged Y. If we have multiple time periods for which we have surveys, we can build a supervised learning model to predict Y in the next time period and match on the Y_hat. The same logic applies when we don’t have lagged_Y for all the rows. We can impute them with supervised learning.
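A sketch of the pair-then-randomize strategy (the predicted outcomes are hypothetical): sort units by predicted Y, pair adjacent units, and flip a coin within each pair.

```python
import random


def pair_and_randomize(y_hat, seed=0):
    """Pair units adjacent in predicted outcome; randomize within pairs."""
    rng = random.Random(seed)
    order = sorted(range(len(y_hat)), key=lambda i: y_hat[i])
    assignment = {}
    for a, b in zip(order[0::2], order[1::2]):
        treated = rng.choice([a, b])  # coin flip within the pair
        assignment[a] = int(a == treated)
        assignment[b] = int(b == treated)
    return assignment


y_hat = [3.1, 9.4, 2.8, 9.9, 5.0, 5.2]  # hypothetical predicted outcomes
assignment = pair_and_randomize(y_hat)

# Exactly half the units are treated, and pairs are matched on y_hat:
assert sum(assignment.values()) == 3
```

Because pairs are close in predicted Y, treatment and control groups are balanced on the quantity that actually matters.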

Unmatched: The Problem With Comparing Matching Methods

5 Oct

In many matching papers, the key claim proceeds as follows: our matching method is better than others because on this set of contrived data, treatment effect estimates are closest to those from the ‘gold standard’ (experimental evidence).

Let’s side-step concerns related to an important point: evidence that a method works better than other methods on some data is hard to interpret as we do not know if the fact generalizes. Ideally, we want to understand the circumstances in which the method works better than other methods. If the claim is that the method always works better, then prove it.

There is a more fundamental concern here. Matching changes the estimand by pruning some of the data as it takes out regions with low support. But the regions that are taken out vary by the matching method. So, technically the estimands that rely on different matching methods are different—treatment effect over different sets of rows. And if the estimate from method X comes closer to the gold standard than the estimate from method Y, it may be because the set of rows method X selects produce a treatment effect that is closer to the gold standard. It doesn’t however mean that method X’s inference on the set of rows it selects is the best. (And we do not know how the estimate technically relates to the ATE.)

Optimal Recruitment For Experiments: Using Pair-Wise Matching Distance to Guide Recruitment

4 Oct

Pairwise matching before randomization reduces s.e. (see here, for instance). Generally, the strategy is used to create balanced control and treatment groups from available observations. But we can use the insight for optimal sample recruitment especially in cases where we have a large panel of respondents with baseline data, like YouGov. The algorithm is similar to what YouGov already uses, except it is tailored to experiments:

  1. Start with a random sample.
  2. Come up with optimal pairs based on whatever criteria you have chosen.
  3. Reverse sort pairs by distance with the pairs with the largest distance at the top.
  4. Find the best match in the rest of the panel file for one of the randomly chosen points in the pair. (If you have multiple equivalent matches, pick one at random.)
  5. Proceed as far down the list as needed.

Technically, we can go from step 1 to step 4 if we choose a random sample that is half the size we want for the experiment. We just need to find the best matching pair for each respondent.
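The steps above can be sketched as follows (1-D “covariates” for simplicity; real use would rely on a multivariate distance, and the function name is made up):

```python
def improve_pairs(sample, panel):
    """Steps 1-4: pair a sample, then repair the worst pairs from the panel.

    sample: covariate values of the initial random sample.
    panel: covariate values of unused panel respondents.
    """
    sample = sorted(sample)
    pairs = list(zip(sample[0::2], sample[1::2]))               # step 2
    pairs.sort(key=lambda p: abs(p[0] - p[1]), reverse=True)    # step 3
    improved = []
    pool = list(panel)
    for a, b in pairs:                                          # step 4
        best = min(pool, key=lambda r: abs(r - a))
        if abs(best - a) < abs(b - a):
            pool.remove(best)
            improved.append((a, best))  # recruit the closer panel member
        else:
            improved.append((a, b))     # keep the original pair
    return improved


pairs = improve_pairs(sample=[1.0, 4.0, 10.0, 10.5], panel=[1.1, 9.0])
print(pairs)  # the badly matched (1.0, 4.0) pair is repaired to (1.0, 1.1)
```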

Another ANES Goof-em-up: VCF0731

30 Aug

By Rob Lytle

At this point, it’s well established that the ANES CDF’s codebook is not to be trusted (I’m repeating “not to be trusted” to include a second link!). Recently, I stumbled across another example of incorrect coding in the cumulative data file, this time in VCF0731 – Do you ever discuss politics with your family or friends?

The codebook reports 5 levels:

Do you ever discuss politics with your family or friends?

1. Yes
5. No

8. DK
9. NA

INAP. question not used

However, when we load the variable and examine the unique values:

# pulling the ANES CDF from a GitHub repository
cdf <- rio::import("")

# examine the unique values of the variable
unique(cdf$VCF0731)

## [1] NA  5  1  6  7

We see a completely different coding scheme. We are left adrift, wondering “What is 6? What is 7?” Do 1 and 5 really mean “yes” and “no”?

We may never know.

For a survey that costs several million dollars to conduct, you’d think we could expect a double-checked codebook (or at least some kind of version control to easily fix these things as they’re identified).

AFib: Apple Watch Did Not Increase Atrial Fibrillation Diagnoses

28 Aug

A new paper purportedly shows that the 2018 release of the Apple Watch, which supported an ECG app, did not cause an increase in AFib diagnoses (mean = −0.008).

They make the claim based on 60M visits from 1,270 practices across 2 years.

Here are some things to think about:

  1. Expected effect size. Say the base AF rate is .41%. Let’s say 10% have the ECG app + Apple Watch. (You have to make some assumptions about how quickly people downloaded the app. I am making a generous assumption that 10% do it on the day of release.) For the 10%, say the rate is .51%. Add’l diagnoses expected = .001*3M ~ 3k.
  2. Time trend. 2018-19 line is significantly higher (given the baseline) than 2016-2017. It is unlikely to be explained by the aging of the population. Is there a time trend? What explains it? More acutely, diff. in diff. doesn’t account for that.
  3. Choice of the time period. When you have observations over multiple time periods pre-treatment and post-treatment, the inference depends on which time period you use. For instance,  if I do an “ocular distortion test”, the diff. in diff. with observations from Aug./Sep. would suggest a large positive impact. For a more transparent account of assumptions, see (h/t Kyle Foreman).
  4. Clustering of s.e. Some correlation in diagnosis because of facility (doctor) which is unaccounted for.
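The back-of-the-envelope in point 1 can be written out (all numbers are the assumptions stated above, including ~30M visits per year):

```python
# Assumptions from the text: base AFib rate of 0.41%, an elevated rate
# of 0.51% among the ~10% of visits with the watch, ~30M visits/year.
visits = 30_000_000
watch_share = 0.10
base_rate, watch_rate = 0.0041, 0.0051

extra_diagnoses = (watch_rate - base_rate) * watch_share * visits
print(extra_diagnoses)  # ~3,000 additional diagnoses expected
```

An expected effect of ~3k diagnoses against a diff-in-diff estimate of essentially zero is the tension the post is pointing at.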

Survey Experiments With Truth: Learning From Survey Experiments

27 Aug

Tools define science. Not only do they determine how science is practiced but also what questions are asked. Take survey experiments, for example. Since the advent of online survey platforms, which made conducting survey experiments trivial, the lure of convenience and internal validity has persuaded legions of researchers to use survey experiments to understand the world.

Conventional survey experiments are modest tools. Paul Sniderman writes,

“These three limitations of survey experiments—modesty of treatment, modesty of scale, and modesty of measurement—need constantly to be borne in mind when brandishing the term experiment as a prestige enhancer.”

Paul Sniderman

Note: We can collapse these three concerns into two—treatment (which includes ‘scale’ as Paul defines it—the amount of time) and measurement.

But skillful artisans have used this modest tool to great effect. Famously, Kahneman and Tversky used survey experiments, e.g., the Asian Disease Problem, to shed light on how people decide. More recently, Paul Sniderman and Tom Piazza have used survey experiments to shed light on an unsavory aspect of human decision making: discrimination. Aside from shedding light on human decision making, researchers have also used survey experiments to understand what survey measures mean, e.g., Ahler and Sood.

The good, however, has come with the bad; insight has often come with irreflection. In particular, Paul Sniderman implicitly points to two common mistakes that people make:

  1. Not Learning From the Control Group. The focus on differences in means means that we sometimes fail to reflect on what the data in the Control Group tells us about the world. Take the paper on partisan expressive responding, for instance. The topline from the paper is that expressive responding explains half of the partisan gap. But it misses the bigger story—the partisan differences in the Control Group are much smaller than what people expect, just about 6.5% (see here). (Here’s what I wrote in 2016.)
  2. Not Putting the Effect Size in Context. A focus on significance testing means that we sometimes fail to reflect on the modesty of effect sizes. For instance, providing people $1 for a correct answer within the context of an online survey interview is a large premium. And if providing a dollar each on 12 (included) questions nudges people from an average of 4.5 correct responses to 5, it suggests that people are resistant to learning or impressively confident that what they know is right. Leaving $7 on the table tells us more than the .5, around which the paper is written. 

    More broadly, researchers are obtuse to the point that sometimes what the results show is how impressively modest the movement is when you ratchet up the dosage. For instance, if an overwhelming number of African Americans favor Whites who have scored just a few points more than a Black student, it is a telling testament to their endorsement of meritocracy.

Nothing to See Here: Statistical Power and “Oversight”

13 Aug

“Thus, when we calculate the net degree of expressive responding by subtracting the acceptance effect from the rejection effect—essentially differencing off the baseline effect of the incentive from the reduction in rumor acceptance with payment—we find that the net expressive effect is negative 0.5%—the opposite sign of what we would expect if there was expressive responding. However, the substantive size of the estimate of the expressive effect is trivial. Moreover, the standard error on this estimate is 10.6, meaning the estimate of expressive responding is essentially zero.”

(Note: This is not a full review of all the claims in the paper. There is more data in the paper than in the quote above. I am merely using the quote to clarify a couple of statistical points.)

There are two main points:

  1. The fact that the estimate is close to zero and the fact that the s.e. is very large are technically unrelated. The last line of the quote, however, seems to draw a relationship between the two.
  2. The estimated effect sizes of expressive responding in the literature are much smaller than the s.e. Bullock et al. (Table 2) estimate the effect of expressive responding at about 4% and Prior et al. (Figure 1) at about 5.5% (“Figure 1(a) shows, the model recovers the raw means from Table 1, indicating a drop in bias from 11.8 to 6.3.”). Thus, one reasonable inference is that the study is underpowered to detect the expected effect sizes.
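A quick back-of-the-envelope check makes the underpowering point concrete. Here is a minimal sketch in Python using the standard minimum-detectable-effect approximation; the s.e. of 10.6 comes from the quoted passage, while 80% power and a two-sided 5% test are conventional assumptions of mine:

```python
from statistics import NormalDist

# Standard error reported in the quoted passage (percentage points).
se = 10.6
alpha, power = 0.05, 0.80  # conventional assumptions

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
z_power = NormalDist().inv_cdf(power)          # ~0.84

# Minimum detectable effect: the smallest true effect this design
# would detect with the stated power.
mde = (z_alpha + z_power) * se
print(f"MDE ~ {mde:.1f} percentage points")  # ~29.7
```

An MDE near 30 percentage points dwarfs the 4–5.5 point effects reported by Bullock et al. and Prior et al., which is the underpowering argument in plainer terms.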

Casual Inference: Errors in Everyday Causal Inference

12 Aug

Why are things the way they are? What is the effect of something? Both of these reverse and forward causation questions are vital.

When I was at Stanford, I took a class with a pugnacious psychometrician, David Rogosa. David had two pet peeves, one of which was people making causal claims with observational data. And it is in David’s class that I learned the pejorative for such claims. With great relish, David referred to such claims as ‘casual inference.’ (Since then, I have come up with another pejorative phrase for such claims—cosal inference—as in merely dressing up as causal inference.)

It turns out that despite its limitations, casual inference is quite common. Here are some fashionable costumes:

  1. 7 Habits of Successful People: We have all seen business books with such titles. The underlying message of these books is: adopt these habits, and you will be successful too! Let’s follow the reasoning and see where it falls apart. One stereotype about successful people is that they wake up early. And the implication is that if you wake up early, you can be successful too. It *seems* right. It agrees with folk wisdom that discomfort causes success. But can we reliably draw inferences about what less successful people should do based on what successful people do? No. For one, we know nothing about the habits of less successful people. It could be that less successful people wake up *earlier* than the more successful people. Certainly, growing up in India, I recall daily laborers waking up much earlier than people living in bungalows. And when you think about it, the claim that servants wake up before masters seems uncontroversial. It may even be routine enough to be canonized as a law—the Downton Abbey law. The upshot is that when you select on the dependent variable, i.e., only look at cases where the variable takes certain values, e.g., only look at the habits of financially successful people, even correlation is not guaranteed. This means that you don’t even get to mock the claim with the jibe that “correlation is not causation.”

    Let’s go back to Goji’s delivery service for another example. One of the ‘tricks’ we had discussed was to sample failures. If you do that, you are selecting on the dependent variable. And while it is a good heuristic, it can lead you astray. For instance, let’s say that most of the late deliveries are early morning deliveries. You may infer that delivering at another time may improve outcomes. Except, when you look at the data, you find that the bulk of your deliveries are in the morning. And the rate at which deliveries run late is *lower* in the early morning than during other times.

    There is a yet more famous example of things going awry when you select on the dependent variable. During World War II, statisticians were asked where armor should be added to planes. On the aircraft that returned, the damage was concentrated in a few areas, like the wings. The top-of-head answer is to reinforce the areas hit most often. But if you think about the planes that didn’t return, you get to the right answer: reinforce the areas that weren’t hit. In the literature, people call this kind of error survivorship bias. But it is at root a problem of selecting on the dependent variable (whether or not a plane returned): we only looked at the planes that returned.

  2. More frequent system crashes cause people to renew their software license. It is a mistake to treat correlation as causation. There are many reasons why doing so can lead you astray. The rarest reason is that lots of odd things are correlated in the world because of luck alone. The point is hilariously illustrated by a set of graphs showing a large correlation between conceptually unrelated things, e.g., there is a large correlation between total worldwide non-commercial space launches and the number of sociology doctorates that are awarded each year.

    A more common scenario is illustrated by the example in the title of this point. Commonly, there is a ‘lurking’ or ‘confounding’ variable that explains both sides. In our case, the more frequently a person uses a system, the more crashes they experience. And it makes sense that people who use the system most frequently also need the software the most and renew the license most often.

    Another common but more subtle reason is called Simpson’s paradox. Sometimes the correlation you see is “wrong.” You may see a correlation in the aggregate, but the correlation runs the opposite way when you break it down by group. Gender bias in U.C. Berkeley admissions provides a famous example. In 1973, 44% of the men who applied to graduate programs were admitted, whereas only 35% of the women were. But when you split by department, which is where admissions decisions were made, women generally had a higher batting average than men. The reason for the reversal: women applied more often to more competitive departments, like—wait for it—English, while men were more likely to apply to less competitive departments, like Engineering. None of this is to say that there isn’t bias against women. It is merely to point out that the pattern in aggregated data may not hold when you split the data into relevant chunks.

    It is also important to keep in mind the opposite of correlation is not causation—lack of correlation does not imply a lack of causation.

  3. Mayor Giuliani brought the NYC crime rate down. There are two potential errors here:
    • Forgetting about ecological trends. Crime rates in other big US cities went down at the same time as they did in NY, sometimes more steeply. When faced with a causal claim, it is good to check how ‘similar’ people fared. The Difference-in-Differences estimator builds on this intuition.
    • Treating the temporally proximate as causal. Say you had a headache, took some medicine, and your headache went away. It could be that the headache went away by itself, as headaches often do.

  4. I took this homeopathic medication and my headache went away. Placebo effects for real ailments are a bit mysterious. But mysterious as they may be, they are real enough. Not accounting for placebo effects misleads us into ascribing the total effect to the medicine.

  5. Shallow causation. We ascribe too much weight to immediate causes than to causes that are a few layers deeper.

  6. Monocausation: In everyday conversations, it is common for people to speak as if x is the only cause of y.

  7. Big Causation: Another common pitfall is reading x causes y as x causes y to change a lot. This is partly a consequence of mistaking statistical significance for substantive significance, and partly a consequence of not paying close enough attention to the numbers.

  8. Same Effect: Lastly, many people take causal claims to mean that the effect is the same across people. 
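The Simpson’s paradox pattern discussed above is easy to reproduce with made-up numbers. A minimal sketch in Python—the admissions counts below are invented for illustration and are not the actual 1973 Berkeley data:

```python
# Hypothetical admissions counts, invented for illustration
# (not the actual 1973 Berkeley data).
data = {
    # department: {group: (admitted, applied)}
    "Competitive":      {"men": (15, 100),  "women": (40, 200)},
    "Less competitive": {"men": (180, 300), "women": (33, 50)},
}

# Within each department, women have the higher admit rate...
for dept, groups in data.items():
    men_rate = groups["men"][0] / groups["men"][1]
    women_rate = groups["women"][0] / groups["women"][1]
    print(f"{dept}: men {men_rate:.0%}, women {women_rate:.0%}")

# ...yet in the aggregate men do better, because women apply
# disproportionately to the competitive department.
men_admitted = sum(g["men"][0] for g in data.values())
men_applied = sum(g["men"][1] for g in data.values())
women_admitted = sum(g["women"][0] for g in data.values())
women_applied = sum(g["women"][1] for g in data.values())
print(f"Overall: men {men_admitted / men_applied:.0%}, "
      f"women {women_admitted / women_applied:.0%}")
```

Women win within every department (20% vs. 15%, and 66% vs. 60%) but lose in the aggregate, purely because of where the applications go.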

Reliable Respondents

23 Jul

Setting aside concerns about sampling, the quality of survey responses on popular survey platforms is abysmal (see here and here). Both insincere and inattentive respondents are at issue. A common strategy for identifying inattentive respondents is to use attention checks. However, many of these attention checks stick out like sore thumbs. The upshot is that an experienced respondent can easily spot them. A parallel worry is that attention checks may confuse inexperienced respondents. To address these concerns, we need a new way to identify inattentive respondents. One way is to measure twice. More precisely, measure immutable or slowly changing traits, e.g., sex, education, etc., twice across closely spaced survey waves. Then, code cases where people switch answers on such traits across the waves as problematic. And then, use survey items (e.g., self-reports) and metadata (e.g., survey response time, IP addresses) from the first survey to predict problematic switches using modern ML techniques that allow variable selection, like LASSO (space is at a premium). Assuming the relationship holds out of sample, future survey creators can use the variables identified by LASSO to flag likely inattentive respondents.
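The proposed workflow can be sketched in a few lines. This is a toy illustration, not an actual pipeline: the predictors (response time, a self-report item, a datacenter-IP flag), their effects, and the data are all invented, and scikit-learn’s L1-penalized logistic regression stands in for “LASSO”:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Hypothetical first-wave predictors (all invented for illustration).
response_time = rng.gamma(shape=2.0, scale=120.0, size=n)  # seconds
self_report = rng.integers(1, 6, size=n)                   # 1-5 attentiveness scale
datacenter_ip = rng.integers(0, 2, size=n)                 # flag for datacenter IP

# Simulate switches on a slowly changing trait (e.g., education):
# in this toy world, fast responders and datacenter IPs switch more.
logit = -2.0 - 0.004 * response_time + 1.5 * datacenter_ip
switched = rng.random(n) < 1 / (1 + np.exp(-logit))

# L1 penalty shrinks uninformative coefficients to exactly zero,
# which is the variable-selection step.
X = np.column_stack([response_time, self_report, datacenter_ip])
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X, switched)

for name, coef in zip(["response_time", "self_report", "datacenter_ip"],
                      model.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```

Predictors whose coefficients survive the penalty are the ones a future survey creator would carry forward to flag likely inattentive respondents.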

Self-Recommending: The Origins of Personalization

6 Jul

Recommendation systems are ubiquitous. They determine what videos and news you see, what books and products are ‘suggested’ to you, and much more. If asked about the origins of personalization, my hunch is that some of us will pin it to the advent of the Netflix Prize. Wikipedia goes further back—it puts the first use of the term ‘recommender system’ in 1990. But the history of personalization is much older. It is at least as old as heterogeneous treatment effects (though latent variable models might be a yet more apt starting point). I don’t know how long we have known about heterogeneous treatment effects, but it can be no later than 1957 (Cronbach and Gleser, 1957).

Here’s Ed Haertel:

“I remember some years ago when NetFlix founder Reed Hastings sponsored a contest (with a cash prize) for data analysts to come up with improvements to their algorithm for suggesting movies subscribers might like, based on prior viewings. (I don’t remember the details.) A primitive version of the same problem, maybe just a seed of the idea, might be discerned in the old push in educational research to identify “aptitude-treatment interactions” (ATIs). ATI research was predicated on the notion that to make further progress in educational improvement, we needed to stop looking for uniformly better ways to teach, and instead focus on the question of what worked for whom (and under what conditions). Aptitudes were conceived as individual differences in preparation to profit from future learning (of a given sort). The largely debunked notion of “learning styles” like a visual learner, auditory learner, etc., was a naïve example. Treatments referred to alternative ways of delivering instruction. If one could find a disordinal interaction, such that one treatment was optimum for learners in one part of an aptitude continuum and a different treatment was optimum in another region of that continuum, then one would have a basis for differentiating instruction. There are risks with this logic, and there were missteps and misapplications of the idea, of course. Prescribing different courses of instruction for different students based on test scores can easily lead to a tracking system where high performing students are exposed to more content and simply get further and further ahead, for example, leading to a pernicious, self-fulfilling prophecy of failure for those starting out behind. There’s a lot of history behind these ideas. Lee Cronbach proposed the ATI research paradigm in a (to my mind) brilliant presidential address to the American Psychological Association, in 1957. 
In 1974, he once again addressed the American Psychological Association, on the occasion of receiving a Distinguished Contributions Award, and in effect said the ATI paradigm was worth a try but didn’t work as it had been conceived. (That address was published in 1975.)

This episode reminded me of the “longstanding principle in statistics, which is that, whatever you do, somebody in psychometrics already did it long before. I’ve noticed this a few times.”

Reading Cronbach today is also sobering in a way. It shows how ad hoc the investigation of theories and coming up with the right policy interventions was.

Interacting With Human Decisions

29 Jun

In sport, as in life, luck plays a role. For instance, in cricket, there is a toss at the start of the game. And the team that wins the toss wins the game 3% more often. The estimate of the advantage from winning the toss, however, is likely an underestimate of the maximum potential benefit of winning the toss. The team that wins the toss gets to decide whether to bat or bowl first. And 3% reflects the maximum benefit only when the team that won the toss chooses optimally.

The same point applies to estimates of heterogeneity. Say that you estimate how the probability of winning varies with the decision to bowl or bat first after winning the toss. (The decision to bowl or bat first is made before the toss.) And say that 75% of the time the team that wins the toss chooses to bat first and wins these games 55% of the time. The other 25% of the time, teams decide to bowl first and win about 47% of those games. The winning rates of 55% and 47% would likely be yet higher if the teams chose optimally.
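Using these illustrative numbers, the implied overall advantage for the toss-winning team takes a couple of lines of arithmetic:

```python
# Illustrative numbers from the text: the toss-winning team bats first
# 75% of the time and wins 55% of those games; it bowls first 25% of
# the time and wins 47% of those games.
p_bat, win_bat = 0.75, 0.55
p_bowl, win_bowl = 0.25, 0.47

overall = p_bat * win_bat + p_bowl * win_bowl
print(f"overall win rate after winning the toss: {overall:.0%}")  # 53%
```

A 53% overall rate is consistent with the roughly 3-point advantage from winning the toss noted above.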

In the absence of other data, heterogeneous treatment effects give clear guidance on where the payoffs are higher. For instance, if you find that showing an ad on Chrome has a larger treatment effect, barring other information (and concerns), you may want to only show ads to people who use Chrome to increase the treatment effect. But the decision to bowl or bat first is not a traditional “covariate.” It is a dummy that captures the human judgment about pre-match observables. The interpretation of the interaction term thus needs care. For instance, in the example above, the winning percentage of 47% for teams that decide to bowl first looks ‘wrong’—how can the team that wins the toss lose more often than win in some cases? Easy. It can happen because the team decides to bowl in cases where the probability of winning is lower than 47%. Or it can be that the team is making a bad decision when opting to bowl first.
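A small simulation makes the selection logic concrete. Under the assumed setup below (invented for illustration), bowling first is always the optimal choice for weaker teams, yet the conditional win rate among bowl-first games sits well below 50%:

```python
import random

random.seed(42)
N = 100_000
bowl_wins = bowl_games = bat_wins = bat_games = 0

for _ in range(N):
    # Pre-match win probability ("strength"), assumed uniform for illustration.
    strength = random.uniform(0.2, 0.8)
    # Assumption: bowling first adds a small edge for weak teams,
    # while batting first adds the same edge for strong teams.
    bowl = strength < 0.4
    p = strength + 0.03  # the team always picks its optimal option
    won = random.random() < p
    if bowl:
        bowl_games += 1
        bowl_wins += won
    else:
        bat_games += 1
        bat_wins += won

print(f"bowl first win rate: {bowl_wins / bowl_games:.1%}")
print(f"bat first win rate:  {bat_wins / bat_games:.1%}")
```

Every choice here is optimal, but because only weak teams bowl first, the bowl-first win rate lands near 33%, far below 50%. The sub-50% conditional rate reflects who makes the choice, not the quality of the choice.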