Do you ever discuss politics with your family or friends?
INAP. question not used
However, when we load the variable and examine the unique values:
# pulling anes-cdf from a GitHub repository
cdf <- rio::import("https://github.com/RobLytle/intra-party-affect/raw/master/data/raw/cdf-raw-trim.rds")
##  NA 5 1 6 7
We see a completely different coding scheme. We are left adrift, wondering “What is 6? What is 7?” Do 1 and 5 really mean “yes” and “no”?
We may never know.
For a survey that costs several million dollars to conduct, you’d think we could expect a double-checked codebook (or at least some kind of version control to easily fix these things as they’re identified).
A new paper purportedly shows that the 2018 release of the Apple Watch, which supported an ECG app, did not cause an increase in AFib diagnoses (mean = −0.008).
They base the claim on 60M visits from 1270 practices across 2 years.
Here are some things to think about:
Expected effect size. Say the base AF rate is .41%. Let’s say 10% have the ECG app + Apple Watch. (You have to make some assumptions about how quickly people downloaded the app. I am making a generous assumption that 10% do it the day of release.) For the 10%, say it is .51%. That is a .1 percentage point increase among 10% of people, or .01 percentage points on average. Add’l diagnoses expected = .01% * 30M ~ 3k.
Time trend. 2018-19 line is significantly higher (given the baseline) than 2016-2017. It is unlikely to be explained by the aging of the population. Is there a time trend? What explains it? More acutely, diff. in diff. doesn’t account for that.
Choice of the time period. When you have observations over multiple time periods pre-treatment and post-treatment, the inference depends on which time period you use. For instance, if I do an “ocular distortion test”, the diff. in diff. with observations from Aug./Sep. would suggest a large positive impact. For a more transparent account of assumptions, see diff.healthpolicydatascience.org (h/t Kyle Foreman).
Clustering of s.e. Diagnoses are likely correlated within a facility (doctor), and that correlation is unaccounted for.
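The expected effect size arithmetic in the first point can be written out as a short sketch. All inputs are the assumptions stated above (base rate, rate among watch owners, share of owners, and the 30M visits used in the text); the variable names are mine.

```python
# Back-of-the-envelope expected effect, using the numbers assumed in the text.
base_rate = 0.0041        # baseline AF diagnosis rate (.41%)
treated_rate = 0.0051     # assumed rate among people with the watch + ECG app (.51%)
share_treated = 0.10      # assumed share of people with the watch + ECG app
post_visits = 30_000_000  # the 30M visits used in the text

# Average change in the diagnosis rate across all visits.
avg_rate_change = share_treated * (treated_rate - base_rate)

# Additional diagnoses expected under these assumptions.
extra_diagnoses = avg_rate_change * post_visits

print(f"average rate change: {avg_rate_change:.4%}")
print(f"additional diagnoses: {extra_diagnoses:,.0f}")
```

An average shift of about a hundredth of a percentage point is the quantity any estimate has to be precise enough to detect.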
Tools define science. Not only do they determine how science is practiced but also what questions are asked. Take survey experiments, for example. Since the advent of online survey platforms, which made conducting survey experiments trivial, the lure of convenience and internal validity has persuaded legions of researchers to use survey experiments to understand the world.
Conventional survey experiments are modest tools. Paul Sniderman writes,
“These three limitations of survey experiments—modesty of treatment, modesty of scale, and modesty of measurement—need constantly to be borne in mind when brandishing term experiment as a prestige enhancer.” I think we can easily collapse these in two — treatment (which includes ‘scale’ as he defines it— the amount of time) and measurement.
Not Learning From the Control Group. The focus on differences in means means that we sometimes fail to reflect on what the data in the Control Group tells us about the world. Take the paper on partisan expressive responding, for instance. The topline from the paper is that expressive responding explains half of the partisan gap. But it misses the bigger story—the partisan differences in the Control Group are much smaller than what people expect, just about 6.5% (see here). (Here’s what I wrote in 2016.)
Not Putting the Effect Size in Context. A focus on significance testing means that we sometimes fail to reflect on the modesty of effect sizes. For instance, providing people $1 for a correct answer within the context of an online survey interview is a large premium. And if providing a dollar each on 12 (included) questions nudges people from an average of 4.5 correct responses to 5, it suggests that people are resistant to learning or impressively confident that what they know is right. Leaving $7 on the table tells us more than the .5, around which the paper is written.
More broadly, researchers often miss that what the results sometimes show is how impressively modest the movement is when you ratchet up the dosage. For instance, if an overwhelming number of African Americans favor a White student who has scored just a few points more than a Black student, it is a telling testament to their endorsement of meritocracy.
“Thus, when we calculate the net degree of expressive responding by subtracting the acceptance effect from the rejection effect—essentially differencing off the baseline effect of the incentive from the reduction in rumor acceptance with payment—we find that the net expressive effect is negative 0.5%—the opposite sign of what we would expect if there was expressive responding. However, the substantive size of the estimate of the expressive effect is trivial. Moreover, the standard error on this estimate is 10.6, meaning the estimate of expressive responding is essentially zero.”
(Note: This is not a full review of all the claims in the paper. There is more data in the paper than in the quote above. I am merely using the quote to clarify a couple of statistical points.)
There are two main points:
The facts that the estimate is close to zero and that the s.e. is super fat are technically unrelated. The last line of the quote, however, seems to draw a relationship between the two.
The estimated effect sizes of expressive responding in the literature are much smaller than the s.e. Bullock et al. (Table 2) estimate the effect of expressive responding at about 4% and Prior et al. (Figure 1) at about 5.5% (“Figure 1(a) shows, the model recovers the raw means from Table 1, indicating a drop in bias from 11.8 to 6.3.”). Thus, one reasonable inference is that the study is underpowered to detect effects of the expected size.
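To put a rough number on "underpowered," here is a minimal sketch of the power of a two-sided z-test at the 5% level. It assumes the estimate is approximately normal, that the s.e. of 10.6 and the effect sizes of 4 and 5.5 are in the same units (percentage points), and that the true effect equals the literature estimate. The function names are mine.

```python
from math import erf, sqrt


def normal_cdf(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def power_two_sided(effect: float, se: float, z_crit: float = 1.96) -> float:
    """Probability that |estimate / se| exceeds z_crit when the true effect is `effect`."""
    shift = effect / se
    return (1.0 - normal_cdf(z_crit - shift)) + normal_cdf(-z_crit - shift)


se = 10.6  # standard error reported in the quote
for effect in (4.0, 5.5):  # effect sizes from Bullock et al. and Prior et al.
    print(f"true effect {effect}: power = {power_two_sided(effect, se):.3f}")
```

Under these assumptions, the power is below 10% for both effect sizes, barely above the 5% false positive rate.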
Why are things the way they are? What is the effect of something? Both of these reverse and forward causation questions are vital.
When I was at Stanford, I took a class with a pugnacious psychometrician, David Rogosa. David had two pet peeves, one of which was people making causal claims with observational data. And it is in David’s class that I learned the pejorative for such claims. With great relish, David referred to such claims as ‘casual inference.’ (Since then, I have come up with another pejorative phrase for such claims—cosal inference—as in merely dressing up as causal inference.)
It turns out that despite its limitations, casual inference is quite common. Here are some fashionable costumes:
7 Habits of Successful People: We have all seen business books with such titles. The underlying message of these books is: adopt these habits, and you will be successful too! Let’s follow the reasoning and see where it falls apart. One stereotype about successful people is that they wake up early. And the implication is that if you wake up early, you can be successful too. It *seems* right. It agrees with folk wisdom that discomfort causes success. But can we reliably draw inferences about what less successful people should do based on what successful people do? No. For one, we know nothing about the habits of less successful people. It could be that less successful people wake up *earlier* than the more successful people. Certainly, growing up in India, I recall daily laborers waking up much earlier than people living in bungalows. And when you think of it, the claim that servants wake up before masters seems uncontroversial. It may even be routine enough to be canonized as a law: the Downton Abbey law. The upshot is that when you select on the dependent variable, i.e., only look at cases where the variable takes certain values, e.g., only look at the habits of financially successful people, even correlation is not guaranteed. This means that you don’t even get to mock the claim with the jibe that “correlation is not causation.”
Let’s go back to Goji’s delivery service for another example. One of the ‘tricks’ that we had discussed was to sample failures. If you do that, you are selecting on the dependent variable. And while it is a good heuristic, it can lead you astray. For instance, let’s say that most of the late deliveries are early morning deliveries. You may infer that delivering at another time may improve outcomes. Except, when you look at the data, you find that the bulk of your deliveries are in the morning. And the rate at which deliveries run late is *lower* in the early morning than during other times.
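A small sketch with made-up counts shows how both statements can be true at once. The delivery numbers below are hypothetical and chosen only for illustration.

```python
# Hypothetical delivery counts: (total deliveries, late deliveries) by time slot.
deliveries = {
    "early_morning": (700, 70),
    "other_times": (300, 45),
}

total_late = sum(late for _, late in deliveries.values())

# What you see if you only sample failures: the share of late deliveries in each slot.
share_of_late = {slot: late / total_late for slot, (_, late) in deliveries.items()}

# What you need to compare: the rate at which deliveries run late in each slot.
late_rate = {slot: late / total for slot, (total, late) in deliveries.items()}

for slot in deliveries:
    print(f"{slot}: share of late = {share_of_late[slot]:.1%}, late rate = {late_rate[slot]:.1%}")
```

Most late deliveries are in the early morning only because most deliveries are; the late rate, which needs the successes in the denominator, points the other way.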
There is a yet more famous example of things going awry when you select on the dependent variable. During World War II, statisticians were asked where armor should be added on the planes. Of the aircraft that returned, the damage was concentrated in a few areas, like the wings. The top-of-head answer is to reinforce the areas hit most often. But if you think about the planes that didn’t return, you get to the right answer, which is that we need to reinforce the areas that weren’t hit. In the literature, people call this kind of error survivorship bias. But it is a problem of selecting on the dependent variable (whether or not a plane returned): we only look at the planes that returned.
More frequent system crashes cause people to renew their software license. It is a mistake to treat correlation as causation. There are many different reasons behind why doing so can lead you astray. The rarest reason is that lots of odd things are correlated in the world because of luck alone. The point is hilariously illustrated by a set of graphs showing a large correlation between conceptually unrelated things, e.g., there is a large correlation between total worldwide non-commercial space launches and the number of sociology doctorates that are awarded each year.
A more common scenario is illustrated by the example in the title of this point. Commonly, there is a ‘lurking’ or ‘confounding’ variable that explains both sides. In our case, the more frequently a person uses a system, the more crashes they experience. And it makes sense that people who use the system most frequently also need the software the most and renew the license most often.
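The lurking-variable story can be simulated. In this sketch (all parameters are made up), usage drives both crashes and renewals, and crashes have no effect on renewals by construction.

```python
import random


def pearson(xs, ys):
    """Pearson correlation computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)
levels, crashes, renewals = [], [], []
for _ in range(20_000):
    level = rng.randint(1, 5)  # usage level: the confounder
    # Heavier users run more sessions, so they see more crashes.
    n_crashes = sum(rng.random() < 0.2 for _ in range(10 * level))
    # Heavier users renew more often; crashes play no role here.
    renewed = int(rng.random() < 0.1 + 0.15 * level)
    levels.append(level)
    crashes.append(n_crashes)
    renewals.append(renewed)

overall = pearson(crashes, renewals)

# Holding usage fixed, the correlation should all but vanish.
within = {}
for lvl in range(1, 6):
    idx = [i for i, l in enumerate(levels) if l == lvl]
    within[lvl] = pearson([crashes[i] for i in idx], [renewals[i] for i in idx])

print(f"overall correlation: {overall:.2f}")
for lvl, r in within.items():
    print(f"usage level {lvl}: correlation = {r:.2f}")
```

The pooled data show a clear positive correlation between crashes and renewals, while within each usage level the correlation hovers around zero.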
Another common but more subtle reason is called Simpson’s paradox. Sometimes the correlation you see is “wrong.” You may see a correlation in the aggregate, but the correlation runs the opposite way when you break it down by group. Gender bias in U.C. Berkeley admissions provides a famous example. In 1973, 44% of the men who applied to graduate programs were admitted, whereas only 35% of the women were. But when you split by department, which ultimately controlled admissions, women generally had a higher batting average than men. The reason for the reversal was that women applied more often to more competitive departments, like (wait for it) English, and men were more likely to apply to less competitive departments, like Engineering. None of this is to say that there isn’t bias against women. It is merely to point out that the pattern in aggregated data may not hold when you split the data into relevant chunks.
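Here is a minimal sketch of the reversal with two departments. The counts are invented for illustration and are not the Berkeley data.

```python
# Hypothetical (applied, admitted) counts by department and gender.
applications = {
    "less_competitive": {"men": (800, 480), "women": (100, 65)},
    "more_competitive": {"men": (200, 40), "women": (900, 225)},
}


def admit_rate(applied: int, admitted: int) -> float:
    return admitted / applied


# Admission rates within each department.
by_dept = {
    dept: {g: admit_rate(*counts) for g, counts in groups.items()}
    for dept, groups in applications.items()
}

# Admission rates after pooling the departments.
aggregate = {}
for gender in ("men", "women"):
    applied = sum(groups[gender][0] for groups in applications.values())
    admitted = sum(groups[gender][1] for groups in applications.values())
    aggregate[gender] = admit_rate(applied, admitted)

for dept, rates in by_dept.items():
    print(f"{dept}: men {rates['men']:.0%}, women {rates['women']:.0%}")
print(f"aggregate: men {aggregate['men']:.0%}, women {aggregate['women']:.0%}")
```

Women have the higher admission rate in each department but the lower rate overall, because most of them applied to the department that admits few people.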
It is also important to keep in mind the flip side of “correlation is not causation”: a lack of correlation does not imply a lack of causation.
Mayor Giuliani brought the NYC crime rate down. There are two potential errors here:
Forgetting about ecological trends. Crime rates in other big US cities went down at the same time as they did in NY, sometimes more steeply. When faced with a causal claim, it is good to check how ‘similar’ people fared. The Difference-in-Differences estimator builds on this intuition.
Treating temporally proximate as causal. Say you had a headache, you took some medicine and your headache went away. It could be the case that your headache went away by itself, as headaches often do.
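The intuition behind the first point can be sketched as a two-by-two Difference-in-Differences calculation. The crime index values below are made up; the point is only the arithmetic.

```python
# Hypothetical crime index before and after the mayor took office.
nyc = {"pre": 100.0, "post": 70.0}
other_cities = {"pre": 100.0, "post": 75.0}

# Naive before/after comparison credits the mayor with the whole drop.
naive_change = nyc["post"] - nyc["pre"]

# Difference-in-differences subtracts the change seen in comparable cities.
comparison_change = other_cities["post"] - other_cities["pre"]
did_estimate = naive_change - comparison_change

print(f"naive change: {naive_change:+.1f}")
print(f"change in comparison cities: {comparison_change:+.1f}")
print(f"difference-in-differences: {did_estimate:+.1f}")
```

The estimate leans on the assumption that NYC would have followed the same trend as the comparison cities absent the mayor, which is itself a claim to be defended.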
I took this homeopathic medication and my headache went away. If the ailments are real, placebo effects are a bit mysterious. And mysterious they may be but they are real enough. Not accounting for placebo effects misleads us to ascribe the total effect to the medicine.
Shallow causation. We ascribe more weight to immediate causes than to causes that are a few layers deeper.
Monocausation: In everyday conversations, it is common for people to speak as if x is the only cause of y.
Big Causation: Another common pitfall is reading x causes y as x causes y to change a lot. This is partly a consequence of mistaking statistical significance for substantive significance, and partly a consequence of us not paying close enough attention to numbers.
Same Effect: Lastly, many people take causal claims to mean that the effect is the same across people.
Setting aside concerns about sampling, the quality of survey responses on popular survey platforms is abysmal (see here and here). Both insincere and inattentive respondents are at issue. A common strategy for identifying inattentive respondents is to use attention checks. However, many of these attention checks stick out like sore thumbs. The upshot is that an experienced respondent can easily spot them. A parallel worry about attention checks is that inexperienced respondents may be confused by them. To address the concerns, we need a new way to identify inattentive respondents. One way to identify such respondents is to measure twice. More precisely, measure immutable or slowly changing traits, e.g., sex, education, etc., twice across closely spaced survey waves. Then, code cases where people switch answers across the waves on such traits as problematic. And then, use survey items (e.g., self-reports) and metadata (e.g., survey response time, IP address metadata) from the first survey to predict problematic switches using modern ML techniques that allow variable selection, like LASSO (space is at a premium). Assuming the equation holds, future survey creators can use the variables identified by LASSO to identify likely inattentive respondents.
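The "measure twice" labeling step can be sketched as follows. The respondent records and field names are hypothetical, and the sketch covers only the creation of the outcome variable, not the LASSO step that would use it.

```python
# Hypothetical answers to the same trait questions in two closely spaced waves.
wave1 = {
    "r1": {"sex": "F", "education": "BA", "birth_year": 1980},
    "r2": {"sex": "M", "education": "HS", "birth_year": 1975},
    "r3": {"sex": "F", "education": "MA", "birth_year": 1990},
    "r4": {"sex": "M", "education": "BA", "birth_year": 1988},
}
wave2 = {
    "r1": {"sex": "F", "education": "BA", "birth_year": 1980},
    "r2": {"sex": "F", "education": "HS", "birth_year": 1975},  # sex switched
    "r3": {"sex": "F", "education": "MA", "birth_year": 1990},
    "r4": {"sex": "M", "education": "HS", "birth_year": 1968},  # two switches
}

# Traits that should not change between closely spaced waves.
STABLE_TRAITS = ("sex", "education", "birth_year")


def switched_traits(first: dict, second: dict, traits=STABLE_TRAITS) -> list:
    """Return the traits on which a respondent's answers differ across waves."""
    return [t for t in traits if first[t] != second[t]]


# Label each respondent who appears in both waves.
switches = {
    rid: switched_traits(wave1[rid], wave2[rid])
    for rid in sorted(wave1.keys() & wave2.keys())
}
problematic = [rid for rid, traits in switches.items() if traits]

print(problematic)
```

The resulting flag is the outcome that first-wave items and metadata would then be used to predict.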
Recommendation systems are ubiquitous. They determine what videos and news you see, what books and products are ‘suggested’ to you, and much more. If asked about the origins of personalization, my hunch is that some of us will pin it to the advent of the Netflix Prize. Wikipedia goes further back—it puts the first use of the term ‘recommender system’ in 1990. But the history of personalization is much older. It is at least as old as heterogeneous treatment effects (though latent variable models might be a yet more apt starting point). I don’t know for how long we have known about heterogeneous treatment effects but it can be no later than 1957 (Cronbach and Goldine Gleser, 1957).
Here’s Ed Haertel:
“I remember some years ago when NetFlix founder Reed Hastings sponsored a contest (with a cash prize) for data analysts to come up with improvements to their algorithm for suggesting movies subscribers might like, based on prior viewings. (I don’t remember the details.) A primitive version of the same problem, maybe just a seed of the idea, might be discerned in the old push in educational research to identify “aptitude-treatment interactions” (ATIs). ATI research was predicated on the notion that to make further progress in educational improvement, we needed to stop looking for uniformly better ways to teach, and instead focus on the question of what worked for whom (and under what conditions). Aptitudes were conceived as individual differences in preparation to profit from future learning (of a given sort). The largely debunked notion of “learning styles” like a visual learner, auditory learner, etc., was a naïve example. Treatments referred to alternative ways of delivering instruction. If one could find a disordinal interaction, such that one treatment was optimum for learners in one part of an aptitude continuum and a different treatment was optimum in another region of that continuum, then one would have a basis for differentiating instruction. There are risks with this logic, and there were missteps and misapplications of the idea, of course. Prescribing different courses of instruction for different students based on test scores can easily lead to a tracking system where high performing students are exposed to more content and simply get further and further ahead, for example, leading to a pernicious, self-fulfilling prophecy of failure for those starting out behind. There’s a lot of history behind these ideas. Lee Cronbach proposed the ATI research paradigm in a (to my mind) brilliant presidential address to the American Psychological Association, in 1957. 
In 1974, he once again addressed the American Psychological Association, on the occasion of receiving a Distinguished Contributions Award, and in effect said the ATI paradigm was worth a try but didn’t work as it had been conceived. (That address was published in 1975.)”
This episode reminded me of the “longstanding principle in statistics, which is that, whatever you do, somebody in psychometrics already did it long before. I’ve noticed this a few times.”
Reading Cronbach today is also sobering in a way. It shows how ad hoc the investigation of theories and coming up with the right policy interventions was.
In sport, as in life, luck plays a role. For instance, in cricket, there is a toss at the start of the game. And the team that wins the toss wins the game 3% more often. The estimate of the advantage from winning the toss, however, is likely an underestimate of the maximum potential benefit of winning the toss. The team that wins the toss gets to decide whether to bat or bowl first. And 3% reflects the maximum benefit only when the team that won the toss chooses optimally.
The same point applies to estimates of heterogeneity. Say that you estimate how the probability of winning varies by the decision to bowl or bat first after winning the toss. (The decision to bowl or bat first is made before the toss.) And say, 75% of the time, the team that wins the toss chooses to bat first and wins these games 55% of the time. 25% of the time, teams decide to bowl first and win about 47% of these games. Winning rates of 55% and 47% would likely be yet higher if the teams chose optimally.
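As a quick check that the illustrative numbers hang together, the overall win rate of the toss winner is the weighted average of the two conditional win rates:

```python
# Illustrative numbers from the text: (share of games, win rate) by decision.
decisions = {
    "bat_first": (0.75, 0.55),
    "bowl_first": (0.25, 0.47),
}

# Overall win rate for the team that wins the toss.
overall_win_rate = sum(share * rate for share, rate in decisions.values())

# Advantage relative to a coin flip.
toss_advantage = overall_win_rate - 0.50

print(f"overall win rate: {overall_win_rate:.1%}")
print(f"advantage from winning the toss: {toss_advantage:.1%}")
```

The weighted average is 53%, which matches the 3% advantage mentioned earlier.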
In the absence of other data, heterogeneous treatment effects give clear guidance on where the payoffs are higher. For instance, if you find that showing an ad on Chrome has a larger treatment effect, barring other information (and concerns), you may want to only show ads to people who use Chrome to increase the treatment effect. But the decision to bowl or bat first is not a traditional “covariate.” It is a dummy that captures the human judgment about pre-match observables. The interpretation of the interaction term thus needs care. For instance, in the example above, the winning percentage of 47% for teams that decide to bowl first looks ‘wrong’—how can the team that wins the toss lose more often than win in some cases? Easy. It can happen because the team decides to bowl in cases where the probability of winning is lower than 47%. Or it can be that the team is making a bad decision when opting to bowl first.
The mortality rate is puzzling to mortals. A better number is the expected number of years lost. (A yet better number would be quality-adjusted years lost.) To make it easier to calculate the expected years lost, Suriyan and I developed a Python package that uses the SSA actuarial data and life table to estimate the expected years lost.
We illustrate the use of the package by estimating the average number of years by which people’s lives are shortened due to coronavirus (see Note 1 at the end of the article). Using data from Table 1 of the paper that gives us the distribution of ages of people who died from COVID-19 in China, with conservative assumptions (assuming the gender of the dead person to be male, taking the middle of age ranges), we find that people’s lives are shortened by about 11 years on average. These estimates are conservative for one additional reason: there is likely an inverse correlation between people who die and their expected longevity. And note that given that the bulk of the deaths are among older people, when people are more infirm, the quality-adjusted years lost is likely yet more modest. Given that the last life tables from China are from 1981 and given that life expectancy in China has risen substantially since then (though most gains come from reductions in childhood mortality, etc.), we exploit the recent data from the US, assuming as-if people have the same life tables as Americans. Using the most recent SSA data, we find the number to be 16. Compare this to deaths from road accidents, the modal reason for death among the 5-24 and 25-44 age groups in the US. Assuming everyone who dies from a traffic accident is a man, and assuming the age of death to be 25, we get ~52 years, roughly 3x as large as that of coronavirus (see Note 3 at the end of the article). On the other hand, smoking on average shortens life by about seven years. (See Note 2 at the end of the article.)
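The calculation itself is a weighted average. The sketch below does not use the package or its API, and both the death counts and the remaining life expectancies are invented placeholders, not SSA values or figures from the paper.

```python
# Hypothetical deaths by age bracket (made-up counts).
deaths_by_bracket = {
    (40, 49): 10,
    (50, 59): 30,
    (60, 69): 60,
    (70, 79): 80,
}

# Hypothetical remaining life expectancy (years) for a male at the bracket midpoint.
# Placeholder values only; real values would come from a life table.
remaining_years = {
    (40, 49): 34.0,
    (50, 59): 26.0,
    (60, 69): 18.0,
    (70, 79): 11.0,
}

total_deaths = sum(deaths_by_bracket.values())

# Expected years lost per death: remaining years weighted by the share of deaths.
expected_years_lost = (
    sum(n * remaining_years[bracket] for bracket, n in deaths_by_bracket.items())
    / total_deaths
)

print(f"expected years lost per death: {expected_years_lost:.1f}")
```

Swapping in an actual life table and an observed age distribution of deaths gives estimates of the kind reported above.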
8/4 Addendum: Using COVID-19 Electronic Death Certification Data (CEPIDC), like above, we estimate the average number of years lost by people dying of coronavirus. With conservative assumptions (assuming the gender of the dead person to be male, taking the middle of age ranges) we find that people’s lives are shortened by about 9 years on average. Surprisingly, the average number of years lost of the people dying of coronavirus remained steady at about 9 years between March and July 2020.
Note 1: Years lost is not sufficient to understand the impact of Covid-19. Covid-19 has had dramatic consequences on the quality of life and has had a large financial impact, among other things. It is useful to account for those when estimating the net impact of Covid-19.
Note 2: In the calculations above, we assume that all the deaths from Coronavirus have been observed. One could do the calculation differently by tracking life spans of people infected with Covid-19 and comparing it to a similar set of people who were never infected with Covid-19. Presumably, the average years lost for people who don’t die of Covid-19 when they are first infected is a lot lower. Thus, counting them would bring the average years lost way down.
Note 3: The net impact of Covid-19 on years lost in the short-term should plausibly account for years saved because of fewer traffic accidents, etc.
You may have heard that most published research is false (Ioannidis). But what you probably don’t know is that most corporate data science is also false.
The returns on data science in most companies are likely sharply negative. There are a few reasons for that. First, as with any new ‘hot’ field, the skill level of the average worker is low. Second, the skill level of the people managing these workers is also low—most struggle to pose good questions, and when they stumble on one, they struggle to answer it well. Third, data science often fails silently (or there is enough corporate noise around it that most failures are well-hidden in plain sight), so the opportunity to learn from mistakes is small. And if that was not enough, many companies reward speed over correctness, and in doing that, often obtain neither.
How can we improve on the status quo? The obvious remedy for the first two issues is to increase the skill by improving training or creating specializations. And one remedy for the latter two points is to create incentives for doing things correctly.
Increasing training and creating specializations in data science is expensive and slow. Vital, but slow. Creating the right incentives for good data science work is not trivial either. There are at least two large forces lined up against it: incompetent supervisors and the fluid and collaborative nature of work—work usually involves multiple people, and there is a fluid exchange of ideas. Only the first is fixable—the latter is a property of work. And fixing it comes down to making technical competence a much more important criterion for hiring.
Aside from hiring more competent workers or increasing the competence of workers, you can also simulate the effect by using checklists—increase quality by creating a few “pause points”—times during a process where the person (team) pauses and goes through a standard list of questions.
To give body to the boast, let me list some common sources of failures in DS and how checklists at different pause points may reduce failure.
Learn what you will lose in translation. Good data science begins with a good understanding of the problem you are trying to solve. Once you understand the problem, you need to translate it into a suitable statistical analog. During translation, you need to be aware that you will lose something in the translation.
Learn the limitations. Learn what data you would love to have to answer the question if money were no object. And use it to understand how far you fall short of that ideal and then come to a judgment about whether the question can be answered reasonably with the data at hand.
Learn how good the data are. You may think you have the data, but it is best to verify it. For instance, it is good practice to think through the extent to which a variable captures the quantity of interest.
Learn the assumptions behind the formulas you use and test the assumptions to find the right thing to do. Thou shalt only use math formulas when you know the limitations of such formulas. Having a good grasp of when formulas don’t work is essential. For instance, say the task is to describe a distribution. Someone may use the mean and standard deviation to describe it. But we know that sufficient statistics vary by distribution. For a binomial, it may just be p. A checklist for “describing” a variable can be:
check skew by plotting: averages are useful when distributions are symmetric, and lots of observations are close to the mean. If skewed, you may want to describe various percentiles.
how many missing values and what explains the missing values.
check for unusual values and what explains the ‘unusual’ values.
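The checklist above can be turned into a small helper. This is a sketch using only the standard library; the function name, the use of `None` for missing values, and the 1.5 x IQR rule for flagging unusual values are my choices, not something the checklist prescribes.

```python
import statistics


def describe(values):
    """Run a small 'describe a variable' checklist on a list with None for missing."""
    observed = sorted(v for v in values if v is not None)
    q1, median, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        # How many values are missing? (Explaining why is left to the analyst.)
        "n_missing": len(values) - len(observed),
        # A mean far from the median hints at skew; plot to confirm.
        "mean": statistics.mean(observed),
        "median": median,
        # Percentiles describe skewed distributions better than the mean.
        "deciles": statistics.quantiles(observed, n=10),
        # Values outside the 1.5 x IQR fences deserve a closer look.
        "unusual": [v for v in observed if v < low or v > high],
    }


report = describe([1, 2, 2, 3, 3, 3, 4, 4, 5, 100, None, None])
for key, value in report.items():
    print(f"{key}: {value}")
```

The helper only surfaces what to look at; explaining the missing and unusual values still takes knowledge of how the data were generated.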