Sometimes Scientists Spread Misinformation

24 Aug

To err is human. Good scientists are aware of that, painfully so. The model scientist obsessively checks everything twice over and still keeps eyes peeled for loose ends. So it is a shock to learn that some of us are culpable for spreading misinformation.

Ken and I find that articles with serious errors, even articles based on fraudulent data, continue to be approvingly cited—cited without any mention of any concern—long after the problems have been publicized. Using a novel database of over 3,000 retracted articles and over 74,000 citations to these articles, we find that at least 31% of the citations to retracted articles happen a year after the publication of the retraction notice. And that over 90% of these citations are approving.

What gives our findings particular teeth is the role citations play in science. Many, if not most, claims in a scientific article rely on work done by others. And scientists use citations to back such claims. The readers rely on scientists to note any concerns that impinge on the underlying evidence for the claim. And when scientists cite problematic articles without noting any concerns they very plausibly misinform their readers.

Though 74,000 is a large enough number to be deeply concerning, retractions are relatively infrequent, and that may lead some people to discount these results. But citations to retracted articles post-retraction are extremely revealing precisely because retraction is such a low, low bar. Retractions generally follow convincing evidence of serious malpractice, usually fraud or grave error. Anything short of that, for example, an error in data analysis, is usually left to ‘self-correct.’ So if scientists are approvingly citing retracted articles after they have been retracted, they are failing to clear even that low bar. Such failure suggests a broader malaise.

To investigate the broader malaise, Ken and I exploited data from an article published in Nature that notes a statistical error in a series of articles published in prominent journals. Once again, we find that approving citations to the erroneous articles persist after the error has been publicized. If anything, the rate of citation to the erroneous articles is higher after the error has been publicized, and 98% of the citations are approving.

In all, it seems, we are failing.

The New Unit of Scientific Production

11 Aug

One fundamental principle of science is that there is no privileged observer. You get to question what people did. But to question, you first must know what people did. So part of good scientific practice is to make it easy for people to understand how the sausage was made—how the data were collected, transformed, and analyzed—and ideally, why you chose to make the sausage that particular way. Papers are ok places for describing all this, but we now have better tools: version controlled repositories with notebooks and readme files.

The barrier to understanding is not just lack of information but also poorly organized information. There are three different arcs of information: cross-sectional (where everything is and how the pieces relate to one another), temporal (how the pieces evolve over time), and inter-personal (who is making the changes). To be organized cross-sectionally, you need to be macro organized (where is the data, where are the scripts, what does each script do, how do I know what the data mean, etc.) and micro organized (each script has logic and organization to it, which also means following good coding style). Temporal organization in version control simply requires meaningful commit messages. And inter-personal organization requires no effort at all, beyond the logic of pull requests.
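As a concrete illustration of the macro organization, a repository might be laid out like this (the layout below is only one possibility, not a prescription):

```
project/
├── README.md           # what this is, how to run it, where the data come from
├── data/
│   ├── raw/            # data as collected; never edited by hand
│   └── processed/      # produced by the scripts; regenerated, never hand-edited
├── scripts/
│   ├── 01_clean.py     # a line at the top of each script saying what it does
│   └── 02_analyze.py
├── notebooks/
│   └── analysis.ipynb  # the 'why' behind the data and modeling choices
└── codebook.md         # what the data mean, column by column
```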

The obvious benefits of this new way are known. But what is less discussed is that it allows you to critique specific pull requests and specific decisions made in particular commits. This provides an entirely new way to make progress in science. The new unit of science also means that we don’t just dole out credit in a crude currency like journal articles; we can also pay in lower denominations. We can credit each edit, each suggestion. And why not? The third big benefit is that we can build epistemological trees where the logic of disagreement is clear.

The dead tree edition is dead. It is also time to retire the e-version of the dead tree edition.

Quality Data: Plumbing ML Data Pipelines

6 Aug

What’s the difference between a scientist and a data scientist? Scientists often collect their own data, and data scientists often use data collected by other people. That is part jest but speaks to an important point. Good scientists know their data. Good data scientists must know their data too. To help data scientists learn about the data they use, we need to build systems that give them good data about the data. But what is good data about the data? And how do we build systems that deliver that? Here’s some advice (tailored toward rectangular data for convenience):

  • From Where, How Much, and Such
    • Provenance: how were each of the columns in the data created (obtained)? If the data are derivative, find out the provenance of the original data. Be as concrete as possible, linking to scripts, related teams, and such.
    • How frequently are the data updated?
    • Cost per unit of data, e.g., a cell in rectangular data.
    Both the frequency with which the data are updated and the cost per unit of data may change over time. Provenance may change as well: a new team (or person) may start managing the data. So the person who ‘owns’ the data must come back to these questions every so often. Come up with a plan.
  • What? To know what the data mean, you need a data dictionary. A data dictionary explains the key characteristics of the data. It includes:
    1. Information about each of the columns in plain language.
    2. How were the data collected? For instance, if you conducted a survey, you need the question text and the response options (if any) that were offered, along with the ‘mode’, where the question fell in the sequence, whether it was alone on the screen, etc.
    3. Data type
    4. How (if at all) are missing values generated?
    5. For integer columns, the range, sd, mean, median, n_0s, and n_missing. For categorical columns, the number of unique values, what each label means, and a frequency table that includes n_missing (if missing values can be of multiple types, show a row for each).
    6. The number of duplicate rows in the data, whether duplicates are allowed, and why you would expect to see them.
    7. Number of rows and columns
    8. Sampling
    9. For supervised models, store correlation of y with key x_vars
  • What If? What if you have a question? Who should you bug? Who ‘owns’ the ‘column’ of data?

Store these data in JSON so that you can validate future data against it. Produce the JSON for each update. You can then flag when the data are some number of s.d. above or below the last ingest.
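To make this concrete, here is a minimal sketch in Python/pandas that builds such a dictionary and writes it out as JSON at each ingest. The file name, the column description, and the owner are hypothetical.

```python
import json
import pandas as pd

def build_data_dictionary(df: pd.DataFrame, descriptions: dict, owner: str) -> dict:
    """Summarize a rectangular dataset: shape, duplicates, and per-column stats."""
    dictionary = {
        "owner": owner,
        "n_rows": int(df.shape[0]),
        "n_cols": int(df.shape[1]),
        "n_duplicate_rows": int(df.duplicated().sum()),
        "columns": {},
    }
    for col in df.columns:
        s = df[col]
        entry = {
            "description": descriptions.get(col, ""),  # plain-language meaning
            "dtype": str(s.dtype),
            "n_missing": int(s.isna().sum()),
        }
        if pd.api.types.is_numeric_dtype(s):
            entry.update({
                "min": float(s.min()),
                "max": float(s.max()),
                "mean": float(s.mean()),
                "median": float(s.median()),
                "sd": float(s.std()),
                "n_zeros": int((s == 0).sum()),
            })
        else:
            entry.update({
                "n_unique": int(s.nunique(dropna=True)),
                # frequency table, with missing values kept as their own row
                "frequency": {str(k): int(v) for k, v in
                              s.value_counts(dropna=False).items()},
            })
        dictionary["columns"][col] = entry
    return dictionary

df = pd.read_csv("ingest_2021_08_06.csv")  # hypothetical ingest
dd = build_data_dictionary(df,
                           descriptions={"age": "age in years at survey date"},
                           owner="data-platform-team")
with open("data_dictionary_2021_08_06.json", "w") as f:
    json.dump(dd, f, indent=2, default=str)
```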

Store all this metadata with the data. For example, you can extend the dataframe class in Scala to make it so.

Auto-generate reports in markdown with each ingest.

In many ML applications, you are also ingesting data back from the user. So you need the same as above for the data you are getting from the user (and some of it at least needs to match the stored data). 

For any derived data, you need the scripts and the logic, ideally in a notebook. This is your translation function.

Where possible, follow the third normal form of databases. Only store translations when translation is expensive. Even then, think twice.

Lastly, some quality control. Periodically sit down with your team to see if you should see what you are seeing. For instance, if you are in the survey business, do the completion times make sense? If you are doing supervised learning, get a random sample of labels and assess their quality. You can also assess quality by looking at the errors your supervised model makes: are the errors because the data are mislabeled? Keep iterating. Keep improving. And keep cataloging those improvements. You should be able to ‘diff’ data collection, not just numerical summaries of the data. And with the method I highlight above, you can.
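Here is a minimal sketch of that diff, assuming the JSON dictionaries produced by the sketch above exist for two consecutive ingests (the file names are again hypothetical). It flags new columns, dropped columns, and numeric columns whose mean moved more than k standard deviations since the last ingest.

```python
import json

def diff_dictionaries(old: dict, new: dict, k: float = 3.0) -> list:
    """Compare two data dictionaries and return a list of human-readable flags."""
    flags = []
    old_cols, new_cols = set(old["columns"]), set(new["columns"])
    for col in sorted(new_cols - old_cols):
        flags.append(f"new column: {col}")
    for col in sorted(old_cols - new_cols):
        flags.append(f"dropped column: {col}")
    for col in sorted(old_cols & new_cols):
        o, n = old["columns"][col], new["columns"][col]
        if "mean" in o and "mean" in n and o.get("sd"):
            drift = abs(n["mean"] - o["mean"]) / o["sd"]
            if drift > k:
                flags.append(f"{col}: mean moved {drift:.1f} s.d. since last ingest")
    return flags

with open("data_dictionary_2021_08_05.json") as f:
    last = json.load(f)
with open("data_dictionary_2021_08_06.json") as f:
    current = json.load(f)

for flag in diff_dictionaries(last, current):
    print(flag)
```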

Optimal Sequence in Which to Service Orders

27 Jul

What is the optimal order in which to service orders assuming a fixed budget?

Let’s assume that we have to service orders o_1, …, o_n, with the orders indexed by i. Let’s also assume that for each service order, we know how the costs change over time. For simplicity, let’s assume that time is discrete and measured in days. If we service order o_i at time t, we expect the cost to be c_it. Each service order also has an expiration time, j, after which the order cannot be serviced. The cost at expiration time, j, is the cost of failure and is denoted by c_ij.

The optimal sequence of servicing orders is determined by expected losses: service first the order where the expected loss is greatest. This leaves us with the question of how to estimate the expected loss at time t. To come up with an expectation, we need to sum over a probability distribution. For order o_i at time t, we need the probability, p_is, that we would service o_i at each time s from t+1 through j. And then we multiply each p_is by the corresponding cost, c_is. So framed, the expected loss for order i at time t =
c_it – Σ_{s = t+1}^{j} p_is * c_is

However, determining p_it is not straightforward. New items are added to the queue at t+1. On the flip side, we also get to re-prioritize at t+1. The question is whether we will get to item o_i at t+1. (Under that framing, the probability is either 0 or 1.) For that, we need to forecast the kinds of items in the queue tomorrow. One simplification is to assume that the items in the queue today are the same ones that will be in the queue tomorrow. Then the problem reduces to estimating the cost of punting each item again tomorrow, sorting based on the costs at t+1, and checking whether we will get to clear the item. (We can forgo the simplification by forecasting our queue tomorrow, and each day after that until j, for each item, and calculating the costs.)

If the data are available, we can tack on the clearing time per order and get a better answer to whether we will clear o_i at time t or not.
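To make the calculation concrete, here is a minimal sketch with made-up costs and probabilities. probs[s] is read as the probability that the order ends up being serviced on day s if we do not service it today, and the expected loss is taken as the expected extra cost of not servicing the order today (the formula above, up to sign), so that servicing the order with the largest loss first gives the right sequence.

```python
def expected_loss_of_waiting(costs, probs, t):
    """Expected extra cost of not servicing the order on day t."""
    j = len(costs) - 1  # expiration day; costs[j] is c_ij, the cost of failure
    expected_future_cost = sum(probs[s] * costs[s] for s in range(t + 1, j + 1))
    return expected_future_cost - costs[t]

def prioritize(orders, t):
    """Service first the orders whose expected loss from waiting is greatest."""
    losses = {name: expected_loss_of_waiting(c, p, t) for name, (c, p) in orders.items()}
    return sorted(losses, key=losses.get, reverse=True)

# Two toy orders over a four-day horizon (t = 0, 1, 2; day 3 is expiration).
# For each order: (costs per day, probability of being serviced on each day).
orders = {
    "o_1": ([10, 12, 15, 40], [0.0, 0.5, 0.3, 0.2]),
    "o_2": ([10, 11, 12, 20], [0.0, 0.6, 0.3, 0.1]),
}
print(prioritize(orders, t=0))  # ['o_1', 'o_2']: punting o_1 is costlier
```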

Pride and Prejudice

14 Jul

It is ‘so obvious’ that policy A >> policy B that only those who don’t want to know, or those who want inferior things, would support policy B. Does this conversation remind you of any that you have had? We don’t just have such conversations about policies. We also have them about people. Way too often, we are being too harsh.

We overestimate how much we know. We ‘know know’ that we are right, we ‘know’ that there isn’t enough information in the world that will make us switch to policy B. Often, the arrogance of this belief is lost on us. As Kahneman puts it, we are ‘ignorant of our own ignorance.’ How could it be anything else? Remember the aphorism, “the more you know, the more you know you don’t know”? The aphorism may not be true but it gets the broad point right. The ignorant are overconfident. And we are ignorant. The human condition is such that it doesn’t leave much room for being anything else (see the top of this page).

Here’s one way to judge your ignorance (see here for some other ideas). Start by recounting what you know. Sit in front of a computer and type it up. Go for it. And then add a sentence about how you know it. Do you recall reading any detailed information about this person or issue? From where? Would you have bought a car if you had that much information about the car?

We don’t just overestimate what we know; we also underestimate what other people know. Anybody with different opinions must know less than I do. It couldn’t be that they know more, could it?

Both being overconfident about what we know and underestimating what other people know lead to the same thing: being too confident about the rightness of our cause and mistaking our prejudices for obvious truths.

George Carlin got it right. “Have you ever noticed that anybody driving slower than you is an idiot, and anyone going faster than you is a maniac?” It seems the way we judge drivers is how we judge everything else. Anyone who knows less than you is either being willfully obtuse or an idiot. And those who know more than you just look like ‘maniacs.’

The Base ML Model

12 Jul

The days of the artisanal ML model are mostly over. The artisanal model builds off domain “knowledge” (it can often be considerably less than that, bordering on misinformation). The artisan has long discussions with domain experts about what variables to include and how to include them in the model, often making idiosyncratic decisions about both. Or the artisan thinks deeply and draws on his own well. And then applies a couple of methods to the final feature set of 10s of variables, and out pops “the” model. This is borderline farcical when the datasets are both long and wide. For supervised problems, the low cost, scalable, common sense thing to do is to implement the following workflow:

1. Get good univariate summaries of each column in the data: mean, median, min, max, sd, and n_missing for numeric columns; the number of unique values, n_missing, and frequency counts for categorical columns; etc. Use this to diagnose and understand the data. What stuff is common? On what variables do we have bad data? (See pysum.)

2. Get good bivariate summaries. Correlations for continuous variables and differences in means for categorical variables are reasonable. Use this to understand how the variables are related. Use this to understand the data.

3. Create a dummy vector for missing values for each variable

4. Subset on non-sparse columns

5. Regress on all non-sparse columns, ideally using NN, so that you are not in the business of creating interactions and such.

I have elided a lot of detail, so let’s take a more concrete example. Say you are predicting whether someone will be diagnosed with diabetes in year y given the claims they made in years y-1, y-2, y-3, etc. Say the claim for each service and medicine is a unique code. Tokenize the claims data so that each unique code gets its own column, and filter to the non-sparse codes. How much information about time you want to preserve is up to you, but for a first cut, roll up the data so that a claim for code X made in any year is treated the same. Voila! You have your baseline model.
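For concreteness, here is a minimal sketch of such a baseline. The file and column names are hypothetical: a long claims table with one row per (member, claim code) from prior years, and a labels table with the year-y diagnosis. A regularized logistic regression stands in for the model; a neural net, as suggested above, would slot in the same way. (The missing-value dummies of step 3 are moot in this representation: a code either appears for a member or it does not.)

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

claims = pd.read_csv("claims_prior_years.csv")  # columns: member_id, claim_code
labels = pd.read_csv("diabetes_year_y.csv")     # columns: member_id, diabetes (0/1)

# Tokenize: one column per unique claim code, 1 if the member ever filed it,
# rolling up time so a code filed in any prior year is treated the same.
X = pd.crosstab(claims["member_id"], claims["claim_code"]).clip(upper=1)

# Keep only non-sparse codes (here: codes filed by at least 1% of members).
X = X.loc[:, X.mean() >= 0.01]

# Align labels to the feature matrix (assumes one label per member).
y = labels.set_index("member_id").loc[X.index, "diabetes"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("baseline accuracy:", model.score(X_test, y_test))
```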

Optimal Sequence in Which to Schedule Appointments

1 Jul

Say that you have a travel agency. Your job is to book rooms at hotels. Some hotels fill up more quickly than others, and you want to figure out which hotels to book first so that your net booking rate is as high as it can be given the staff you have.

The logic of prioritization is simple: prioritize those hotels where the expected loss if you don’t book now is the largest. The only thing we need to do is find a way to formalize the losses. Going straight to formalization is daunting. A toy example helps.

Imagine that there are two hotels, Hotel A and Hotel B. If you call 2-days and 1-day in advance, the chances of successfully booking a room are .8 and .8 at Hotel A, and .8 and .5 at Hotel B. You can only make one call a day, so it is Hotel A or Hotel B. Also assume that failing to book a room costs the same at Hotel A and at Hotel B.

If you were making a decision 1-day out on which hotel to call, the smart thing would be to choose Hotel A: the probability of making a booking is larger. But ‘larger’ can be formalized in terms of losses. At day 0, the probability goes to 0. So you stand to lose .8 with Hotel A and .5 with Hotel B. The potential loss from waiting is larger for Hotel A than for Hotel B.

If you were asked to choose 2-days out, which one should you choose? For Hotel A, if you forgo calling 2-days out, your chances of successfully booking a room the next day are .8. For Hotel B, the chances are .5. Let’s play out the two scenarios. If we book at Hotel A 2-days out and Hotel B 1-day out, our expected batting average is (.8 + .5)/2. If we choose the opposite, our batting average is (.8 + .8)/2. It makes sense to choose the latter. Framed as expected losses, Hotel A goes from .8 to .8, an expected loss of 0, while Hotel B goes from .8 to .5, an expected loss of .3. So we should book Hotel B 2-days out.

Now that we have the intuition, let’s move to 3-days, 2-days, and 1-day out, as that generalizes nicely to k-days out. To understand the logic, let’s first work out a 101 probability question. Say that you have two fair coins that you toss independently. What is the chance of getting at least one head? The possible outcomes are HH, HT, TH, and TT. The chance is 3/4. Or 1 minus the chance of getting TT (two failures): 1 – .5*.5.

The 3-days out example is next; see the table below. If you miss the chance of calling a hotel 3-days out, the expected loss is the decline in the chances of successfully booking 2-days or 1-day out. Assume that the probabilities 2-days out and 1-day out are independent, and it becomes something like the coin example: the probability of successfully booking 2-days or 1-day out is 1 – the probability of failing on both days. Calculate the expected losses for each hotel, and you have a way to decide which hotel to call on day 3.

|       | 3-day | 2-day | 1-day |
|-------|-------|-------|-------|
| Hotel A | .9    | .9    | .4    |
| Hotel B | .9    | .9    | .9    |

In our example, the chances of still booking a room if we skip the 3-days-out call come to 1 – (1/10)*(6/10) = .94 for Hotel A and 1 – (1/10)*(1/10) = .99 for Hotel B. The expected loss from waiting is therefore larger for Hotel A (.9 – .94) than for Hotel B (.9 – .99), so we should call Hotel A 3-days out before we call Hotel B.
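Here is a minimal sketch of the full rule, using the probabilities from the table and the same independence assumption as above.

```python
def loss_of_waiting(day_probs):
    """Today's booking probability minus the chance of still booking on a later day."""
    today, later = day_probs[0], day_probs[1:]
    p_fail_all_later = 1.0
    for p in later:
        p_fail_all_later *= (1 - p)
    return today - (1 - p_fail_all_later)

def which_hotel_to_call(probs):
    """Call the hotel with the largest expected loss from waiting."""
    return max(probs, key=lambda hotel: loss_of_waiting(probs[hotel]))

# Probabilities from the table, listed from 3-days out down to 1-day out.
probs = {"Hotel A": [0.9, 0.9, 0.4], "Hotel B": [0.9, 0.9, 0.9]}
print(which_hotel_to_call(probs))  # Hotel A: .9 - .94 > .9 - .99
```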

Code 44: How to Read Ahler and Sood

27 Jun

This is a follow-up to the hilarious Twitter thread about the sequence of 44s. Numbers in Perry’s 538 piece come from this paper.

First, yes, the 44s are indeed correct. (Better yet, look for yourself.) But what do the 44s refer to? 44 is the average of all the responses. When Perry writes “Republicans estimated the share at 46 percent” (we have similar language in the paper, which is regrettable as it can be easily misunderstood), it doesn’t mean that every Republican thinks so. It may not even mean that the median Republican thinks so. See OA 1.7 for medians, OA 1.8 for distributions, but see also OA 2.8.1, Table OA 2.18, OA 2.8.2, OA 2.11, and Table OA 2.23.

Key points =

1. Large majorities overestimate the share of party-stereotypical groups in the party, except for Evangelicals and Southerners.

2. Compared to what people think is the share of a group in the population, people still think the share of the group in the stereotyped party is greater. (But how much more varies a fair bit.)

3. People also generally underestimate the share of counter-stereotypical groups in the party.

Automating Understanding, Not Just ML

27 Jun

Some of the most complex parts of Machine Learning are largely automated. The modal ML person types in simple commands for very complex operations and voila! Some companies, like Microsoft (Azure) and DataRobot, also provide a UI for this. And this has generally not turned out well. Why? Because this kind of system does too little for the modal ML person and expects too much from the rest. So the modal ML person doesn’t use it. And the people who do use it, generally use it badly. The black box remains the black box. But not much is needed to place a lamp in this black box. Really, just two things are needed:

1. A data summarization and visualization engine, preferably with some chatbot feature that guides people smartly through the key points, including the problems. For instance, start with univariate summaries, highlighting ranges, missing data, sparsity, and such. Then, if it is a supervised problem, give people a bunch of loess plots or explain the ‘best fitting’ parametric approximations of the relationship with y in plain English, such as, “people who eat 1 more cookie live 5 minutes shorter on average.”

2. An explanation engine, including what the explanations of observational predictions mean. We already have reasonable implementations of this.

When you have both, you have automated complexity thoughtfully, in a way that empowers people, rather than creating a system that enables people to do fancy things badly.
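As a toy illustration of the plain-English readout described in the first item (the data and the exact numbers below are made up; the phrasing echoes the cookie example above):

```python
import numpy as np

# Hypothetical data: cookies eaten per day vs. minutes of life lost.
cookies = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
minutes_lost = np.array([1, 6, 9, 16, 21, 24, 31, 34, 41, 44])

# Best-fitting line: minutes_lost = slope * cookies + intercept.
slope, intercept = np.polyfit(cookies, minutes_lost, 1)

# Prints: "People who eat 1 more cookie live about 5 minutes shorter on average."
print(f"People who eat 1 more cookie live about {slope:.0f} minutes shorter on average.")
```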

Talking On a Tangent

22 Jun

What is the trend over the last X months? One estimate of the ‘trend’ over the last k time periods is what I call the ‘hold up the ends’ method: look at t_k and t_0, take the difference between the two, and divide by the number of time periods. If t_k > t_0, you say that things are going up. If t_k < t_0, you say things are going down. And if they are the same, you say things are flat. But this method can elide important non-linearity. For instance, say unemployment went down in the first 9 months and then went up over the last 3 but ended with t_k < t_0. What is the trend?

If by trend we mean the average slope over the last k time periods, and if there is no measurement error, then the ‘hold up the ends’ method is reasonable. If there is measurement error, we would want to smooth the time series before we hold up the ends.

Often people also care about ‘consistency’ in the trend. One estimate of consistency is the proportion of times we get a difference of the same sign when we compare any two consecutive time periods. And often people care more about later time periods than earlier ones. One could build on that intuition by weighting later changes more.
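Here is a minimal sketch of these ideas on a made-up unemployment series. The ‘consistency’ measure is one reading of the definition above (the share of consecutive changes that move in the same direction as the end-to-end change), and the recency weighting is just one possible scheme.

```python
def hold_up_the_ends(series):
    """Average slope: (last - first) / number of elapsed periods."""
    return (series[-1] - series[0]) / (len(series) - 1)

def consistency(series):
    """Share of consecutive changes that share the sign of the end-to-end change."""
    overall = series[-1] - series[0]
    changes = [b - a for a, b in zip(series, series[1:])]
    return sum(1 for c in changes if c * overall > 0) / len(changes)

def recency_weighted_slope(series):
    """Weight later changes more: the change ending at period s gets weight s."""
    changes = [b - a for a, b in zip(series, series[1:])]
    weights = range(1, len(changes) + 1)
    return sum(w * c for w, c in zip(weights, changes)) / sum(weights)

# Unemployment falls early on, ticks up in the last three months, but ends lower.
unemployment = [6.0, 5.8, 5.6, 5.3, 5.1, 4.9, 4.8, 4.7, 4.6, 4.7, 4.9, 5.0]
print(hold_up_the_ends(unemployment))        # negative: the ends say 'down'
print(consistency(unemployment))             # well below 1: the decline was not consistent
print(recency_weighted_slope(unemployment))  # much closer to flat: later changes count more
```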