## The Base ML Model

12 Jul

The days of the artisanal ML model are mostly over. The artisanal model builds off domain “knowledge” (which can often be considerably less than that, bordering on misinformation). The artisan has long discussions with domain experts about which variables to include and how to include them in the model, often making idiosyncratic decisions about both. Or the artisan thinks deeply and draws on his own well. He then applies a couple of methods to a final feature set of tens of variables, and out pops “the” model. This is borderline farcical when the datasets are both long and wide. For supervised problems, the low-cost, scalable, common-sense thing to do is to implement the following workflow:

1. Get good univariate summaries of each column in the data: mean, median, min, max, sd, and n_missing for numeric columns; the number of unique values, n_missing, and frequency counts for categorical columns; etc. Use these to diagnose and understand the data. What is common? On which variables do we have bad data? (See pysum.)

2. Get good bivariate summaries. Correlations for continuous variables and differences in means for categorical variables are reasonable. Use this to understand how the variables are related. Use this to understand the data.

3. Create a dummy vector flagging missing values for each variable.

4. Subset on the non-sparse columns.

5. Regress the outcome on all the non-sparse columns, ideally using a neural network (NN), so that you are not in the business of hand-crafting interactions and such.
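The workflow above can be sketched in a few lines of pandas. This is a minimal illustration on made-up data, with arbitrary choices (the 5% sparsity threshold, and plain least squares standing in for the NN in step 5); all column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric features, one categorical, a target.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 80, 200).astype(float),
    "bmi": rng.normal(27, 4, 200),
    "smoker": rng.choice(["yes", "no"], 200),
    "y": rng.integers(0, 2, 200),
})
df.loc[df.sample(frac=0.1, random_state=0).index, "bmi"] = np.nan  # inject missingness

# 1. Univariate summaries for the numeric columns.
num = df.select_dtypes("number")
summaries = num.agg(["mean", "median", "min", "max", "std"]).T
summaries["n_missing"] = num.isna().sum()

# 2. Bivariate summaries: correlations among the numeric columns.
correlations = num.corr()

# 3. A missingness dummy for every column.
for col in list(df.columns):
    df[f"{col}_missing"] = df[col].isna().astype(int)

# 4. Keep the non-sparse columns (arbitrary threshold: at least 95% observed).
non_sparse = [c for c in df.columns if df[c].notna().mean() >= 0.95]

# 5. Regress y on the non-sparse features (OLS stands in for the NN here).
X = pd.get_dummies(df[non_sparse].drop(columns="y"), dtype=float).fillna(0.0)
X.insert(0, "intercept", 1.0)
beta, *_ = np.linalg.lstsq(X.to_numpy(), df["y"].to_numpy(), rcond=None)
```

Note that `bmi`, with 10% of values missing, falls below the sparsity threshold and drops out of the regression, while its missingness dummy stays in.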

I have elided a lot of detail. So let’s take a more concrete example. Say you are predicting whether someone will be diagnosed with diabetes in year y given the claims they made in years y-1, y-2, y-3, etc. Say each claim for a service or medicine carries a unique code. Tokenize the claim data so that each unique code gets its own column, and filter on the non-sparse codes. How much information about time you want to preserve is up to you. But for the first cut, roll up the data so that code X made in any year is treated equally. Voila! You have your baseline model.
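The tokenize-and-roll-up step can be sketched as follows. The claim codes, member IDs, and the 50% sparsity threshold are all made up for illustration; only the mechanics (one column per code, counts pooled across years, sparse codes dropped) come from the text.

```python
import pandas as pd

# Hypothetical claims: one row per (member, claim code) occurrence across
# prior years; "dm" flags a diabetes diagnosis in the target year.
claims = pd.DataFrame({
    "member": [1, 1, 1, 2, 2, 3],
    "code":   ["E11", "Z79", "E11", "I10", "Z79", "I10"],
})
target = pd.DataFrame({"member": [1, 2, 3], "dm": [1, 0, 0]})

# Tokenize and roll up: one column per code, counting occurrences
# regardless of which year the claim was made in.
features = pd.crosstab(claims["member"], claims["code"])

# Filter to non-sparse codes: keep those appearing for >= 50% of members.
keep = features.columns[(features > 0).mean() >= 0.5]
features = features[keep].reset_index()

# Join with the outcome to get the baseline modeling table.
baseline = features.merge(target, on="member")
```

Here the rare code `E11` (present for only one of three members) gets filtered out, leaving a dense member-by-code count matrix plus the label.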

## Optimal Sequence in Which to Schedule Appointments

1 Jul

Say that you have a travel agency. Your job is to book rooms at hotels. Some hotels fill up more quickly than others, and you want to figure out which hotels to book first so that your net booking rate is as high as it can be given the staff you have.

The logic of prioritization is simple: prioritize those hotels where the expected loss if you don’t book now is the largest. The only thing we need to do is find a way to formalize the losses. Going straight to formalization is daunting. A toy example helps.

Imagine that there are two hotels, Hotel A and Hotel B. If you call 2 days in advance, the chances of successfully booking a room are .8 at both. If you call 1 day in advance, the chances are .8 at Hotel A and .5 at Hotel B. You can only make one call a day, so it is Hotel A or Hotel B. Also assume that failing to book a room costs the same at either hotel.

If you were deciding 1 day out which hotel to call, the smart thing would be to choose Hotel A: the probability of making a booking is larger. But ‘larger’ can be formalized in terms of losses. On day 0, the probability of booking goes to 0. So waiting costs .8 units of expected loss at Hotel A and .5 at Hotel B. The potential loss from waiting is larger for Hotel A than for Hotel B.

If you were asked to choose 2 days out, which one should you pick? If you forgo Hotel A 2 days out, your chances of successfully booking a room there the next day are still .8. At Hotel B, the chances drop to .5. Let’s play out the two scenarios. If we book Hotel A 2 days out and Hotel B 1 day out, our expected batting average is (.8 + .5)/2. If we choose the opposite, our batting average is (.8 + .8)/2. It makes sense to choose the latter. Framed as expected losses: Hotel A goes from .8 to .8, or 0 expected loss, while Hotel B goes from .8 to .5, or .3 expected loss. So we should book Hotel B 2 days out.
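The two-hotel comparison is small enough to compute directly. A minimal sketch, using the toy probabilities from the example (the dictionary layout and function name are my own):

```python
# Success probabilities from the toy example: hotel -> {days_out: P(booking)}.
p = {"A": {2: 0.8, 1: 0.8}, "B": {2: 0.8, 1: 0.5}}

def loss_if_deferred(hotel, day):
    """Expected loss from skipping the call `day` days out and calling
    one day later; at 1 day out, deferring means never booking."""
    if day - 1 in p[hotel]:
        return p[hotel][day] - p[hotel][day - 1]
    return p[hotel][day]  # day 0: probability of booking drops to 0

# 1 day out: the loss from waiting is the full success probability.
losses_1day = {h: loss_if_deferred(h, 1) for h in p}
# 2 days out: the loss is the drop in success probability from waiting a day.
losses_2day = {h: loss_if_deferred(h, 2) for h in p}
# Prioritize the hotel with the larger expected loss from waiting.
call_first_2day = max(losses_2day, key=losses_2day.get)
```

As in the text, Hotel A loses nothing from waiting a day (2 days out), while Hotel B loses .3, so Hotel B gets the 2-days-out call.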

Now that we have the intuition, let’s move to 3 days, 2 days, and 1 day out, as that generalizes nicely to k days out. To understand the logic, let’s first work out a 101 probability question. Say that you have two fair coins that you toss independently. What is the chance of getting at least one head? The possible outcomes are HH, HT, TH, and TT, so the chance is 3/4. Or: 1 minus the chance of getting TT (two failures), i.e., 1 − .5 × .5.

The 3-days-out example is next; see the table below. If you miss the chance to call a hotel 3 days out, the expected loss is the decline in the chance of success from having to book 2 days or 1 day out instead. Assume the probabilities 2 days out and 1 day out are independent, and it becomes something like the coin example: the probability of successfully booking 2 days or 1 day out is 1 minus the probability of failing both times. Calculate the expected losses for each hotel, and you have a way to decide which hotel to call on day 3.

|       | 3-day | 2-day | 1-day |
|-------|-------|-------|-------|
| Hotel A | .9    | .9    | .4    |
| Hotel B | .9    | .9    | .9    |

In our example, the fallback success probabilities for Hotel A and Hotel B come to 1 − (1/10)(6/10) = .94 and 1 − (1/10)(1/10) = .99, respectively. The fallback is worse for Hotel A, so the expected loss from skipping it is larger, and we should call Hotel A 3 days out before we call Hotel B.
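The k-days-out logic can be sketched in a few lines, using the probabilities from the table and the independence assumption from the text (the function name is my own):

```python
import math

# Success probabilities 3, 2, and 1 day(s) out, from the table above.
p = {"A": [0.9, 0.9, 0.4], "B": [0.9, 0.9, 0.9]}

def success_if_skipped(probs):
    """P(at least one of the remaining, later calls succeeds), assuming
    the days are independent: 1 - product of the failure probabilities."""
    return 1 - math.prod(1 - q for q in probs[1:])

fallback = {h: success_if_skipped(probs) for h, probs in p.items()}
# A: 1 - 0.1 * 0.6 = 0.94;  B: 1 - 0.1 * 0.1 = 0.99.
# Call first the hotel whose fallback success probability is lowest,
# i.e., the one with the most to lose from waiting.
call_first = min(fallback, key=fallback.get)
```

This generalizes to any horizon: compute each hotel’s fallback probability over the remaining days and prioritize the hotel with the smallest one.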

## Code 44: How to Read Ahler and Sood

27 Jun

This is a follow-up to the hilarious Twitter thread about the sequence of 44s. Numbers in Perry’s 538 piece come from this paper.

First, yes, the 44s are indeed correct. (Better yet, look for yourself.) But what do the 44s refer to? 44 is the average of all the responses. When Perry writes “Republicans estimated the share at 46 percent” (we have similar language in the paper, which is regrettable as it can be easily misunderstood), it doesn’t mean that every Republican thinks so. It may not even mean that the median Republican thinks so. See OA 1.7 for medians, OA 1.8 for distributions, but see also OA 2.8.1, Table OA 2.18, OA 2.8.2, OA 2.11, and Table OA 2.23.

Key points:

1. Large majorities overestimate the share of party-stereotypical groups in the party, except for Evangelicals and Southerners.

2. Compared to what people think is the share of a group in the population, people still think the share of the group in the stereotyped party is greater. (But how much more varies a fair bit.)

3. People also generally underestimate the share of counter-stereotypical groups in the party.

## Automating Understanding, Not Just ML

27 Jun

Some of the most complex parts of Machine Learning are largely automated. The modal ML person types in simple commands for very complex operations and voila! Some companies, like Microsoft (Azure) and DataRobot, also provide a UI for this. And this has generally not turned out well. Why? Because this kind of system does too little for the modal ML person and expects too much from the rest. So the modal ML person doesn’t use it. And the people who do use it, generally use it badly. The black box remains the black box. But not much is needed to place a lamp in this black box. Really, just two things are needed:

1. A data summarization and visualization engine, preferably with some chatbot feature that guides people smartly through the key points, including the problems. For instance, start with univariate summaries, highlighting ranges, missing data, sparsity, and such. Then, if it is a supervised problem, give people a bunch of loess plots or explain the ‘best fitting’ parametric approximations of y in plain English, such as, “people who eat one more cookie live 5 minutes shorter on average.”

2. An explanation engine, including what the explanations of observational predictions mean. We already have reasonable implementations of this.

When you have both, you have automated complexity thoughtfully, in a way that empowers people, rather than creating a system that enables people to do fancy things badly.

## Talking On a Tangent

22 Jun

What is the trend over the last X months? One estimate of the ‘trend’ over the last k time periods is what I call the ‘hold up the ends’ method: look at t_k and t_0, take the difference between the two, and divide by the number of time periods. If t_k > t_0, you say things are going up; if t_k < t_0, you say things are going down; and if they are the same, you say things are flat.

But this method can elide important non-linearity. For instance, say unemployment went down in the first 9 months and then went up over the last 3, but ended with t_k < t_0. What is the trend? If by trend we mean the average slope over the last k time periods, and if there is no measurement error, then the ‘hold up the ends’ method is reasonable. If there is measurement error, we would want to smooth the time series before holding up the ends.

Often people care about ‘consistency’ in the trend. One estimate of consistency is the following: the proportion of times we get a change of the same sign when we compare any two consecutive time periods. Often people also care more about later time periods than earlier ones, and one could build on that intuition by weighting later changes more.

## Targeting 101

22 Jun

Targeting Economics

Say that there is a company that makes more than one product. And users of any one of its products don’t use all of its products. In effect, the company has a *captive* audience. The company can run an ad in any of its products about one or more of the other products that a user doesn’t use. Should it consider targeting—showing different (numbers of) ads to different users? There are five things to consider:

* Opportunity Cost: If the opportunity is limited, could the company make more profit by showing an ad about something else?
* The Cost of Showing an Ad to an Additional User: The cost of serving an ad; it is close to zero in the digital economy.
* The Cost of a Worse Product: As a result of seeing an irrelevant ad in the product, the user likes the product less. (The magnitude of the reduction depends on how disruptive the ad is and how irrelevant it is.) The company suffers in the end as its long-term profits are lower.
* Poisoning the Well: Showing an irrelevant ad means that people are more likely to skip whatever ad you present next. It reduces the company’s ability to pitch other products successfully.
* Profits: On the flip side of the ledger are expected profits. What are the expected profits from showing an ad? If you show a user an ad for a relevant product, they may not just buy and use the other product, but may also become less likely to switch from your stack. Further, they may even proselytize your product, netting you more users.

I formalize the problem here (pdf).

## Firmly Against Posing Firmly

31 May

“What is crucial for you as the writer is to express your opinion firmly,” writes William Zinsser in “On Writing Well: An Informal Guide to Writing Nonfiction.” To emphasize the point, Bill repeats the point at the end of the paragraph, ending with, “Take your stand with conviction.”

This advice is not for all writers—Bill particularly wants editorial writers to write with a clear point of view.

When Bill was an editorial writer for the New York Herald Tribune, he attended a daily editorial meeting to “discuss what editorials … to write for the next day and what position …[to] take.” Bill recollects,

“Frequently [they] weren’t quite sure, especially the writer who was an expert on Latin America.

“What about that coup in Uruguay?” the editor would ask. “It could represent progress for the economy,” the writer would reply, “or then again it might destabilize the whole political situation. I suppose I could mention the possible benefits and then—”

The editor would admonish such uncertainty with a curt “let’s not go peeing down both legs.”

Bill approves of taking a side. He likes what the editor is saying if not the language. He calls it the best advice he has received on writing columns. I don’t. Certainty should only come from one source: conviction born from thoughtful consideration of facts and arguments. Don’t feign certainty. Don’t discuss concerns in a perfunctory manner. And don’t discuss concerns at the end.

Surprisingly, Bill agrees with the last bit about not discussing concerns in a perfunctory manner at the end. But for a different reason. He thinks that “last-minute evasions and escapes [cancel strength].”

Don’t be a mug. If there are serious concerns, don’t wait until the end to note them. Note them as they come up.

## Sigh-tations

1 May

As a species, we still know very little about the world. But what we know already far exceeds what any of us can learn in a lifetime.

Scientists are acutely aware of the point. They must specialize, as chances of learning all the key facts about anything but the narrowest of the domains are slim. They must also resort to shorthand to communicate what is known and what is new. The shorthand that they use is—citations. However, this vital building block of science is often rife with problems. The three key problems with how scientists cite are:

1. Cite in an imprecise manner. “This broad claim is supported by X.” Or, “our results are consistent with XYZ.” (“Our results are consistent with” reflects directional thinking rather than thinking in terms of effect sizes. It means all sorts of effect sizes are ‘consistent,’ even those 10x as large.) For an example of how I think work should be cited, see Table 1 of this paper.

2. Do not carefully read what they cite. This includes misstating key claims and citing retracted articles approvingly (see here). The corollary is that scientists do not closely scrutinize the papers they cite, with the extent of scrutiny explained by how much they agree with the results (see the next point). For a provocative example, see here.

3. Cite in a motivated manner. Scientists ‘up’ the thesis of articles they agree with, for instance, misstating correlation as causation. And they blow up minor methodological points in articles whose results are ‘inconsistent’ with their own. (A brief note on motivated citations: here.)

## “Cosal” Inference

27 Apr

We often make causal claims based on fallible heuristics. Some of the heuristics that we commonly use to make causal claims are:

1. Selecting on the dependent variable. How often have you seen a magazine article with a title like “Five Habits of Successful People”? The implicit message in such articles is that if you were to develop these habits, you would be successful too. The articles never discuss how many unsuccessful people have the same habits or all the other dimensions on which successful and unsuccessful people differ.
2. Believing that correlation implies causation. A common example goes like this: children who watch more television are more violent. From this data, people deduce that watching television causes children to be violent. It is possible, but there are other potential explanations.
3. Believing that events that happen in a sequence are causally related. B follows A, so A must have caused B. Often there isn’t just one A but lots of As, and B doesn’t instantaneously follow A.

Beyond this, people also tend to interpret vague claims such as X causes Y as X causes large changes in Y. (There is likely some motivated aspect to how this interpretation happens.)