The Base ML Model

12 Jul

The days of the artisanal ML model are mostly over. The artisanal model builds on domain “knowledge” (which is often considerably less than that, bordering on misinformation). The artisan has long discussions with domain experts about which variables to include and how to include them in the model, often making idiosyncratic decisions about both. Or the artisan thinks deeply and draws on his own well. The artisan then applies a couple of methods to the final feature set of tens of variables, and out pops “the” model. This is borderline farcical when the datasets are both long and wide. For supervised problems, the low-cost, scalable, common-sense thing to do is to implement the following workflow:

1. Get good univariate summaries of each column in the data: mean, median, min, max, sd, and n_missing for numeric columns; the number of unique values, n_missing, and frequency counts for categorical columns; etc. Use these to diagnose and understand the data. What is common? Which variables have bad data? (See pysum.)

2. Get good bivariate summaries. Correlations for continuous variables and differences in means for categorical variables are reasonable. Use these to understand how the variables are related.

3. Create a dummy vector flagging missing values for each variable.

4. Subset on non-sparse columns.

5. Regress the outcome on all non-sparse columns, ideally using a neural network (NN), so that you are not in the business of hand-crafting interactions and such.
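Steps 1 through 4 can be sketched in a few lines of pandas. This is a minimal illustration, not a reference implementation. The toy data, the helper names (`univariate_summary`, `add_missing_indicators`, `drop_sparse`), and the definition of "sparse" (a column where fewer than `min_fill` of the rows are non-missing and non-zero) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def univariate_summary(df):
    """Step 1: one row of summary statistics per column."""
    rows = []
    for col in df.columns:
        s = df[col]
        row = {"column": col, "n_missing": int(s.isna().sum())}
        if pd.api.types.is_numeric_dtype(s):
            row.update(mean=s.mean(), median=s.median(),
                       min=s.min(), max=s.max(), sd=s.std())
        else:
            # nunique() and mode() both ignore missing values by default
            row.update(n_unique=s.nunique(),
                       top=s.mode().iloc[0] if s.notna().any() else None)
        rows.append(row)
    return pd.DataFrame(rows).set_index("column")


def add_missing_indicators(df):
    """Step 3: add a 0/1 '<col>_missing' column for every column."""
    flags = df.isna().astype(int).add_suffix("_missing")
    return pd.concat([df, flags], axis=1)


def drop_sparse(df, min_fill=0.25):
    """Step 4: keep columns where at least min_fill of rows are
    non-missing and non-zero (an assumed definition of 'non-sparse')."""
    filled = (df.notna() & (df != 0)).mean()
    return df.loc[:, filled >= min_fill]


# Hypothetical toy data
df = pd.DataFrame({
    "age": [34, 51, np.nan, 47, 29, 62],
    "bmi": [22.1, 30.4, 27.8, np.nan, 24.0, 31.2],
    "rare_lab": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
    "sex": ["f", "m", "f", "m", None, "f"],
})

summary = univariate_summary(df)               # step 1
corr = df.select_dtypes("number").corr()       # step 2 (numeric columns only)

X = add_missing_indicators(df)                 # step 3
X = pd.get_dummies(X, columns=["sex"])         # one-hot encode the categorical
X = X.fillna(0)                                # missingness now lives in the flags
X = drop_sparse(X, min_fill=0.25)              # step 4

print(summary)
print(X.columns.tolist())
```

Here `rare_lab` is dropped because it is almost entirely missing, while `rare_lab_missing` survives because the missingness itself is common. The resulting `X` is what step 5 would hand to a learner, for instance scikit-learn's `MLPClassifier`, along with the outcome column.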

I have glossed over a lot of detail. So let’s take a more concrete example. Say you are predicting whether someone will be diagnosed with diabetes in year y given the claims they made in years y-1, y-2, y-3, etc. Say each service and medicine claimed has a unique code. Tokenize all the claim data so that each unique code gets its own column, and filter on the non-sparse codes. How much information about time you preserve is up to you. But for the first cut, roll up the data so that code X is treated the same no matter which year the claim was made in. Voila! You have your baseline model.
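The roll-up can be sketched as follows, assuming the claims arrive as a long table with one row per claim. The column names, the codes, and the `min_share` threshold are hypothetical. Treating a code as a 0/1 "ever claimed" flag is one reading of "treated equally"; dropping the `clip` call would keep counts across years instead.

```python
import pandas as pd

# Hypothetical long-format claims: one row per (person, year, code)
claims = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2, 3, 3, 4],
    "year":      [2019, 2020, 2021, 2020, 2021, 2019, 2021, 2021],
    "code":      ["A10", "B20", "A10", "A10", "C30", "B20", "A10", "A10"],
})

# One column per code; ignore the year so a code counts the same
# whenever it was claimed. clip(upper=1) turns counts into 0/1 flags.
X = pd.crosstab(claims["person_id"], claims["code"]).clip(upper=1)

# Keep only codes claimed by at least min_share of people
# (an assumed definition of "non-sparse").
min_share = 0.5
keep = X.columns[X.mean() >= min_share]
X = X[keep]

print(X)
```

In this toy data, `C30` is claimed by one person in four and is filtered out, leaving a person-by-code matrix that can be joined to the diabetes outcome for year y and passed to the model.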