The Missing Plot

27 May

Datasets often contain missing values. And often enough—at least in social science data—values are missing systematically. So how do we visualize missing values? After all, they are missing.

Some analysts simply list-wise delete points with missing values. Others impute, replacing missing values with mean or median. Yet others use sophisticated methods to impute missing values. None of the methods, however, automatically acknowledge that any of the data are missing in the visualizations.

It is important to acknowledge missing data.

One can do it is by providing a tally of how much data are missing on each of the variables in a small table in the graph. Another, perhaps better, method is to plot the missing values as a function of a covariate. For bivariate graphs, the solution is pretty simple. Create a dummy vector that tallies missing values. And plot the dummy vector in addition to the data. For instance, see:
missing

(The script to produce the graph can be downloaded from the following GitHub Gist.)

In cases, where missing values are imputed, the dummy vector can also be used to ‘color’ the points that were imputed.