Clustering and then Classifying: Improving Prediction for Crudely-labeled and Mislabeled Data

8 Jun

Mislabeled and crudely labeled data are common problems in data science. Supervised prediction on such data expectedly yields poor results: coefficients are biased, and accuracy with regard to the true label is poor. One solution is to hand-code labels for an 'adequate' sample and infer true labels from a model trained on that sample.

Another solution relies on the intuition (assumption) that the distance between rows (covariates) with the same label will be lower than the distance between rows with different labels. One way to leverage that intuition is to cluster the data within each observed label, use the clusters to infer true labels from the erroneous ones, and then predict the inferred true labels. For a class of problems, the method can be shown to always improve accuracy. (You can also predict just the cluster labels.)

Here’s one potential solution for a scenario where we have a binary dependent variable:

Assume a mislabeled vector called mis_label that codes some true 0s as 1s and some true 1s as 0s.

  1. For each value of mis_label (1 and 0), use k-means with k = 2 to get 2 clusters within each label, for a total of 4 clusters.
  2. Assuming the mislabeling rate is below 50%, create a new column, est_true_label. For rows with mis_label = 1, it takes the value 1 if the row falls in the majority cluster (the cluster holding more than 50% of the mis_label = 1 rows) and 0 otherwise. For rows with mis_label = 0, it takes the value 0 if the row falls in the majority cluster (the cluster holding more than 50% of the mis_label = 0 rows) and 1 otherwise.
  3. Predict est_true_label using logistic regression. Estimate accuracy against the true labels, and estimate bias in the coefficients by comparing them to the coefficients from a logistic regression fit on the true labels.
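
The three steps can be sketched in Python on simulated data. This is a minimal illustration under stated assumptions, not a reference implementation: the data (two well-separated Gaussian blobs), the 20% random flip rate, and the helper names two_means, estimate_true_labels, fit_logistic, and predict are all invented for the example. To keep the sketch dependent on NumPy alone, k-means is a hand-rolled Lloyd's algorithm with a farthest-point initialization, and logistic regression is fit by plain gradient descent. A library implementation of either would normally be preferable.

```python
import numpy as np

def two_means(X, n_iter=50):
    """Lloyd's algorithm with k = 2; returns a 0/1 cluster id per row.
    Initialization (an assumption of this sketch): the first row and the
    row farthest from it."""
    far = ((X - X[0]) ** 2).sum(axis=1).argmax()
    centers = X[[0, far]].astype(float)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(2):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign

def estimate_true_labels(X, mis_label):
    """Steps 1 and 2: cluster within each observed label and flip the
    label of rows that fall in the minority cluster."""
    est = mis_label.copy()
    for lab in (0, 1):
        idx = np.flatnonzero(mis_label == lab)
        assign = two_means(X[idx])
        # id of the smaller cluster (ties resolve to cluster 0)
        minority = int(np.sum(assign == 1) < np.sum(assign == 0))
        est[idx[assign == minority]] = 1 - lab
    return est

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Unregularized logistic regression via batch gradient descent.
    Returns [intercept, slope_1, slope_2, ...]."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict(X, w):
    Xb = np.column_stack([np.ones(len(X)), X])
    return (Xb @ w > 0).astype(int)

# Simulated data (assumed for illustration): two blobs, 20% of labels flipped.
rng = np.random.default_rng(0)
n = 1000
X = np.vstack([rng.normal(0.0, 1.0, (n, 2)), rng.normal(4.0, 1.0, (n, 2))])
true_label = np.repeat([0, 1], n)
flip = rng.random(2 * n) < 0.2
mis_label = np.where(flip, 1 - true_label, true_label)

est_true_label = estimate_true_labels(X, mis_label)

# Step 3: fit on each label vector, then compare against the truth.
w_true = fit_logistic(X, true_label)
w_mis = fit_logistic(X, mis_label)
w_est = fit_logistic(X, est_true_label)

print("label agreement with truth, mis_label:     ", (mis_label == true_label).mean())
print("label agreement with truth, est_true_label:", (est_true_label == true_label).mean())
print("model accuracy vs truth, fit on mis_label:     ", (predict(X, w_mis) == true_label).mean())
print("model accuracy vs truth, fit on est_true_label:", (predict(X, w_est) == true_label).mean())
print("coefficients (true, mis, est):", w_true, w_mis, w_est)
```

The blobs here are far apart on purpose, which is exactly the distance assumption the method rests on. When the classes overlap heavily, or when mislabeling depends on the covariates, the minority cluster need not line up with the mislabeled rows, so it is worth checking the cluster sizes before trusting est_true_label.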