Nextdoor

28 Nov

The KNN classifier is one of the most intuitive ML algorithms. It predicts class by polling k nearest neighbors. Because it seems so simple, it is easy to miss a couple of the finer points:

  1. Sample Splitting: Traditionally, when we split the sample, there is no peeking across samples. For instance, when we split the sample between a train and test set, we cannot look at the data in the training set when predicting the label for a point in the test set. In knn, this segregation is not observed. Say we partition the training data to learn the optimal k. When predicting a point in the validation set, we must pass the entire training set. Passing the points in the validation set would be bad because then the optimal k will always be 0. (If you ignore k = 0, you can pass the rest of the dataset.)
  2. Implementation Differences: “Regarding the Nearest Neighbors algorithms, if it is found that two neighbors, neighbor k+1 and k, have identical distances but different labels, the results will depend on the ordering of the training data.” (see here; emphasis mine.)

    This matters when the distance metric is discrete, e.g., if you use an edit-distance metric to compare strings. Worse, scikit-learn doesn’t warn users during analysis.

    In R, one popular implementation of KNN is in a package called class. (Overloading the word class seems like a bad idea but that’s for a separate thread.) In class, how the function deals with this scenario is decided by an explicit option: “If [the option is] true, all distances equal to the kth largest are included. If [the option is] false, a random selection of distances equal to the kth is chosen to use exactly k neighbours.”

    For the underlying problem, there isn’t one clear winning solution. One way to solve the problem is to move from knn to adaptive knn: include all points that are as far away as the kth point. This is what class in R does when the option all.equal is set to True. Another solution is to never change the order in which the data are accessed and to make the order as part of how the model is exported.