The KNN classifier is one of the most intuitive ML algorithms: it predicts a point's class by polling its k nearest neighbors. Because it seems so simple, it is easy to miss a couple of the finer points:

**Sample Splitting**: Traditionally, when we split the sample, there is no peeking across samples. For instance, when we split the sample into a train and a test set, we cannot look at the data in the training set when predicting the label for a point in the test set. In KNN, this segregation is not observed. Say we partition the training data to learn the optimal k. When predicting a point in the validation set, we must pass the entire training set. Passing the points in the validation set as well would be bad because each point is then its own nearest neighbor at distance zero, so the optimal k will always be 0, i.e., the point itself. (If you ignore *k = 0*, you can pass the rest of the dataset.)

**Implementation Differences**: “Regarding the Nearest Neighbors algorithms, if it is found that two neighbors, neighbor *k+1* and *k*, have identical distances but different labels, the **results will depend on the ordering of the training data.**” (From the scikit-learn documentation; emphasis mine.)

This matters when the distance metric is discrete, e.g., if you use an edit-distance metric to compare strings. Worse, scikit-learn doesn’t warn users during analysis.
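The order dependence is easy to reproduce. The sketch below is a toy pure-Python KNN, not scikit-learn's implementation. The names `hamming` and `knn_predict` are made up for illustration, and Hamming distance stands in for an edit distance to keep the code short. Ties are broken by training order only because Python's sort is stable.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k, dist=hamming):
    """Toy KNN: majority vote over the k closest training points.

    `train` is a list of (point, label) pairs. sorted() is stable, so points
    at the same distance keep their training order, which silently decides
    who makes the cut when there is a tie at the kth position.
    """
    ranked = sorted(train, key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# "cat" and "cot" are both at distance 1 from "cut" but carry different labels.
train = [("cat", "animal"), ("cot", "furniture"), ("dog", "animal")]
query = "cut"

pred_original = knn_predict(train, query, k=1)
pred_reversed = knn_predict(list(reversed(train)), query, k=1)
print(pred_original, pred_reversed)
```

The same training data, shuffled, yields a different label for the same query, and nothing in the output hints that a tie was involved.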

In R, one popular implementation of KNN is in a package called `class`. (Overloading the word `class` seems like a bad idea but that’s for a separate thread.) In `class`, how the function deals with this scenario is decided by an explicit option: “If [the option is] true, all distances equal to the kth largest are included. If [the option is] false, a random selection of distances equal to the kth is chosen to use exactly k neighbours.”

For the underlying problem, there isn’t one clear winning solution. One way to solve the problem is to move from KNN to adaptive KNN: include all points that are no farther away than the kth point. This is what `class` in R does when the option `use.all` is set to `TRUE`. Another solution is to never change the order in which the data are accessed and to make the order part of how the model is exported.
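As a rough illustration of the adaptive variant, the sketch below votes over every training point at most as far away as the kth nearest one. It is a pure-Python toy under my own assumptions, not a port of the R code: `adaptive_knn_votes` and `hamming` are hypothetical names, and Hamming distance again stands in for a discrete metric.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def adaptive_knn_votes(train, query, k, dist=hamming):
    """Count labels over every training point at most as far as the kth nearest."""
    dists = [dist(point, query) for point, _ in train]
    cutoff = sorted(dists)[k - 1]  # distance to the kth nearest neighbor
    return Counter(label for (_, label), d in zip(train, dists) if d <= cutoff)


# Three points tie at distance 1 from "cut"; all of them get a vote.
train = [("cat", "animal"), ("cot", "furniture"), ("cub", "animal"), ("dog", "animal")]

votes = adaptive_knn_votes(train, "cut", k=1)
votes_reversed = adaptive_knn_votes(list(reversed(train)), "cut", k=1)
print(votes, votes == votes_reversed)
```

Because every tied point is counted, the vote totals no longer depend on row order. The vote itself can still end in a tie, so a rule for that case is still needed.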