Prediction Errors: Using ML For Measurement

Say you want to measure how often people visit pornographic domains over some period. To measure that, you build a model to predict whether or not a domain hosts pornography. And let’s assume that, at the chosen classification threshold, the false positive (FP) rate is 10% and the false negative (FN) rate is 7%. Below, we discuss some of the concerns with using scores from such a model and ways to address them.

Let’s get some notation out of the way. Say we have n users and we iterate over them using i. Let’s denote the total number of unique domains (domains visited by at least one of the n users during the observation window) by k, and let’s use j to iterate over the domains. Let’s denote the number of visits by user i to domain j by c_{ij} \in {0, 1, 2, ...}. Let’s denote the total number of unique domains a person visits by t_i = \sum_j 1(c_{ij} > 0). Lastly, let’s denote the predicted labels for whether or not each domain hosts pornography by p, so we have p_1, ..., p_j, ..., p_k.
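To make the notation concrete, here is a minimal sketch in Python. The arrays c, p, and t are illustrative stand-ins for the quantities defined above, not part of any particular library or dataset.

```python
import numpy as np

# Illustrative setup: n users, k unique domains observed across all users.
n, k = 2, 5

# c[i, j] = number of visits by user i to domain j (non-negative integers).
c = np.array([
    [1, 1, 1, 0, 0],   # user 1 visits the first three domains once each
    [1, 1, 1, 1, 1],   # user 2 visits all five domains once each
])

# p[j] = predicted label for domain j (1 = predicted to host pornography).
p = np.array([1, 1, 1, 1, 1])

# t[i] = number of unique domains user i visits: sum_j 1(c_ij > 0).
t = (c > 0).sum(axis=1)
print(t)  # [3 5]
```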

Let’s start with a simple point. Say there are 5 domains, all of which are predicted to host pornography, i.e., p = {1, 1, 1, 1, 1}. Say user one visits the first three domains once each and user two visits all five domains once each. If 10% of the predicted positives are false positives, the expected measurement error in user one’s score = 3 * .10 and the expected measurement error in user two’s score = 5 * .10. The general point is that the expected number of false positives increases with the number of predicted 1s a user visits, and the expected number of false negatives increases with the number of predicted 0s.
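Continuing the sketch above, one way to see this is to compute each user’s naive score alongside the expected over- and undercount, if we (loosely) treat 10% of the predicted 1s a user visits as false positives and 7% of the predicted 0s as false negatives. The rate constants and variable names here are illustrative, not from the original setup.

```python
FP_RATE = 0.10  # assumed share of visited predicted-1 domains that are wrong
FN_RATE = 0.07  # assumed share of visited predicted-0 domains that are wrong

# Naive per-user score: number of unique visited domains predicted as porn.
predicted_porn_visits = ((c > 0) & (p == 1)).sum(axis=1)

# Expected overcount from false positives and undercount from false negatives.
expected_fp = ((c > 0) & (p == 1)).sum(axis=1) * FP_RATE
expected_fn = ((c > 0) & (p == 0)).sum(axis=1) * FN_RATE

print(predicted_porn_visits)  # [3 5]
print(expected_fp)            # [0.3 0.5]
print(expected_fn)            # [0. 0.]
```

The point the arithmetic makes is visible directly: user two’s expected overcount (0.5) is larger than user one’s (0.3) simply because they visit more predicted-1 domains.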

