By Gaurav in ML/Statistics — 31 Aug 2018

Prediction Errors: Using ML For Measurement

Say you want to measure how often people visit pornographic domains over some period. To do that, you build a model to predict whether or not a domain hosts pornography. And let’s assume that for the chosen classification threshold, the False Positive rate (FP) is 10% and the False Negative rate (FN) is 7%. Below, we discuss some of the concerns with using scores from such a model and ways to address them.

Let’s get some notation out of the way. Let’s say that we have $n$ users and that we can iterate over them using $i$. Let’s denote the total number of unique domains (domains visited by any of the $n$ users at least once during the observation window) by $k$. And let’s use $j$ to iterate over the domains. Let’s denote the number of visits to domain $j$ by user $i$ by $c_{ij} \in \{0, 1, 2, \ldots\}$. And let’s denote the total number of unique domains a person visits, $\sum_{j} \mathbb{1}(c_{ij} \geq 1)$, using $t_i$. Lastly, let’s denote predicted labels about whether or not each domain hosts pornography by $p$, so we have $p_1, \ldots, p_j, \ldots, p_k$.
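
To make the notation concrete, here is a minimal sketch in Python. The visit counts and predicted labels are made up for illustration, and the variable names (`c`, `t`, `p`, `visits_pred_pos`) are just one way to mirror the symbols above.

```python
import numpy as np

# Hypothetical visit counts c[i, j]: rows are users (i), columns are domains (j).
# Every column has at least one nonzero entry, since k only counts domains
# visited by at least one user.
c = np.array([
    [1, 1, 1, 0, 0],  # user 1
    [1, 1, 1, 1, 1],  # user 2
    [0, 4, 0, 0, 2],  # user 3
])

n, k = c.shape  # n users, k unique domains

# t[i]: number of unique domains user i visited at least once.
t = (c >= 1).sum(axis=1)

# Hypothetical predicted labels p_1, ..., p_k (1 = predicted to host pornography).
p = np.array([1, 1, 1, 0, 1])

# Visits by each user to domains predicted to host pornography.
visits_pred_pos = c @ p
```

With these made-up numbers, `t` counts distinct domains per user while `visits_pred_pos` counts visits, so a user who returns to the same flagged domain several times gets a higher score without a higher `t`.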

Let’s start with a simple point. Say there are 5 domains, all predicted to host pornography: $p = \{1_1, 1_2, 1_3, 1_4, 1_5\}$. Let’s say user one visits the first three sites once, and user two visits all five sites once. Given 10% of the predictions are false positives, the total expected measurement error in user one’s score $= 3 \times 0.10$ and the total expected measurement error in user two’s score $= 5 \times 0.10$. The general point is that total false positives increase as a function of predicted $1$s. And the total number of false negatives increases with the number of predicted $0$s.
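
The arithmetic for the two users can be sketched in a few lines of Python. This is only an illustration under the post's assumption that 10% of predicted 1s are wrong; the names `fp_share`, `pred_pos_visits`, and `expected_fp` are mine, not part of any library.

```python
import numpy as np

# Assumption from the example: 10% of predicted 1s are false positives.
fp_share = 0.10

# All five domains are predicted to host pornography.
p = np.ones(5, dtype=int)

# Visit counts: user one visits the first three sites once,
# user two visits all five sites once.
c = np.array([
    [1, 1, 1, 0, 0],  # user one
    [1, 1, 1, 1, 1],  # user two
])

# Visits to predicted-positive domains per user: 3 and 5.
pred_pos_visits = c @ p

# Expected number of false-positive visits per user: about 0.3 and 0.5.
expected_fp = fp_share * pred_pos_visits
```

The expected error scales linearly with the number of visits to predicted 1s, so the two users' scores carry different amounts of error even though the classifier is the same.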
