Say that we want to find people to whom a product is relevant. One way to do that is to launch a small campaign advertising the product and learn from people who click on the ad, or better yet, learn from people who not just click on the ad but go and try out the product and end up using it. But if you didn’t have the luxury of running a small campaign and waiting a while, you can learn from organic growth.
Conventionally, people learn from organic growth by posing it as a supervised problem. And they generate the labels as follows: people who have ‘never’ (mostly: in the last 6–12 months) used the product are labeled as 0 and people who “adopted” the product in the latest time period, e.g., over the last month, are labeled 1. People who have used the product in the last 6–12 months or so are filtered out.
There are three problems with generating labels this way. First, not all the people who ‘adopt’ a product continue to use the product. Many of the people who try it find that it is not useful or find the price too high and abandon it. This means that a lot of 1s are mislabeled. Second, the cleanest 1s are the people who ‘adopted’ the product some time ago and have continued to use it since. Removing those is thus a bad idea. Third, the good 0s are those who tried the product but didn’t persist with it not those who never tried the product. Generating the labels in such a manner also allows you to mitigate one of the significant problems with learning from organic growth: people who organically find a product are different from those who don’t. Here, you are subsetting on the kinds of people who found the product, except that one found it useful and another did not. This empirical strategy has its problems, but it is distinctly better than the conventional approach.