A/B Testing Recommendation Systems

1 Apr

Say that you are building a news recommender that picks the relevant news items to list in each person’s news feed. Say that your first version of the news recommender is a rules-based system that uses signals like how many people in your network have seen the news, how many people in total have read the news, the freshness of the news, etc., and sums up the signals in an arbitrary way to rank news items. Your second version uses the same signals but uses a supervised model to decide on the optimal weights.
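To make the difference between the two versions concrete, here is a minimal sketch in which both rankers share one scoring function and differ only in their weights. The signal names, the item values, and both weight vectors are made up for illustration; in particular, MODEL_WEIGHTS stands in for whatever a fitted supervised model would produce.

```python
from typing import Dict, List

# Hypothetical signal names; the post describes the signals only informally.
SIGNALS = ["network_views", "total_views", "freshness"]

# Version 1: hand-picked (arbitrary) weights.
RULE_WEIGHTS = {"network_views": 1.0, "total_views": 1.0, "freshness": 1.0}
# Version 2: made-up stand-in for weights learned by a supervised model.
MODEL_WEIGHTS = {"network_views": 2.5, "total_views": 0.1, "freshness": 1.2}


def score(item: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the item's signals."""
    return sum(weights[s] * item[s] for s in SIGNALS)


def rank(items: List[dict], weights: Dict[str, float]) -> List[str]:
    """Return item ids ordered from highest to lowest score."""
    ordered = sorted(items, key=lambda it: score(it, weights), reverse=True)
    return [it["id"] for it in ordered]


# Toy items with signals already scaled to [0, 1].
items = [
    {"id": "a", "network_views": 0.9, "total_views": 0.2, "freshness": 0.1},
    {"id": "b", "network_views": 0.1, "total_views": 0.9, "freshness": 0.6},
    {"id": "c", "network_views": 0.3, "total_views": 0.3, "freshness": 0.9},
]

rule_ranking = rank(items, RULE_WEIGHTS)
model_ranking = rank(items, MODEL_WEIGHTS)
```

With these toy numbers the two rankings disagree on the top item, which is the situation the rest of the post is about.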

Say that you find that the recommendations vary a fair bit between the two systems. But which one is better? To suss that out, you conduct an A/B test. But a naive experiment will produce biased estimates of both the effect and its standard error (s.e.) because:

  1. The signals on which your control group's ranking system is based are influenced by the kinds of news articles that people in the treatment group see, and vice versa.
  2. There is an additional source of stochasticity in the recommendations that people see: the order in which people arrive matters.

The effect of the first concern is that our estimates are likely attenuated. To resolve the first issue, rank news articles for people in the control group using either views predicted from historical data or the pro-rated views of people assigned to the control group alone. (This adds a bit of noise to the control group estimates.) Likewise, keep a separate table of input data for the treatment group and apply the ML model to the pro-rated data from that table.
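One way to picture the separate tables is sketched below. It assumes that "pro-rated" means scaling an arm's raw view count by the inverse of that arm's share of traffic, so that the number is comparable to a full-population count. The class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict


class SplitWorldCounts:
    """Keeps one table of view counts per experimental arm."""

    def __init__(self, arm_shares: Dict[str, float]):
        # arm_shares maps arm name -> fraction of traffic assigned to it.
        self.arm_shares = arm_shares
        self.views = {arm: defaultdict(int) for arm in arm_shares}

    def record_view(self, arm: str, item_id: str) -> None:
        # A view only ever updates the table of the viewer's own arm.
        self.views[arm][item_id] += 1

    def prorated_views(self, arm: str, item_id: str) -> float:
        # Assumption: pro-rating = raw arm count / arm's traffic share.
        return self.views[arm].get(item_id, 0) / self.arm_shares[arm]


counts = SplitWorldCounts({"control": 0.5, "treatment": 0.5})
for _ in range(3):
    counts.record_view("control", "story-1")
for _ in range(10):
    counts.record_view("treatment", "story-1")

control_signal = counts.prorated_views("control", "story-1")      # 3 / 0.5
treatment_signal = counts.prorated_views("treatment", "story-1")  # 10 / 0.5
```

Each arm's ranker then reads only its own pro-rated signal, so what the treatment group sees never feeds into the control group's rankings. The pro-rated counts rest on fewer observations, which is the extra noise mentioned above.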

The consequence of the second issue is that the true s.e. is very plausibly much larger than the one we will get from split-world testing (where each condition gets its own table of counts for views, etc.). The sequence in which people arrive matters because, in a world with social influence, what early arrivals view shapes what later arrivals are shown. To resolve the second issue, you need to estimate how the sequence of arrival affects outcomes. But given the number of pathways, the best we can probably do is bound the effect. We could probably estimate the effect of ranking the least viewed item first as a way to bound the effects.
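A toy simulation can show why arrival order is a separate source of variance. The sketch below holds the set of users and their tastes fixed, makes reading deterministic, and varies only the order in which users arrive. Everything in it is an assumption made for illustration: the halving of attention with each rank position, the 0.3 reading threshold, and the function names. The least_viewed_first flag mimics the bounding idea of putting the least viewed item on top.

```python
import random
from typing import List


def make_users(n_users: int, n_items: int, seed: int) -> List[List[float]]:
    """Each user is a list of tastes in [0, 1), one per item."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_items)] for _ in range(n_users)]


def simulate(users: List[List[float]], order_seed: int,
             least_viewed_first: bool = False) -> List[int]:
    """Return final view counts per item for one arrival sequence."""
    arrival = list(users)
    random.Random(order_seed).shuffle(arrival)  # only the order varies
    n_items = len(users[0])
    counts = [0] * n_items
    for tastes in arrival:
        # Rank by current counts: most viewed first, or reversed for the bound.
        ranking = sorted(range(n_items), key=lambda i: counts[i],
                         reverse=not least_viewed_first)
        for pos, item in enumerate(ranking):
            # Assumed rule: attention halves per position; read above 0.3.
            if tastes[item] * 0.5 ** pos > 0.3:
                counts[item] += 1
    return counts


users = make_users(n_users=200, n_items=3, seed=42)
runs = [simulate(users, order_seed=s) for s in range(20)]
item0_views = [r[0] for r in runs]
spread = max(item0_views) - min(item0_views)  # variation due to order alone
reversed_counts = simulate(users, order_seed=0, least_viewed_first=True)
```

Here spread measures how much one item's final count moves across arrival orders with everything else fixed, and comparing runs against reversed_counts gives a rough sense of how far a deliberately adversarial ranking can push outcomes.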