Say that we want to measure how often people go to risky websites. Let’s assume that the measurement of risk is expensive. We have data on how often people visit each domain on the web from a large sample. The number of unique domains in the data is large, making measuring the population of domains impossible. Say there is a sharp skew in the visitation of domains. What is the fewest number of domains we need to measure to get s.e. of no greater than X per row?
Here are three ideas.
- Base. Sample domains in each row (with replacement) in proportion to views/time to get to the desired s.e. Then, collate the selected domains and get labels for those.
- Exploit the skew. For instance, sample from 99% of the distribution and save yourself from the long tail. Bound each estimate by the unsampled 1% (which could be anything) and enjoy. For greater accuracy, do a smaller, cruder sample of the 1% and get to the +/- 10% with an n = 100. The full version of this point is as follows: we benefit from increasing the probability of including more frequently occurring domains. Taken to the extremum, you could deterministically include the most frequent domains, and then prorate the size of the sample for the rest by the size of the area under the curve. This kind of strategy can help answer: how to optimally sample skewed distributions to get the smallest s.e. with the fewest observations?
- Cheap measures. The base measurement strategy may be expensive but it may be possible to come up with a cheaper, less accurate measurement strategy that you can apply to the long tail. Validate (and calibrate) the results with the expensive coding strategy for a randomly selected sample of respondents.