Benchmark datasets such as MNIST and ImageNet abound in machine learning. They stimulate work on a problem by providing an agreed-upon mark to beat. Many benchmark datasets, however, are constructed in an ad hoc manner. As a result, it is hard to understand why the best-performing models vary across benchmarks (see here), to compare models fairly, and to predict with confidence how a model will perform on a new dataset. To address these issues, the following paragraphs lay out a framework for building a good benchmark dataset.
I am looking for feedback. Please let me know how this framework can be improved.