You may have heard that most published research is false (Ioannidis). But what you probably don’t know is that most corporate data science is also false.
Gaurav Sood
The returns on data science in most companies are likely sharply negative. There are a few reasons for that. First, as with any new ‘hot’ field, the skill level of the average worker is low. Second, the skill level of the people managing these workers is also low—most struggle to pose good questions, and when they stumble on one, they struggle to answer it well. Third, data science often fails silently (or there is enough corporate noise that most failures hide in plain sight), so the opportunity to learn from mistakes is small. And if that were not enough, many companies reward speed over correctness and, in doing so, often obtain neither.
How can we improve on the status quo? The obvious remedy for the first two issues is to increase skill levels by improving training or creating specializations. And one remedy for the latter two is to create incentives for doing things correctly.
Increasing training and creating specializations in data science is expensive and slow. Vital, but slow. Creating the right incentives for good data science work is not trivial either. There are at least two large forces lined up against it: incompetent supervisors and the collaborative nature of the work—projects usually involve multiple people and a fluid exchange of ideas, which makes it hard to attribute credit and blame. Only the first is fixable—the second is a property of the work. And fixing it comes down to making technical competence a much more important criterion for hiring.
Aside from hiring more competent workers or increasing the competence of the workers you have, you can also approximate the effect with checklists—increase quality by building in a few “pause points”: moments during a process where the person (or team) pauses and goes through a standard list of questions.
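To make the idea concrete, here is a minimal sketch of what encoding a pause point as a checklist could look like; the stage names and questions below are illustrative, not a canonical list.

```python
# Illustrative pause-point checklist; the stages and questions are examples only.
PAUSE_POINTS = {
    "before_analysis": [
        "What decision will this analysis inform?",
        "What is the statistical analog of the business question, and what is lost in translation?",
        "What data would we want if money were no object, and how far short do we fall?",
    ],
    "before_shipping": [
        "Do key variables actually measure the quantities of interest?",
        "Do the results survive obvious robustness checks?",
        "Are the limitations written down next to the findings?",
    ],
}

def run_pause_point(stage: str) -> None:
    """Print the questions for a stage so the team can walk through them."""
    print(f"--- pause point: {stage} ---")
    for question in PAUSE_POINTS[stage]:
        print(f"[ ] {question}")

run_pause_point("before_analysis")
```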
To give body to the boast, let me list some common sources of failure in data science and how checklists at different pause points may reduce it.
- Learn what you will lose in translation. Good data science begins with a good understanding of the problem you are trying to solve. Once you understand the problem, you need to translate it into a suitable statistical analog, and you need to be aware of what gets lost in that translation.
- Learn the limitations. Learn what data you would love to have to answer the question if money were no object. Use that ideal to understand how far you fall short, and then come to a judgment about whether the question can be answered reasonably with the data at hand.
- Learn how good the data are. You may think you have the data, but it is best to verify that you do. For instance, it is good practice to think through the extent to which a variable actually captures the quantity of interest (the first sketch after this list shows the kind of checks this implies).
- Learn the assumptions behind the formulas you use, and test those assumptions to find the right thing to do. Thou shalt only use math formulas when you know their limitations. Having a good grasp of when formulas don’t work is essential. For instance, say the task is to describe a distribution. Someone may reach for the mean and standard deviation. But we know that sufficient statistics vary by distribution; for a binomial, it may just be p. A checklist for “describing” a variable can be (the second sketch after this list turns it into code):
  - Check skew by plotting: averages are most useful when the distribution is symmetric and lots of observations are close to the mean. If it is skewed, you may want to describe various percentiles instead.
  - Check how many values are missing and what explains the missingness.
  - Check for unusual values and what explains the ‘unusual’ values.
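To make the data-verification point concrete, here is a minimal sketch, assuming the data live in a pandas DataFrame with hypothetical `revenue`, `signup_date`, and `first_purchase_date` columns; the specific checks always depend on what the variable is supposed to measure.

```python
import pandas as pd

def verify_revenue(df: pd.DataFrame, reported_total: float) -> None:
    """Basic checks that a 'revenue' column plausibly captures revenue."""
    rev = df["revenue"]
    print("rows:", len(rev), "missing:", int(rev.isna().sum()))
    print("negative values:", int((rev < 0).sum()))       # revenue should rarely be negative
    print("column total vs. reported total:",
          rev.sum(), "vs.", reported_total)                # cross-check against an external source

def verify_dates(df: pd.DataFrame) -> None:
    """Signups should not happen after the first purchase."""
    bad = df[df["signup_date"] > df["first_purchase_date"]]
    print("rows where signup follows first purchase:", len(bad))

# Hypothetical usage with made-up data.
df = pd.DataFrame({
    "revenue": [120.0, 95.5, None, -3.0],
    "signup_date": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-05"]),
    "first_purchase_date": pd.to_datetime(["2021-01-10", "2021-01-15", "2021-03-02", "2021-04-06"]),
})
verify_revenue(df, reported_total=310.0)
verify_dates(df)
```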
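And here is a rough sketch of the describe-a-variable checklist, again in pandas, using a hypothetical right-skewed `spend` variable; the 3-MAD cutoff for flagging unusual values is illustrative, not canonical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def describe_variable(s: pd.Series) -> None:
    """Walk through the checklist: skew, percentiles, missing values, unusual values."""
    # 1. Check skew by plotting; the mean is most useful when the distribution is symmetric.
    s.dropna().hist(bins=30)
    plt.title(f"Distribution of {s.name}")
    plt.show()
    print("skewness:", round(s.skew(), 2))

    # 2. If skewed, report percentiles instead of leaning on the mean alone.
    print(s.quantile([0.05, 0.25, 0.5, 0.75, 0.95]))

    # 3. How many values are missing, and what explains them?
    print("missing:", int(s.isna().sum()), "of", len(s))

    # 4. Flag 'unusual' values for a closer look (here: more than 3 MADs from the median).
    med = s.median()
    mad = (s - med).abs().median()
    if mad > 0:
        print("unusual values:\n", s[(s - med).abs() > 3 * mad])

# Hypothetical usage with a right-skewed variable.
rng = np.random.default_rng(0)
spend = pd.Series(rng.lognormal(mean=0, sigma=1, size=1000), name="spend")
describe_variable(spend)
```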