“Data is the new oil,” according to Clive Humby. But we have yet to build an engine that uses the oil efficiently and doesn’t produce a ton of soot. Using data to discover and triage problems is especially polluting. Working with data for well over a decade, I have learned some tricks that produce less soot and more light. Here’s a synopsis of a few things that I have learned.
- Is the Problem Worth Solving? There is nothing worse than solving the wrong problem. You spend time and money and get less than nothing in return—you squander the opportunity to solve the right problem. So before you turn to solutions, find out if the problem is worth solving.
To illustrate the point, let’s follow Goji. Goji runs a delivery business. Goji’s business has an apparent problem. The company’s couriers have a habit of delivering late. At first blush, it seems like a big problem. But is it? To answer that, one good place to start is by quantifying how late the couriers arrive. Let’s say that most couriers arrive within 30 minutes of the appointment time. It seems promising but we still can’t tell whether it is good or bad. To find out, we could ask the customers. But asking customers is a bad idea. Even if the customers don’t care about their deliveries running late, it doesn’t cost them a dime to say that they care. Finding out how much they care is better. Find out the least amount of money the customers will happily accept in lieu of you running 30 minutes to the delivery. It may turn out that most customers don’t care—they will happily accept some trivial amount in lieu of a late delivery. Or it may turn out that customers only care when you deliver frozen or hot food. This still doesn’t give you the full picture. To get yet more clarity on the size of the problem, check how your price adjusted quality compares to other companies.
Misestimating what customers will pay for something is just one of the ways to the wrong problem. Often, the apparent problem is merely an artifact of the measurement error. For instance, it may be that we think the couriers arrive late because our mechanism for capturing arrival is imperfect—couriers deliver on time but forget to tap the button acknowledging they have delivered. Automated check-in based on geolocation may solve the problem. Or incentivizing couriers to be prompt may solve it. But either way, the true problem is not late arrivals but mismeasurement.
Wrong problems can be found in all parts of problem-solving. During software development, for instance, “[p]rogrammers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs,” according to Donald Knuth. (Knuth called the tendency “premature optimization.”) Worse, Knuth claims that “these attempts at efficiency actually ha[d] a strong negative impact” on how maintainable the code is.
Often, however, you are not solving the wrong problem. You are just solving it at the wrong time. The conventional workflow of problem-solving is discovery, estimating opportunity, estimating investment, prioritizing, execution, and post-execution discovery, where you begin again. To find out what to focus on now, you need to get till prioritization. There are some rules of thumb, however, that can help you triage. 1. Fix upstream problems before downstream problems. The fixes upstream may make the downstream improvements moot. 2.diff
the investment and returns based on optimal future workflow. If you don’t do that, you are committing to scrapping later a lot of what you build today. 3. Even on the best day, estimating the return on investment is a single decimal science. 4. You may find that there is no way to solve the problem with the people you have.
- MECE: Management consultants swear by it, so it can’t be a good idea. Right? It turns out that it is. Relentlessly working to pare down the problem into independent parts is among the most important tricks of the trade. Let’s see it in action. After looking at the data, Goji finds that arriving late is a big problem. So you know that it is the right problem but don’t know why your couriers are failing. You apply MECE. You reason that it could be because you have ‘bad’ couriers. Or because you are setting good couriers up for failure. These mutually exclusive comprehensively exhaustive parts can be broken down further. In fact, I think there is a law: the number of independent parts that a problem can be pared down is always one more than you think it is. Here, for instance, you may be setting couriers up to fail by giving them too little lead time or by not providing them precise directions. If you go down yet another layer, the short lead time may be a result of you taking too long to start looking for a courier or because it takes you a long time to find the right courier. So on and so forth. There is no magic to this. But there is no science to it either. MECE tells you what to do but not how to do it. We discuss how to in subsequent points.
- Funnel or the Plinko: The layered approach to MECE reminds most data scientists of the ‘funnel.’ Start with 100% and draw your Sankey diagram, popularized by Minard’s Napolean goes to Russia.
Funnels are powerful tools capturing two important aspects: how much do we lose in each step, and where the losses come from. There is, however, one limitation of funnels—the need for categorical variables. When you have continuous variables, you need to decide smartly about how to discretize. Following the example we have been using, the heads-up we give to our couriers to pick something and deliver to the customer is one such continuous variable. Rather than break it into arbitrarily granular chunks, it is better to plot how lateness varies by lead time and then categorize at places where the slope changes dramatically.
There are three things to be cautious about when building and using funnels. The first is that funnels treat correlation as causation. The second is Simpson’s paradox which deals with issues of aggregation in observational data. And the third is how coarseness of the funnel can lead to mistaken inferences. For instance, you may not see the true impact of having too little time to find a courier because you raise the prices where you have too little time. - Systemic Thinking: It pays to know how the cookie is baked. Learn how the data flows through the system and what decisions we make at what point with what data and what assumptions to what purpose. The conventional tools are flow chart and process tracing. Keeping with our example, say we have a system that lets customers know when we are running late. And let’s assume that not only do we struggle to arrive on time, we also struggle to let people know when we are running late. An engineer may split the problem into an issue with detection or an issue with communication. The detection system may be made up of measuring where the courier is and estimating the time it takes to get to the destination. And either may be broken. And communication issues may be stem from problems with sending emails or issues with delivery, e.g., email being flagged as spam.
- Sample Failures: One way to diagnose problems is to look at a few examples closely. This is a good way to understand what could go wrong. For instance, it may allow you to discover that the locations you are getting from the couriers are wrong because the locations received a minute apart are hundreds of miles apart. This can then lead you to the diagnosis that your application is installed on multiple devices, and you cannot distinguish between data emitted by various devices.
- Worst Case: When looking at examples, start with the worst errors. The intuition is simple: worst errors are often the sites for obvious problems.
- Correlation is causation. To gain more traction, compare the worst with the best. Doing that allows you to see what is different between the two. The underlying idea is, of course, treating correlation as causation. And that is a famous warning. But often enough, correlation points in the right direction.
- Exploit the Skew: The Pareto principle—the 80/20 rule—holds in many places. Look for it. Rather than solve the entire pie, check if the opportunity is concentrated in small places. It often is. Pursuing our example above, it could be that a small proportion of our couriers account for a majority of the late deliveries. Or it could be that a small number of incorrect addresses our causing most of our late deliveries by waylaying couriers.
- Under good conditions, how often do we fail? How do you know how big of an issue a particular problem is? Say, for instance, you want to learn how big a role bad location data plays in our ability to notify. To do that, you should filter to cases where you have great location data and then see how well you can do. And then figure out the proportion of cases where we have great location data.
- Dr. House: The good doctor was a big believer in differential diagnosis. Dr. House often eliminated potential options by evaluating how patients responded to different treatment regimens. For instance, he would put people on an antibiotic course to eliminate infection as an option. The more general strategy is experimentation: learn by doing something.
Experimentation is a sine-qua-non where people are involved. The impact of code is easy to simulate. But we cannot answer how much paying $10 per on-time delivery will increase on-time delivery. We need to experiment.