Superhuman: Can ML Beat Human-Level Performance in Supervised Models?

20 Dec

A supervised model cannot do better than its labels. (I revisit this point later.) So the trick is to make labels as good as you can. The errors in labels stem from three sources: 

  1. Lack of Effort: The more effort people spend labeling something, presumably, the more accurate the labels will be.
  2. Unclear Directions: Unclear directions can result from a. poorly written directions, b. conceptual issues, c. poor understanding. Let's tackle conceptual issues first. Say you are labeling the topic of news articles. Say you come across an article about how Hillary Clinton's hairstyle has evolved over the years. Should it be labeled as politics, or should it be labeled as entertainment (or my preferred label: worthless)? It depends on taste and the use case. Whatever the decision, it needs to be codified (and clarified) in the directions given to labelers. Poor writing is generally a result of inadequate effort.
  3. Hardness: Is that a 4 or a 7? We have all suffered at the hands of CAPTCHAs enough to know that some tasks are harder than others.

The fix for the first problem is obvious. To increase effort, incentivize. Incentivize by paying for correctness—measured over known-knowns—or by penalizing mistakes. And by providing feedback to people on the money they lost or how much more others with a better record made.

Solutions for unclear directions vary by the underlying problem. To address conceptual issues, incentivize people to flag (and comment on) cases where the directions are unclear and build a system to collect and review prediction errors. To figure out if the directions are unclear, quiz people on comprehension and archetypal cases. 

Can ML Performance Be Better Than Humans?

If humans label the dataset, can ML be better than humans? The first sentence of the article suggests not. Of course, we have yet to define what humans are doing. If the benchmark is labels provided by a poorly motivated and trained workforce and the model is trained on labels provided by motivated and trained people, ML can do better. The consensus label provided by a group of people will also generally be less noisy than one provided by a single person.    

Andrew Ng brings up another funny way ML can beat humans—by not learning from human labels very well. 

When training examples are labeled inconsistently, an A.I. that beats HLP on the test set might not actually perform better than humans in practice. Take speech recognition. If humans transcribing an audio clip were to label the same speech disfluency "um" (a U.S. version) 70 percent of the time and "erm" (a U.K. variation) 30 percent of the time, then HLP would be low. Two randomly chosen labelers would agree only 58 percent of the time (0.7² + 0.3²). An A.I. model could gain a statistical advantage by picking "um" all of the time, which would be consistent with the human-supplied label 70 percent of the time. Thus, the A.I. would beat HLP without being more accurate in a way that matters.

The scenario that Andrew draws out doesn't seem very plausible. But the broader point about thinking hard about cases that humans are not able to label consistently is an important one and worth building systems around.
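
The arithmetic in the quote is easy to verify. Here is a minimal sketch of the calculation, using the 70/30 split from the example:

```python
# Two randomly chosen labelers agree only if both write "um" or both write "erm".
p_um, p_erm = 0.7, 0.3
hlp = p_um**2 + p_erm**2      # 0.7^2 + 0.3^2 = 0.58

# A model that always outputs "um" matches the human-supplied label 70% of the time.
always_um_agreement = p_um    # 0.70

print(f"two-labeler agreement (HLP): {hlp:.2f}")
print(f"always-'um' model agreement: {always_um_agreement:.2f}")
```

The model "beats" HLP (0.70 vs. 0.58) without transcribing the disfluency any better.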

Too Much Churn: Estimating Customer Churn

18 Nov

A new paper uses financial transaction data to estimate customer churn in consumer-facing companies. The paper defines churn as follows:

There are three concerns with the definition:

  1. The definition doesn't make clear what the normalizing constant for calculating the share is. Given that the value "can vary between zero and one," presumably the normalizing constant is either a) the total revenue in the same year in which the customer buys products, or b) the total revenue in the year in which the firm's revenue was greater.
  2. If the denominator when calculating s_fit is the total revenue in the same year in which the customer buys products from the company, it can create a problem. Consider a customer who spends $10 in year t-k and again in year t, while the firm's revenue in those years is $20 and $10 respectively. The customer hasn't changed their behavior, but their share has gone from .5 to 1 (see the sketch after this list).
  3. Beyond this, there is a semantic point. Churn is generally used to refer to attrition. In this case, it covers both customer acquisition and attrition. It also covers both a reduction and an increase in customer spending.
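
To make the second concern concrete, here is a minimal sketch with made-up numbers: the customer's spending is constant, but the share moves because the denominator (the firm's revenue) moves.

```python
# Hypothetical numbers: the customer spends the same amount in both years,
# but the firm's total revenue shrinks between year t-k and year t.
customer_spend = {"t-k": 10, "t": 10}
firm_revenue = {"t-k": 20, "t": 10}

share = {year: customer_spend[year] / firm_revenue[year] for year in customer_spend}
print(share)  # {'t-k': 0.5, 't': 1.0} -- the share doubles with no change in behavior
```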

A Fun Aside

“Netflix similarly was not in one of our focused consumer-facing industries according to our SIC classification (it is found with two-digit SIC of 78, which mostly contains movie producers)” — this tracks with my judgment of Netflix.

Fat Or Not: Toward ‘Proper Training of DL Models’

16 Nov

A new paper introduces a DL model to enable ‘computer aided diagnosis of obesity.’ Some concerns:

  1. Better baselines: BMI is easy to calculate, and it would be useful to compare the model's results to a BMI baseline.
  2. Incorrect statement: The authors write: "the data partition in all the image sets are balanced with 50 % normal classes and 50 % obese classes for proper training of the deep learning models." Balanced classes are not a requirement for properly training deep learning models. (This ought not to affect the results reported in the paper.)
  3. Ignoring Within-Person Correlation: The paper uses data from 100 people (50 fat, 50 healthy) and takes 647 images of them (310 obese). It then uses data augmentation to expand the dataset to 2.7k images. But in doing the train/test split, there is no mention of splitting by people, which is the right thing to do.

    Start with the fact that you won't see the people in your training data again when you put the model in production. If you don't split train/test by people, the images of the people in the training set are also in the test set. This means that the test set accuracy is likely higher than it would be on a fresh sample.
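
Splitting by person is straightforward. A minimal sketch using scikit-learn's GroupShuffleSplit, with made-up arrays standing in for the image features, labels, and person IDs:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical stand-ins: 2,700 augmented images, 128 features each,
# a binary obese/normal label, and the ID of the person in each image.
X = np.random.rand(2700, 128)
y = np.random.randint(0, 2, size=2700)
person_id = np.random.randint(0, 100, size=2700)

# Hold out 20% of *people*, not 20% of images, so no person appears in both sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=person_id))

assert set(person_id[train_idx]).isdisjoint(person_id[test_idx])
```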

Casual Inference: Errors in Everyday Causal Inference

12 Aug

Why are things the way they are? What is the effect of something? Both of these reverse and forward causation questions are vital.

When I was at Stanford, I took a class with a pugnacious psychometrician, David Rogosa. David had two pet peeves, one of which was people making causal claims with observational data. And it is in David’s class that I learned the pejorative for such claims. With great relish, David referred to such claims as ‘casual inference.’ (Since then, I have come up with another pejorative phrase for such claims—cosal inference—as in merely dressing up as causal inference.)

It turns out that despite its limitations, casual inference is quite common. Here are some fashionable costumes:

  1. 7 Habits of Successful People: We have all seen business books with such titles. The underlying message of these books is: adopt these habits, and you will be successful too! Let's follow the reasoning and see where it falls apart. One stereotype about successful people is that they wake up early. And the implication is that if you wake up early, you can be successful too. It *seems* right. It agrees with folk wisdom that discomfort causes success. But can we reliably draw inferences about what less successful people should do based on what successful people do? No. For one, we know nothing about the habits of less successful people. It could be that less successful people wake up *earlier* than the more successful people. Certainly, growing up in India, I recall daily laborers waking up much earlier than people living in bungalows. And when you think about it, the claim that servants wake up before masters seems uncontroversial. It may even be routine enough to be canonized as a law—the Downton Abbey law. The upshot is that when you select on the dependent variable, i.e., only look at cases where the variable takes certain values, e.g., only look at the habits of financially successful people, even correlation is not guaranteed. This means that you don't even get to mock the claim with the jibe that "correlation is not causation."

    Let's go back to Goji's delivery service for another example. One of the 'tricks' that we had discussed was to sample failures. If you do that, you are selecting on the dependent variable. And while it is a good heuristic, it can lead you astray. For instance, let's say that most of the late deliveries are early morning deliveries. You may infer that delivering at another time may improve outcomes. Except, when you look at the data, you find that the bulk of your deliveries are in the morning. And the rate at which deliveries run late is *lower* in the early morning than during other times.

    There is a yet more famous example of things going awry when you select on the dependent variable. During World War II, statisticians were asked where armor should be added to planes. Of the aircraft that returned, the damage was concentrated in a few areas, like the wings. The top-of-head answer is to reinforce the areas hit most often. But if you think about the planes that didn't return, you get to the right answer: reinforce the areas that weren't hit. In the literature, people call this kind of error survivorship bias. But it is a problem of selecting on the dependent variable (whether or not a plane returned)—of only looking at the planes that returned.

  2. More frequent system crashes cause people to renew their software license. It is a mistake to treat correlation as causation. There are many reasons why doing so can lead you astray. The rarest is that lots of odd things are correlated in the world because of luck alone. The point is hilariously illustrated by a set of graphs showing large correlations between conceptually unrelated things, e.g., between total worldwide non-commercial space launches and the number of sociology doctorates awarded each year.

    A more common scenario is illustrated by the example in the title of this point. Commonly, there is a 'lurking' or 'confounding' variable that explains both sides. In our case, the more frequently a person uses the software, the greater the number of crashes they see. And it makes sense that the people who use the software most frequently also need it the most and renew the license most often.

    Another common but more subtle reason is Simpson's paradox. Sometimes the correlation you see is "wrong." You may see a correlation in the aggregate, but the correlation runs the opposite way when you break the data down by group. Gender bias in U.C. Berkeley admissions provides a famous example. In 1973, 44% of the men who applied to graduate programs were admitted, whereas only 35% of the women were. But when you split by department, which actually controlled admissions, women generally had a higher batting average than men. The reason for the reversal was that women applied more often to more competitive departments, like—wait for it—English, and men were more likely to apply to less competitive departments, like Engineering. None of this is to say that there isn't bias against women. It is merely to point out that the pattern in aggregated data may not hold when you split the data into relevant chunks. (A toy example of the reversal appears after this list.)

    It is also important to keep in mind the flip side of "correlation is not causation": lack of correlation does not imply a lack of causation.

  3. Mayor Giuliani brought the NYC crime rate down. There are two potential errors here:
    • Forgetting about ecological trends. Crime rates in other big US cities went down at the same time as they did in NY, sometimes more steeply. When faced with a causal claim, it is good to check how 'similar' people fared. The Difference-in-Differences estimator builds on this intuition.
    • Treating temporally proximate as causal. Say you had a headache, you took some medicine and your headache went away. It could be the case that your headache went away by itself, as headaches often do.

  4. I took this homeopathic medication and my headache went away. If the ailments are real, placebo effects are a bit mysterious. And mysterious they may be, but they are real enough. Not accounting for placebo effects misleads us into ascribing the total effect to the medicine.

  5. Shallow Causation: We ascribe more weight to immediate causes than to causes that are a few layers deeper.

  6.  Monocausation: In everyday conversations, it is common for people to speak as if x is the only cause of y.

  7. Big Causation: Another common pitfall is reading x causes y as x causes y to change a lot. This is partly a consequence of mistaking statistical significance for substantive significance, and partly a consequence of not paying close enough attention to the numbers.

  8. Same Effect: Lastly, many people take causal claims to mean that the effect is the same across people. 
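
For the Simpson's paradox point above, here is a toy example with made-up admissions counts that mirror the Berkeley pattern: within each department women are admitted at a higher rate, yet the aggregate rate for women is lower because they apply disproportionately to the more competitive department.

```python
import pandas as pd

# Made-up, illustrative numbers -- not the actual Berkeley data.
df = pd.DataFrame({
    "dept":     ["Easy", "Easy", "Hard", "Hard"],
    "gender":   ["Men", "Women", "Men", "Women"],
    "applied":  [800, 100, 200, 900],
    "admitted": [480, 65, 20, 99],
})

# Within each department, women have the higher admit rate...
by_dept = df.assign(rate=df["admitted"] / df["applied"])
print(by_dept[["dept", "gender", "rate"]])

# ...but in the aggregate, men do (0.50 vs. 0.16), because women mostly
# applied to the "Hard" department.
aggregate = df.groupby("gender")[["applied", "admitted"]].sum()
aggregate["rate"] = aggregate["admitted"] / aggregate["applied"]
print(aggregate["rate"])
```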

99 Problems: How to Solve Problems

7 Jun

“Data is the new oil,” according to Clive Humby. But we have yet to build an engine that uses the oil efficiently and doesn’t produce a ton of soot. Using data to discover and triage problems is especially polluting. Having worked with data for well over a decade, I have learned some tricks that produce less soot and more light. Here’s a synopsis of a few of them.

  1. Is the Problem Worth Solving? There is nothing worse than solving the wrong problem. You spend time and money and get less than nothing in return—you squander the opportunity to solve the right problem. So before you turn to solutions, find out if the problem is worth solving.

    To illustrate the point, let's follow Goji. Goji runs a delivery business. Goji's business has an apparent problem. The company's couriers have a habit of delivering late. At first blush, it seems like a big problem. But is it? To answer that, one good place to start is by quantifying how late the couriers arrive. Let's say that most couriers arrive within 30 minutes of the appointment time. It seems promising, but we still can't tell whether it is good or bad. To find out, we could ask the customers. But asking customers is a bad idea. Even if the customers don't care about their deliveries running late, it doesn't cost them a dime to say that they care. Finding out how much they care is better. Find out the least amount of money the customers will happily accept in lieu of the delivery running 30 minutes late. It may turn out that most customers don't care—they will happily accept some trivial amount in lieu of a late delivery. Or it may turn out that customers only care when you deliver frozen or hot food. This still doesn't give you the full picture. To get yet more clarity on the size of the problem, check how your price-adjusted quality compares to that of other companies.

    Misestimating what customers will pay for something is just one way to end up solving the wrong problem. Often, the apparent problem is merely an artifact of measurement error. For instance, it may be that we think the couriers arrive late because our mechanism for capturing arrival is imperfect—couriers deliver on time but forget to tap the button acknowledging they have delivered. Automated check-in based on geolocation may solve the problem. Or incentivizing couriers to tap promptly may solve it. But either way, the true problem is not late arrivals but mismeasurement.

    Wrong problems can be found in all parts of problem-solving. During software development, for instance, “[p]rogrammers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs,” according to Donald Knuth. (Knuth called the tendency “premature optimization.”) Worse, Knuth claims that “these attempts at efficiency actually ha[d] a strong negative impact” on how maintainable the code is.

    Often, however, you are not solving the wrong problem. You are just solving it at the wrong time. The conventional workflow of problem-solving is discovery, estimating opportunity, estimating investment, prioritizing, execution, and post-execution discovery, where you begin again. To find out what to focus on now, you need to get to prioritization. There are some rules of thumb, however, that can help you triage. 1. Fix upstream problems before downstream problems. The fixes upstream may make the downstream improvements moot. 2. Diff the investment and returns based on the optimal future workflow. If you don't do that, you are committing to scrapping later a lot of what you build today. 3. Even on the best day, estimating the return on investment is a single-decimal science. 4. You may find that there is no way to solve the problem with the people you have.
  2. MECE: Management consultants swear by it, so it can't be a good idea. Right? It turns out that it is. Relentlessly working to pare down the problem into independent parts is among the most important tricks of the trade. Let's see it in action. After looking at the data, Goji finds that arriving late is a big problem. So you know that it is the right problem but don't know why your couriers are failing. You apply MECE. You reason that it could be because you have 'bad' couriers. Or because you are setting good couriers up for failure. These mutually exclusive, collectively exhaustive parts can be broken down further. In fact, I think there is a law: the number of independent parts that a problem can be pared down into is always one more than you think it is. Here, for instance, you may be setting couriers up to fail by giving them too little lead time or by not providing them precise directions. If you go down yet another layer, the short lead time may be a result of you taking too long to start looking for a courier or because it takes you a long time to find the right courier. So on and so forth. There is no magic to this. But there is no science to it either. MECE tells you what to do but not how to do it. We discuss the how in subsequent points.

  3. Funnel or the Plinko: The layered approach to MECE reminds most data scientists of the 'funnel.' Start with 100% and draw your Sankey diagram, popularized by Minard's map of Napoleon's march on Russia.

    Funnels are powerful tools, capturing two important aspects: how much we lose at each step and where the losses come from. There is, however, one limitation of funnels—the need for categorical variables. When you have continuous variables, you need to decide smartly about how to discretize. Following the example we have been using, the heads-up we give to our couriers to pick something up and deliver it to the customer is one such continuous variable. Rather than break it into arbitrarily granular chunks, it is better to plot how lateness varies by lead time and then categorize at places where the slope changes dramatically (see the lead-time sketch after this list).

    There are three things to be cautious about when building and using funnels. The first is that funnels treat correlation as causation. The second is Simpson's paradox, which deals with issues of aggregation in observational data. And the third is how the coarseness of the funnel can lead to mistaken inferences. For instance, you may not see the true impact of having too little time to find a courier because you raise prices when you have too little time.

  4. Systemic Thinking: It pays to know how the cookie is baked. Learn how the data flows through the system and what decisions we make at what point, with what data, under what assumptions, and to what purpose. The conventional tools are flow charts and process tracing. Keeping with our example, say we have a system that lets customers know when we are running late. And let's assume that not only do we struggle to arrive on time, we also struggle to let people know when we are running late. An engineer may split the problem into an issue with detection or an issue with communication. The detection system may be made up of measuring where the courier is and estimating the time it takes to get to the destination. And either may be broken. And communication issues may stem from problems with sending emails or issues with delivery, e.g., email being flagged as spam.

  5. Sample Failures: One way to diagnose problems is to look at a few examples closely. This is a good way to understand what could go wrong. For instance, it may allow you to discover that the locations you are getting from the couriers are wrong because the locations received a minute apart are hundreds of miles apart. This can then lead you to the diagnosis that your application is installed on multiple devices, and you cannot distinguish between data emitted by various devices.

  6. Worst Case: When looking at examples, start with the worst errors. The intuition is simple: worst errors are often the sites for obvious problems.

  7. Correlation is causation. To gain more traction, compare the worst with the best. Doing that allows you to see what is different between the two. The underlying idea is, of course, treating correlation as causation. And that is a famous warning. But often enough, correlation points in the right direction.

  8. Exploit the Skew: The Pareto principle—the 80/20 rule—holds in many places. Look for it. Rather than tackle the entire pie, check if the opportunity is concentrated in a few places. It often is. Pursuing our example above, it could be that a small proportion of our couriers account for a majority of the late deliveries. Or it could be that a small number of incorrect addresses are causing most of our late deliveries by waylaying couriers. (See the concentration sketch after this list.)

  9. Under good conditions, how often do we fail? How do you know how big an issue a particular problem is? Say, for instance, you want to learn how big a role bad location data plays in your ability to notify customers. To do that, filter to cases where you have great location data and see how well you do. Then figure out the proportion of cases where you have great location data.

  10. Dr. House: The good doctor was a big believer in differential diagnosis. Dr. House often eliminated potential options by evaluating how patients responded to different treatment regimens. For instance, he would put people on an antibiotic course to eliminate infection as an option. The more general strategy is experimentation: learn by doing something.

    Experimentation is a sine qua non where people are involved. The impact of code is easy to simulate. But we cannot answer how much paying $10 per on-time delivery will increase on-time deliveries. We need to experiment.
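
Two small sketches to close out the delivery example. First, for the funnel point: rather than bucketing lead time arbitrarily, tabulate the late rate against finely binned lead time and cut where the relationship bends. This assumes a hypothetical deliveries.csv with lead_time_minutes and late columns.

```python
import pandas as pd

# Hypothetical data: one row per delivery, with the courier's lead time
# in minutes and a 0/1 flag for whether the delivery ran late.
deliveries = pd.read_csv("deliveries.csv")

# Start with fine-grained 15-minute bins, inspect how the late rate moves,
# and only then collapse to a few categories where the slope changes.
bins = list(range(0, 185, 15))
deliveries["lead_bucket"] = pd.cut(deliveries["lead_time_minutes"], bins=bins)
late_rate = deliveries.groupby("lead_bucket", observed=True)["late"].mean()
print(late_rate)
```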
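
Second, for the 'Exploit the Skew' point: a quick check of whether late deliveries are concentrated among a few couriers, using the same hypothetical table plus a courier_id column.

```python
import pandas as pd

deliveries = pd.read_csv("deliveries.csv")

# Share of all late deliveries accounted for by the worst 20% of couriers.
late_by_courier = (
    deliveries[deliveries["late"] == 1]
    .groupby("courier_id")
    .size()
    .sort_values(ascending=False)
)
top_20pct = max(1, int(0.2 * deliveries["courier_id"].nunique()))
share = late_by_courier.head(top_20pct).sum() / late_by_courier.sum()
print(f"The top 20% of couriers account for {share:.0%} of late deliveries")
```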