Self-Recommending: The Origins of Personalization

6 Jul

Recommendation systems are ubiquitous. They determine what videos and news you see. If asked about the origins of personalization, my hunch is that some of us will pin them to the advent of the Netflix Prize. Wikipedia goes further back—it puts the first use of the term in 1990. But the history of personalization is much older. It is at least as old as heterogeneous treatment effects. I don’t know for how long we have known about heterogeneous treatment effects but it can be no later than 1957 (Cronbach and Goldine Gleser, 1957).  

Here’s Ed Haertel:

“I remember some years ago when NetFlix founder Reed Hastings sponsored a contest (with a cash prize) for data analysts to come up with improvements to their algorithm for suggesting movies subscribers might like, based on prior viewings. (I don’t remember the details.) A primitive version of the same problem, maybe just a seed of the idea, might be discerned in the old push in educational research to identify “aptitude-treatment interactions” (ATIs). ATI research was predicated on the notion that to make further progress in educational improvement, we needed to stop looking for uniformly better ways to teach, and instead focus on the question of what worked for whom (and under what conditions). Aptitudes were conceived as individual differences in preparation to profit from future learning (of a given sort). The largely debunked notion of “learning styles” like a visual learner, auditory learner, etc., was a naïve example. Treatments referred to alternative ways of delivering instruction. If one could find a disordinal interaction, such that one treatment was optimum for learners in one part of an aptitude continuum and a different treatment was optimum in another region of that continuum, then one would have a basis for differentiating instruction. There are risks with this logic, and there were missteps and misapplications of the idea, of course. Prescribing different courses of instruction for different students based on test scores can easily lead to a tracking system where high performing students are exposed to more content and simply get further and further ahead, for example, leading to a pernicious, self-fulfilling prophecy of failure for those starting out behind. There’s a lot of history behind these ideas. Lee Cronbach proposed the ATI research paradigm in a (to my mind) brilliant presidential address to the American Psychological Association, in 1957. 
In 1974, he once again addressed the American Psychological Association, on the occasion of receiving a Distinguished Contributions Award, and in effect said the ATI paradigm was worth a try but didn’t work as it had been conceived. (That address was published in 1975.)”

This episode reminded me of the “longstanding principle in statistics, which is that, whatever you do, somebody in psychometrics already did it long before. I’ve noticed this a few times.”

Reading Cronbach today is also sobering in a way. It shows how ad hoc both the investigation of theories and the search for the right policy interventions were.

Interacting With Human Decisions

29 Jun

In sport, as in life, luck plays a role. For instance, in cricket, there is a toss at the start of the game. And the team that wins the toss wins the game 3% more often. The estimate of the advantage from winning the toss, however, is likely an underestimate of the maximum potential benefit of winning the toss. The team that wins the toss gets to decide whether to bat or bowl first. And 3% reflects the maximum benefit only when the team that won the toss chooses optimally.

The same point applies to estimates of heterogeneity. Say that you estimate how the probability of winning varies by the decision to bowl or bat first after winning the toss. (The decision to bowl or bat first is made before the toss.) And say, 75% of the time, the team that wins the toss chooses to bat first and wins these games 55% of the time. 25% of the time, teams decide to bowl first and win about 47% of these games. The winning rates of 55% and 47% would likely be higher still if the teams chose optimally.
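The numbers in this example tie back to the 3% figure. Here is a minimal sketch of the arithmetic, assuming every game has a winner (so the baseline without the toss is 50%); the variable names are mine:

```python
# Shares and win rates for the toss winner, taken from the example in the text.
share_bat, win_rate_bat = 0.75, 0.55
share_bowl, win_rate_bowl = 0.25, 0.47

# Overall win rate of the toss winner is the share-weighted average.
overall_win_rate = share_bat * win_rate_bat + share_bowl * win_rate_bowl

# Advantage over a coin flip (assumes no ties or abandoned games).
toss_advantage = overall_win_rate - 0.5

print(f"{overall_win_rate:.1%}")  # 53.0%
print(f"{toss_advantage:.1%}")    # 3.0%
```

The 3% is an average over decisions as they were actually made. If some of the decisions to bowl were mistakes, the advantage under optimal decisions would be larger.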

In the absence of other data, heterogeneous treatment effects give clear guidance on where the payoffs are higher. For instance, if you find that showing an ad on Chrome has a larger treatment effect, barring other information (and concerns), you may want to only show ads to people who use Chrome to increase the treatment effect. But the decision to bowl or bat first is not a traditional “covariate.” It is a dummy that captures the human judgment about pre-match observables. The interpretation of the interaction term thus needs care. For instance, in the example above, the winning percentage of 47% for teams that decide to bowl first looks ‘wrong’—how can the team that wins the toss lose more often than win in some cases? Easy. It can happen because the team decides to bowl in cases where the probability of winning is lower than 47%. Or it can be that the team is making a bad decision when opting to bowl first. 

Solving Problem Solving: Meta Skills For Problem Solving

21 Jun

Each problem is new in different ways. And mechanically applying specialized tools often doesn’t take you far. So beyond specialized tools, you need meta-skills.

The top meta-skill is learning. Immersing yourself in the area you are thinking about will help you solve problems better and quicker. Learning more broadly helps as well—it enables you to connect dots arrayed in unusual patterns.

Only second to learning is writing. Writing works because it is an excellent tool for thinking. Humans have limited memories, finite processing capacity, are overconfident, and are subject to ‘passions’ of the moment that occlude thinking. Writing reduces the malefic effects of these deficiencies.

By incrementally writing things down, you no longer have to store everything in the brain. Having a written copy also means that you can repeatedly go over the contents, which makes focusing on each of the points easier. But having something written also means you can ‘scan’ more quickly. Writing down, thus, also allows you to mix and match and form new combinations more easily.

Just as writing overcomes some of the limitations of our memory, it also improves our computational power. Writing allows us to overcome finite processing capacity by spreading the computation over time—run Intel 8088 for a long time, and you can solve reasonably complex problems.

Not all writing, however, will reduce overconfidence or overcome fuzzy thinking. For that, you need to write with the aim of genuine understanding and have enough humility, skepticism, motivation, and patience to see what you don’t know, learn what you don’t know, and apply what you have learned.

To make the most of writing, spread the writing over time. By distancing yourself from ‘passions’ of the moment—egoism, being enamored with an idea, etc.—you can see more clearly. So spread writing over time to see your words with a ‘fresh pair of eyes.’

The third meta-skill is talking. Just as writing is not transcribing, talking is not recitation. If you don’t speak, some things will remain unthought. So speak to people. And there is no better set of people to talk to than a diverse set of others, people who challenge your implicit assumptions and give you new ways to think about a problem.

There are tricks to making discussions more productive. The first is separating discussions of problems from solutions and separating discussions about alternate solutions from discussions about which solution is better. There are compelling reasons behind the suggestion. If you kludge discussions of problems with solutions, people are liable to confuse unworkable solutions with problems. The second is getting opinions from the least powerful first—they are liable to defer to the more powerful. The third is keeping the tenor of discussion as “intellectual pursuit of truth,” where getting it right is the only aim.

The fourth meta-skill, implicit in the third meta-skill but a separate skill, is relying on others. How we overcome our limitations is by relying on others. Knowing how to ask for help is an important skill. Find ways to get help: ask people to read what you have written and offer comments, ask them why you are wrong and how they would solve the problem, and ask them to point you to literature, other people, etc.

99 Problems: How to Solve Problems

7 Jun

“Data is the new oil,” according to Clive Humby. But we have yet to build an engine that uses the oil efficiently and doesn’t produce a ton of soot. Using data to discover and triage problems is especially polluting. Working with data for well over a decade, I have learned some tricks that produce less soot and more light. Here’s a synopsis of a few things that I have learned.

  1. Is the Problem Worth Solving? There is nothing worse than solving the wrong problem. You spend time and money and get less than nothing in return—you squander the opportunity to solve the right problem. So before you turn to solutions, find out if the problem is worth solving.

    To illustrate the point, let’s follow Goji. Goji runs a delivery business. Goji’s business has an apparent problem. The company’s couriers have a habit of delivering late. At first blush, it seems like a big problem. But is it? To answer that, one good place to start is by quantifying how late the couriers arrive. Let’s say that most couriers arrive within 30 minutes of the appointment time. It seems promising but we still can’t tell whether it is good or bad. To find out, we could ask the customers. But asking customers is a bad idea. Even if the customers don’t care about their deliveries running late, it doesn’t cost them a dime to say that they care. Finding out how much they care is better. Find out the least amount of money the customers will happily accept as compensation for you running 30 minutes late to the delivery. It may turn out that most customers don’t care—they will happily accept some trivial amount as compensation for a late delivery. Or it may turn out that customers only care when you deliver frozen or hot food. This still doesn’t give you the full picture. To get yet more clarity on the size of the problem, check how your price-adjusted quality compares to other companies.

    Misestimating what customers will pay for something is just one of the ways to the wrong problem. Often, the apparent problem is merely an artifact of the measurement error. For instance, it may be that we think the couriers arrive late because our mechanism for capturing arrival is imperfect—couriers deliver on time but forget to tap the button acknowledging they have delivered. Automated check-in based on geolocation may solve the problem. Or incentivizing couriers to be prompt may solve it. But either way, the true problem is not late arrivals but mismeasurement.

    Wrong problems can be found in all parts of problem-solving. During software development, for instance, “[p]rogrammers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs,” according to Donald Knuth. (Knuth called the tendency “premature optimization.”) Worse, Knuth claims that “these attempts at efficiency actually ha[d] a strong negative impact” on how maintainable the code is.

    Often, however, you are not solving the wrong problem. You are just solving it at the wrong time. The conventional workflow of problem-solving is discovery, estimating opportunity, estimating investment, prioritizing, execution, and post-execution discovery, where you begin again. To find out what to focus on now, you need to get as far as prioritization. There are some rules of thumb, however, that can help you triage. 1. Fix upstream problems before downstream problems. The fixes upstream may make the downstream improvements moot. 2. Diff the investment and returns based on optimal future workflow. If you don’t do that, you are committing to scrapping later a lot of what you build today. 3. Even on the best day, estimating the return on investment is a single decimal science. 4. You may find that there is no way to solve the problem with the people you have.
  1. MECE: Management consultants swear by it, so it can’t be a good idea. Right? It turns out that it is. Relentlessly working to pare down the problem into independent parts is among the most important tricks of the trade. Let’s see it in action. After looking at the data, Goji finds that arriving late is a big problem. So you know that it is the right problem but don’t know why your couriers are failing. You apply MECE. You reason that it could be because you have ‘bad’ couriers. Or because you are setting good couriers up for failure. These mutually exclusive, collectively exhaustive parts can be broken down further. In fact, I think there is a law: the number of independent parts that a problem can be pared down into is always one more than you think it is. Here, for instance, you may be setting couriers up to fail by giving them too little lead time or by not providing them precise directions. If you go down yet another layer, the short lead time may be a result of you taking too long to start looking for a courier or because it takes you a long time to find the right courier. So on and so forth. There is no magic to this. But there is no science to it either. MECE tells you what to do but not how to do it. We discuss how to in subsequent points.

  2. Funnel or the Plinko: The layered approach to MECE reminds most data scientists of the ‘funnel.’ Start with 100% and draw your Sankey diagram, popularized by Minard’s Napoleon goes to Russia.

    Funnels are powerful tools capturing two important aspects: how much do we lose in each step, and where the losses come from. There is, however, one limitation of funnels—the need for categorical variables. When you have continuous variables, you need to decide smartly about how to discretize. Following the example we have been using, the heads-up we give to our couriers to pick something and deliver to the customer is one such continuous variable. Rather than break it into arbitrarily granular chunks, it is better to plot how lateness varies by lead time and then categorize at places where the slope changes dramatically.

    There are three things to be cautious about when building and using funnels. The first is that funnels treat correlation as causation. The second is Simpson’s paradox which deals with issues of aggregation in observational data. And the third is how coarseness of the funnel can lead to mistaken inferences. For instance, you may not see the true impact of having too little time to find a courier because you raise the prices where you have too little time.

  3. Systemic Thinking: It pays to know how the cookie is baked. Learn how the data flows through the system and what decisions we make at what point with what data and what assumptions to what purpose. The conventional tools are flow chart and process tracing. Keeping with our example, say we have a system that lets customers know when we are running late. And let’s assume that not only do we struggle to arrive on time, we also struggle to let people know when we are running late. An engineer may split the problem into an issue with detection or an issue with communication. The detection system may be made up of measuring where the courier is and estimating the time it takes to get to the destination. And either may be broken. And communication issues may stem from problems with sending emails or issues with delivery, e.g., email being flagged as spam.

  4. Sample Failures: One way to diagnose problems is to look at a few examples closely. This is a good way to understand what could go wrong. For instance, it may allow you to discover that the locations you are getting from the couriers are wrong because the locations received a minute apart are hundreds of miles apart. This can then lead you to the diagnosis that your application is installed on multiple devices, and you cannot distinguish between data emitted by various devices.

  5. Worst Case: When looking at examples, start with the worst errors. The intuition is simple: worst errors are often the sites for obvious problems.

  6. Correlation is causation. To gain more traction, compare the worst with the best. Doing that allows you to see what is different between the two. The underlying idea is, of course, treating correlation as causation. And that is a famous warning. But often enough, correlation points in the right direction.

  7. Exploit the Skew: The Pareto principle—the 80/20 rule—holds in many places. Look for it. Rather than solve the entire pie, check if the opportunity is concentrated in small places. It often is. Pursuing our example above, it could be that a small proportion of our couriers account for a majority of the late deliveries. Or it could be that a small number of incorrect addresses are causing most of our late deliveries by waylaying couriers.

  8. Under good conditions, how often do we fail? How do you know how big of an issue a particular problem is? Say, for instance, you want to learn how big a role bad location data plays in our ability to notify. To do that, you should filter to cases where you have great location data and then see how well you can do. And then figure out the proportion of cases where we have great location data.

  9. Dr. House: The good doctor was a big believer in differential diagnosis. Dr. House often eliminated potential options by evaluating how patients responded to different treatment regimens. For instance, he would put people on an antibiotic course to eliminate infection as an option. The more general strategy is experimentation: learn by doing something.

    Experimentation is a sine qua non where people are involved. The impact of code is easy to simulate. But we cannot answer how much paying $10 per on-time delivery will increase on-time delivery. We need to experiment.
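Three of the diagnostics in the list above (sample failures, exploiting the skew, and failure rates under good conditions) can be sketched in a few lines of Python. This is an illustration under assumptions, not a description of any real system: the function names, the record layouts, and the 100 mph speed threshold are all hypothetical.

```python
import math
from collections import Counter

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def implausible_jumps(pings, max_mph=100.0):
    """Sample failures: flag consecutive pings that imply an impossible speed.

    `pings` is a time-sorted list of (seconds, lat, lon) for one courier.
    The threshold `max_mph` is an assumed cutoff, not a measured one.
    """
    flagged = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(pings, pings[1:]):
        hours = (t1 - t0) / 3600
        if hours > 0 and haversine_miles(lat0, lon0, lat1, lon1) / hours > max_mph:
            flagged.append((t0, t1))
    return flagged


def share_from_top(late_courier_ids, top_fraction=0.2):
    """Exploit the skew: share of late deliveries due to the worst couriers.

    `late_courier_ids` has one courier id per late delivery, so couriers
    with no late deliveries are not counted when picking the top fraction.
    """
    counts = sorted(Counter(late_courier_ids).values(), reverse=True)
    k = max(1, math.ceil(top_fraction * len(counts)))
    return sum(counts[:k]) / sum(counts)


def failure_under_good_conditions(records):
    """Return (share of cases with good data, failure rate among those cases).

    `records` is a list of (has_good_location_data, failed) boolean pairs.
    """
    good = [failed for has_good, failed in records if has_good]
    return len(good) / len(records), sum(good) / len(good)


# Two pings a minute apart, one in New York and one in Los Angeles.
pings = [(0, 40.7128, -74.0060), (60, 34.0522, -118.2437)]
print(implausible_jumps(pings))  # [(0, 60)]
```

The second number returned by `failure_under_good_conditions` is the failure rate that remains even with great location data. The first tells you how often you are in that good regime to begin with.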

Trump Trumps All: Coverage of Presidents on Network Television News

4 May

With Daniel Weitzel.

The US government is a federal system, with substantial domains reserved for local and state governments. For instance, education, most parts of the criminal justice system, and a large chunk of regulation are under the purview of the states. Further, the national government has three co-equal branches: legislative, executive, and judicial. Given these facts, you would expect news coverage to be broad in its coverage of branches and the levels of government. But there is a sharp skew in news coverage of politicians, with members of the executive branch, especially national politicians (and especially the president), covered far more often than other politicians (see here). Exploiting data from Vanderbilt Television News Archive (VTNA), the largest publicly available database of TV news—over 1M broadcast abstracts spanning 1968 to 2019—we add body to the observation. We searched for references to the president during their presidency and coded each hit as 1. As the figure below shows, references to the president are common. Excluding Trump, on average, a sixth of all abstracts contain a reference to the sitting president. But Trump is different: 60%(!) of abstracts refer to Trump.

Data and scripts can be found here.

Trading On Overconfidence

2 May

In Thinking, Fast and Slow, Kahneman recounts a time when Thaler, Amos, and he met a senior investment manager in 1984. Kahneman asked, “When you sell a stock, who buys it?”

“[The investor] answered with a wave in the vague direction of the window, indicating that he expected the buyer to be someone else very much like him. That was odd: What made one person buy, and the other person sell? What did the sellers think they knew that the buyers did not? [gs: and vice versa.]”

“… It is not unusual for more than 100M shares of a single stock to change hands in one day. Most of the buyers and sellers know that they have the same information; they exchange the stocks primarily because they have different opinions. The buyers think the price is too low and likely to rise, while the sellers think the price is high and likely to drop. The puzzle is why buyers and sellers alike think that the current price is wrong. What makes them believe they know more about what the price should be than the market does? For most of them, that belief is an illusion.”

Thinking, Fast and Slow. Daniel Kahneman

Note: Kahneman is not just saying that buyers and sellers have the same information but that they also know they have the same information.

There is a 1982 counterpart to Kahneman’s observation in the form of Paul Milgrom and Nancy Stokey’s paper on the No-Trade Theorem. “[If] [a]ll the traders in the market are rational, and thus they know that all the prices are rational/efficient; therefore, anyone who makes an offer to them must have special knowledge, else why would they be making the offer? Accepting the offer would make them a loser. All the traders will reason the same way, and thus will not accept any offers.”

From Lives Lost to Years Lost

2 Apr

The mortality rate is puzzling to mortals. A better number is the expected number of years lost. (A yet better number would be quality-adjusted years lost.) To make it easier to calculate the expected years lost, Suriyan and I developed a Python package that uses the SSA actuarial data and life table to estimate the expected years lost.

We illustrate the use of the package by estimating the average number of years by which people’s lives are shortened due to coronavirus. Using data from Table 1 of the paper that gives us the distribution of ages of people who died from COVID-19 in China, with conservative assumptions (assuming the gender of the dead person to be male, taking the middle of age ranges) we find that people’s lives are shortened by about 11 years on average. These estimates are conservative for one additional reason: the people who die likely have lower expected longevity than others of the same age. And note that given that the bulk of the deaths are among older people, who tend to be more infirm, the quality-adjusted years lost are likely more modest still. Given that the last life tables from China are from 1981 and given life expectancy in China has risen substantially since then (though most gains come from reductions in childhood mortality, etc.), we exploit the recent data from the US, assuming as-if people have the same life tables as Americans. Using the most recent SSA data, we find the number to be 16. Compare this to deaths from road accidents, the modal cause of death among the 5-24 and 25-44 age groups in the US. Assuming everyone who dies from a traffic accident is a man, and assuming the age of death to be 25, we get ~52 years, roughly 3x as large as the number for coronavirus.
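The calculation behind these numbers is a death-weighted average of remaining life expectancy. Below is a minimal sketch of that logic. It is not the package's interface: the function name is hypothetical, and the ages, death counts, and life-expectancy values are made-up placeholders, not SSA or COVID-19 figures.

```python
def expected_years_lost(deaths_by_age, remaining_life_expectancy):
    """Average years lost per death.

    deaths_by_age: {age at death: number of deaths}; for age ranges, use the
        midpoint of the range, as the text does.
    remaining_life_expectancy: {age: expected remaining years of life at
        that age}, as a life table would provide.
    """
    total_deaths = sum(deaths_by_age.values())
    total_years = sum(
        count * remaining_life_expectancy[age]
        for age, count in deaths_by_age.items()
    )
    return total_years / total_deaths


# Placeholder inputs for illustration only (NOT real data).
toy_deaths = {25: 1, 65: 3, 85: 6}
toy_life_table = {25: 50.0, 65: 18.0, 85: 6.0}

print(expected_years_lost(toy_deaths, toy_life_table))  # 14.0
```

Swapping in a different life table (for example, by gender or by country) changes only the second argument, which is why the estimate moves when US tables are used in place of the 1981 Chinese tables.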

The Puzzle of Price Dispersion on Amazon

29 Mar

Price dispersion is an excellent indicator of transactional frictions. It isn’t that absent price dispersion, we can confidently say that frictions are negligible. Frictions can be substantial even when price dispersion is zero. For instance, if the search costs are high enough that it makes it irrational to search, all the sellers will price the good at the buyer’s Willingness To Pay (WTP). Third world tourist markets, which are full of hawkers selling the same thing at the same price, are good examples of that. But when price dispersion exists, we can be reasonably sure that there are frictions in transacting. This is what makes the existence of substantial price dispersion on Amazon compelling.

Amazon makes price discovery easy, controls some aspects of quality by kicking out sellers who don’t adhere to its policies and provides reasonable indicators of quality of service with its user ratings. But still, on nearly all items that I looked at, there was substantial price dispersion. Take, for instance, the market for a bottle of Nature Made B12 vitamins. Prices go from $8.40 to nearly $30! With taxes, the dispersion is yet greater. If the listing costs are non-zero, it is not immediately clear why sellers selling the product at $30 are in the market. It could be that the expected service quality for the $30 seller is higher except that between the highest price seller and the next highest price seller, the ratings of the highest price seller are lower (take a look at shipping speed as well). And I would imagine that the ratings (and the quality) of Amazon, which comes in with the lowest price, are the highest. More generally, I have a tough time thinking about aspects of service and quality that are worth so much that the range of prices goes from 1x to 4x for a branded bottle of vitamin pills.

One plausible explanation is that the lowest price seller has a non-zero probability of being out of stock. And the more expensive and worse-quality sellers are there to catch these low probability events. They set a price that is profitable for them. One way to think about it is that the marginal cost of additional supply rises in the way the listed prices show. If true, then there seems to be an opportunity to make money. And it is possible that Amazon is leaving money on the table.

p.s. Sales of the boxed set of Harry Potter show a similar pattern.

It Pays to Search

28 Mar

In Reinventing the Bazaar, John McMillan discusses how search costs affect the price the buyer pays. John writes:

“Imagine that all the merchants are quoting $1[5]. Could one of them do better by undercutting this price? There is a downside to price-cutting: a reduction in revenue from any customers who would have bought from this merchant even at the higher price. If information were freely available, the price-cutter would get a compensating boost in sales as additional customers flocked in. When search costs exist, however, such extra sales may be negligible. If you incur a search cost of 10 cents or more for each merchant you sample, and there are fifty sellers offering the urn, then even if you know there is someone out there who is willing to sell it at cost, so you would save $5, it does not pay you to look for him. You would be looking for a needle in a haystack. If you visited one more seller, you would have a chance of one in fifty of that seller being the price-cutter, so the return on average from that extra price quote would be 10 cents (or $5 multiplied by 1/50), which is the same as your cost of getting one more quote. It does not pay to search.”

Reinventing the Bazaar, John McMillan

John got it wrong. It pays to search. The cost and the expected payoff for the first quote is 10 cents. But if the first quote is $15, the expected payoff for the second quote—(1/49)*$5, or about 10.2 cents—is greater than 10 cents. And so on.

Another way to solve for it is to work out the expected number of quotes you need before you reach the seller selling at $10. It is about 25. Given that you need to spend about $2.50 on average to get a benefit of $5, you will gladly search.

Yet another way to think is that the worst case is that you make no money—when the $10 seller is the last one you get a quote from. But in every other case, you make money.
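The three arguments above can be checked with exact arithmetic. The sketch below assumes that the buyer samples sellers at random without repeats, that exactly one of the fifty sells at $10, and that the buyer keeps going until that seller is found; the function names are mine.

```python
from fractions import Fraction

N_SELLERS = 50                     # sellers in the market (from the quote)
SAVING = Fraction(5)               # dollars saved by finding the price cutter
COST_PER_QUOTE = Fraction(1, 10)   # search cost per quote, in dollars


def payoff_of_next_quote(quotes_so_far):
    """Expected saving from one more quote if no price cutter has been found."""
    sellers_left = N_SELLERS - quotes_so_far
    return SAVING / sellers_left


def expected_quotes_needed(n_sellers=N_SELLERS):
    """Mean position of the price cutter when every position is equally likely."""
    return Fraction(n_sellers + 1, 2)


def net_gain(quotes):
    """Saving minus total search cost when the price cutter is quote number `quotes`."""
    return SAVING - COST_PER_QUOTE * quotes


print(float(payoff_of_next_quote(0)))  # 0.1, equal to the search cost
print(payoff_of_next_quote(1) > COST_PER_QUOTE)  # True
print(float(expected_quotes_needed()))  # 25.5
print(float(net_gain(N_SELLERS)))  # 0.0, the worst case
```

Under these assumptions, the expected search cost is 25.5 quotes at 10 cents each, or $2.55, against a saving of $5. The payoff of the next quote only rises as the pool of unsampled sellers shrinks.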

For the equilibrium price, you need to make assumptions. But if buyers know that there is a price cutter, they will all buy from him. This means that the price cutter will be the only seller remaining.

There are two related fun points. First, one of the reasons markets are competitive on price when true search costs are high is likely because people price their time remarkably low. Second, when people spend a bunch of time looking for the cheapest deal, they incentivize all the sellers selling at a high rate to lower their rates and make it better for everyone else.

Good News: Principles of Good Journalism

12 Mar

If fake news—deliberate disinformation, not uncongenial news—is one end of the spectrum, what is the other end of the spectrum?

To get at the question, we need a theory of what news should provide. A theory of news, in turn, needs a theory of citizenship, which prescribes the information people need to execute their role, and an empirically supported behavioral theory of how people get that information.

What a democracy expects of people varies by the conception of democracy. Some theories of democracy only require citizens to have enough information to pick the better candidate when differences in candidates are material. Others, like deliberative democracy, expect people to be well informed and to have thought through various aspects of policies.

I opt for deliberative democracy to guide expectations about people for two reasons. Not only does the theory best express the highest ideals of democracy, but it also has the virtue of answering a vital question well. If all news was equally profitable to produce and was as widely read, what kind of news would lead to the best political outcomes, as judged by idealized versions of people—people who have all the information and all the time to think through the issues?

There are two virtues of answering such a question. First, it offers a convenient place to start answering what we mean by ‘good’ news; we can bring in profitability and reader preferences later. Second, engaging with it uncovers some obvious aspects of ‘good’ news.

For news to positively affect political outcomes (not in the shallow, instrumental sense), the news has to be about politics. Rather than news about Kim Kardashian or opinions about the hottest boots this season, ‘good’ news is news about policymakers, policy implementers, and policy-relevant matters.

News about politics is a necessary but not a sufficient condition. Switching from discussing Kim Kardashian’s dress to Hillary Clinton’s is very plausibly worse. Thus, we also want the news to be substantive, engaging with real issues rather than cosmetic concerns.

Substantively engaging with real issues is still no panacea. If the information is not correct, it will misinform rather than inform the debate. Thus, the third quality of ‘good’ news is correctness.

The criterion for ‘good’ news is, however, not just correctness but the correctness of interpretation. ‘Good’ news allows people to draw the right conclusions. For instance, reporting murder rates as say ‘a murder per hour’ without reporting the actual number of murders or comparing the probability of being murdered to other common threats to life may instill greater fear in people than ‘optimal.’ (Optimal, as judged by better-informed versions of ourselves who have been given time to think. We can also judge optimal by correctness—did people form unbiased, accurate beliefs after reading the news?)

Not all issues, however, lend themselves to objective appraisals of truth. To produce ‘good’ news, the best you can do is have the right process. The primary tool that journalists have in the production of news is the sources they use to report on stories. (While journalists increasingly use original data to report, the reliance on people is widespread.) Thus, the way to increase correctness is through manipulating aspects of sources. We can increase correctness by increasing the quality of sources, e.g., source more knowledgeable people with low incentives to cook the books, increase the diversity of sources, e.g., not just government officials but also plausibly major NGOs, and the number of sources.

If we emphasize correctness, we may fall short on timeliness. News has to be timely enough to be useful, aside from being correct enough to guide policy and opinion correctly.

News can be narrowly correct but may commit sins of omission. ‘Good’ news provides information on all sides of the issue. ‘Good’ news highlights and engages with all serious claims. It doesn’t give time to discredited claims for “balance.”

Second-to-last, news should be delivered in the right tone. Rather than speculative ad-hominem attacks, ‘good’ news engages with arguments and data.

Lastly, news contributes to the public kitty only if it is original. Thus, ‘good’ news is original. (Plagiarism reduces the incentives for producing quality news because it eats into the profits.)