Too Much Churn: Estimating Customer Churn

18 Nov

A new paper uses financial transaction data to estimate customer churn in consumer-facing companies. The paper defines churn as follows:

There are three concerns with the definition:

  1. The definition doesn’t make clear what the normalizing constant is for calculating the share. Given that the value “can vary between zero and one,” presumably the normalizing constant is either a) total revenue in the same year in which the customer buys products, or b) total revenue in the year in which the firm’s revenue was greater.
  2. If the denominator when calculating s_fit is the total revenue in the same year in which the customer buys products from the company, it can create a problem. Consider a customer who spends $10 in both year t-k and year t. And assume that the firm’s revenue in those years is $10 and $20 respectively. In this case, the customer hasn’t changed their behavior, but their share has gone from 1 to .5.
  3. Beyond this, there is a semantic point. Churn is generally used to refer to attrition. In this case, it covers both customer acquisition and attrition. It also covers both a reduction and an increase in customer spending.
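
The arithmetic behind the second concern can be checked with a minimal sketch. It assumes one reading of the paper's definition, namely that the share is the customer's spending divided by the firm's total revenue in the same year. The function name `revenue_share` is hypothetical.

```python
def revenue_share(customer_spend, firm_revenue):
    """Customer's share of firm revenue in a year (assumed definition)."""
    return customer_spend / firm_revenue

# The same customer spends $10 in both years; only firm revenue differs.
share_when_revenue_is_10 = revenue_share(10, 10)
share_when_revenue_is_20 = revenue_share(10, 20)

print(share_when_revenue_is_10, share_when_revenue_is_20)
```

The customer's spending is identical in the two calls, so any change in the share comes entirely from the denominator.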

A Fun Aside

“Netflix similarly was not in one of our focused consumer-facing industries according to our SIC classification (it is found with two-digit SIC of 78, which mostly contains movie producers)” — this tracks with my judgment of Netflix.

Not So Robust: The Limitations of “Doubly Robust” ATE Estimators

16 Nov

Doubly Robust (DR) estimators of ATE are all the rage. One popular DR estimator is Robins’ Augmented IPW (AIPW). The reason why Robins’ AIPW estimator is called doubly robust is that if either your IPW model or your y ~ x model is correctly specified, you get ATE. Great!

Calling something “doubly robust” (DR) makes you think that the estimator is robust to (common) violations of commonly made assumptions. But DR replaces one strong assumption with one marginally less strong assumption. It is common to assume that IPW or Y ~ X are right. But DR replaces either of these with the OR clause. So how common is it to get either of the models right? Basically never. If neither model is right, you multiply the bias terms. And that ought to blow up the bias.
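
To make the OR clause concrete, here is a minimal sketch of the AIPW formula written as a plain function. It takes the fitted propensities and fitted outcome predictions as given inputs; how those are fitted is left out. The toy numbers are made up, and the names `aipw_ate`, `e_hat`, `mu1_hat`, and `mu0_hat` are my own labels, not from any library.

```python
def aipw_ate(y, t, e_hat, mu1_hat, mu0_hat):
    """AIPW estimate of the ATE.

    y: outcomes, t: binary treatment indicators,
    e_hat: fitted propensities P(T=1 | X),
    mu1_hat / mu0_hat: fitted outcomes under treatment / control.
    """
    total = 0.0
    for yi, ti, ei, m1, m0 in zip(y, t, e_hat, mu1_hat, mu0_hat):
        total += (m1 - m0
                  + ti * (yi - m1) / ei               # residual correction, treated
                  - (1 - ti) * (yi - m0) / (1 - ei))  # residual correction, control
    return total / len(y)

# Made-up data in which the outcome model is exactly right, so every
# residual is zero and the (deliberately arbitrary) propensities drop out.
t = [1, 1, 0, 0]
y = [3.0, 5.0, 1.0, 3.0]
mu1 = [3.0, 5.0, 3.0, 5.0]
mu0 = [1.0, 3.0, 1.0, 3.0]
arbitrary_e = [0.9, 0.2, 0.7, 0.4]

print(aipw_ate(y, t, arbitrary_e, mu1, mu0))
```

This shows only one half of the OR clause: a correct outcome model rescues a wrong propensity model. The other half (a correct propensity model rescuing a wrong outcome model) holds in expectation rather than exactly in a small sample, and when both inputs are wrong nothing in the formula protects the estimate.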

(There is one more reason to worry about the use of the word ‘robust.’ In statistics, it is used to convey robustness to violations of distributional assumptions.)

Given the small advance in assumptions, it turns out that the results aren’t better either (and can be substantially worse):

  1. “None of the DR methods we tried … improved upon the performance of simple regression-based prediction of the missing values.” (see here.)
  2. “The methods with by far the worst performance with regard to RSMSE are the Doubly Robust (DR) approaches, whose RSMSE is two or three times as large as the RSMSE for the other estimators.” (see here and the relevant table is included below.)
From Kern et al. 2016

Some people prefer DR for efficiency. But the claim for efficiency is based on strong assumptions being met: “The local semiparametric efficiency property, which guarantees that the solution to (9) is the best estimator within its class, was derived under the assumption that both models are correct. This estimate is indeed highly efficient when the π-model is true and the y-model is highly predictive.”

p.s. When I went through some of the lecture notes posted online, I was surprised that the lecture notes explain DR as “if A or B hold, we get ATE” but do not discuss the modal case.

Instrumental Music: When It Rains, It Pours

23 Oct

In a new paper, Jon Mellon reviews 185 papers that use weather as an instrument and finds that researchers have linked 137 variables to weather. You can read it as each paper needing to contend with 136 violations of the exclusion restriction, but the situation is likely less dire. For one, weather as an instrument has many varietals. Some papers use local (both in time and space) fluctuations in the weather for identification. At the other end, some use long-range (both in time and space) variations in weather, e.g., those wrought by climate. And the variables affected by each are very different. For instance, we don’t expect long-term ‘dietary diversity’ to be affected by short-term fluctuations in the local weather. A lot of the other variables are like that. For two, the weather’s potential pathways to the dependent variable of interest are often limited. For instance, as Jon notes, it is hard to imagine how rain on election day would affect government spending any other way except through its effect on the election outcome.

There are, however, some potential general mechanisms through which the exclusion restriction could be violated. The first that Jon identifies is also among the oldest conjectures in social science research: weather’s effect on mood. Except that studies that purport to show the effect of weather on mood are themselves subject to selective response, e.g., when the weather is bad, more people are likely to be home, etc.

There are some other more fundamental concerns with using weather as an instrument. First, when there are no clear answers on how an instrument should be (ahem!) instrumented, the first stage of IV is ripe for specification search. In such cases, people probably pick the formulation that gives the largest F-stat. Weather falls firmly in this camp. For instance, there is the question of how to measure rain. Should it be the amount of rain or the duration of rain, or something else? And then there is the crudeness of the instrument: ideally, we would like to measure rain over every small unit of time and space. To create a summary measure from crude observations, we often need to make judgments, and it is plausible that judgments that lead to a larger F-stat are seen as ‘better.’

Second, for instruments that are correlated in time, we often need to make judgments to regress out longer-term correlations. For instance, as Jon points out, studies that estimate the effect of rain on voting on election day may control for long-term weather but not ‘medium-term’ weather. “However, even short-term studies will be vulnerable to other mechanisms acting at time periods not controlled for. For instance, many turnout IV studies control for the average weather on that day of the year over the previous decade. However, this does not account for the fact that the weather on election day will be correlated with the weather over the past week or month in that area. This means that medium-term weather effects will still potentially confound short-term studies.”

The concern is wider and includes some of the RD designs that measure the effect of ad exposure on voting, etc.

Rent-seeking: Why It Is Better to Rent than Buy Books

4 Oct

It has taken me a long time to realize that renting books is the way to go for most books. The frequency with which I go back to a book is so low that I don’t really see any returns on permanent possession that accrue from the ability to go back.

Renting also has the virtue of disciplining me: I rent when I am ready to read and it incentivizes me to finish the book (or graze and assess whether the book is worth finishing) before the rental period expires.

For e-books, my format of choice, buying a book is even less attractive. One reason why people buy a book is for the social returns from displaying the book on a bookshelf. E-books don’t provide that, though in time people may devise mechanisms to do just that. Another reason why people prefer buying books is that they want something ‘new.’ Once again, the concern doesn’t apply to e-books.

From a seller’s perspective, renting has the advantage of expanding the market. Sellers get money from people who would otherwise not buy the book. These people may, instead, substitute for it by copying the book, borrowing it from a friend or a library, or getting similar content elsewhere, e.g., YouTube or other (cheaper) books, or they may simply forgo reading the book.

STEMing the Rot: Does Relative Deprivation Explain Low STEM Graduation Rates at Top Schools?

26 Sep

The following few paragraphs are from Sociation Today:


Using the work of Elliot (et al. 1996), Gladwell compares the proportion of each class which gets a STEM degree compared to the math SAT at Hartwick College and Harvard University.  Here is what he presents for Hartwick:

Students at Hartwick College

STEM Majors     Top Third   Middle Third   Bottom Third
Math SAT        569         472            407
STEM degrees    55.0%       27.1%          17.8%

So the top third of students with the Math SAT as the measure earn over half the science degrees. 

    What about Harvard?   It would be expected that Harvard students would have much higher Math SAT scores and thus the distribution would be quite different.  Here are the data for Harvard:

Students at Harvard University

STEM Majors     Top Third   Middle Third   Bottom Third
Math SAT        753         674            581
STEM degrees    53.4%       31.2%          15.4%

     Gladwell states the obvious, in italics, “Harvard has the same distribution of science degrees as Hartwick,” p. 83. 

    Using his reference theory of being a big fish in a small pond, Gladwell asked Ms. Sacks what would have happened if she had gone to the University of Maryland and not Brown. She replied, “I’d still be in science,” p. 94.


Gladwell focuses on the fact that the bottom third at Harvard has about the same Math SAT scores as the top third at Hartwick, and points to the fact that the two groups graduate with STEM degrees at very different rates. It is a fine point. But there is more to the data. The top third at Harvard have much higher SAT scores than the top third at Hartwick. Why is it the case that they graduate with a STEM degree at the same rate as the top third at Hartwick? One answer is that STEM degrees at Harvard are harder. So harder coursework at Harvard (vis-a-vis Hartwick) is another explanation for the pattern we see in the data and, in fact, fits the data better as it explains the performance of the top third at Harvard.

Here’s another way to put the point: If preferences for graduating in STEM are solely and almost deterministically explained by Math SAT scores, like Gladwell implicitly assumes, and the major headwinds are because of relative standing, then we should see a much higher STEM graduation rate for the top-third at Harvard. We should ideally see an intercept shift across schools, which we don’t see, but a common differential between the top and the bottom third.

Amartya Sen on Keynes, Robinson, Smith, and the Bengal Famine

17 Aug

Sen in conversation with Angus Deaton and Tim Besley: pdf and video.

Excerpts:

On Joan Robinson

“She took a position—which has actually become very popular in India now, not coming from the left these days, but from the right—that what you have to concentrate on is simply maximizing economic growth. Once you have grown and become rich, then you can do health care, education, and all this other stuff. Which I think is one of the more profound errors that you can make in development planning. Somehow Joan had a lot of sympathy for that position. In fact, she strongly criticized Sri Lanka for offering highly subsidized food to everyone on nutritional grounds. I remember the phrase she used: ‘Sri Lanka is trying to taste the fruit of the tree without growing it.’”

Amartya Sen

On Keynes:

“On the unemployment issue I may well be, but if I compare an economist like Keynes, who never took a serious interest in inequality, in poverty, in the environment, with Pigou, who took an interest in all of them, I don’t think I would be able to say exactly what you are asking me to say.”

Amartya Sen

On the 1943 Bengal Famine, the last big famine in India in which ~ 3M people perished:

“Basically I had figured out on the basis of the little information I had (that indeed everyone had) that the problem was not that the British had the wrong data, but that their theory of famine was completely wrong. The government was claiming that there was so much food in Bengal that there couldn’t be a famine. Bengal, as a whole, did indeed have a lot of food—that’s true. But that’s supply; there’s also demand, which was going up and up rapidly, pushing prices sky-high. Those left behind in a boom economy—a boom generated by the war—lost out in the competition for buying food.”

“I learned also—which I knew as a child—that you could have a famine with a lot of food around. And how the country is governed made a difference. The British did not want rebellion in Calcutta. I believe no one of Calcutta died in the famine. People died in Calcutta, but they were not of Calcutta. They came from elsewhere, because what little charity there was came from Indian businessmen based in Calcutta. The starving people kept coming into Calcutta in search of free food, but there was really not much of that. The Calcutta people were entirely protected by the Raj to prevent discontent of established people during the war. Three million people in Calcutta had ration cards, which entailed that at least six million people were being fed at a very subsidized price of food. What the government did was to buy rice at whatever price necessary to purchase it in the rural areas, making the rural prices shoot up. The price of rationed food in Calcutta for established residents was very low and highly subsidized, though the market price in Calcutta—outside the rationing network—rose with the rural price increase.”

Amartya Sen

On Adam Smith

“He discussed why you have to think pragmatically about the different institutions to be combined together, paying close attention to how they respectively work. There’s a passage where he’s asking himself the question, Why do we strongly want a good political economy? Why is it important? One answer—not the only one—is that it will lead to high economic growth (this is my language, not Smith’s). I’m not quoting his words, but he talks about the importance of high growth, high rate of progress. But why is that important? He says it’s important for two distinct reasons. First, it gives the individual more income, which in turn helps people to do what they would value doing. Smith is talking here about people having more capability. He doesn’t use the word capability, but that’s what he is talking about here. More income helps you to choose the kind of life that you’d like to lead. Second, it gives the state (which he greatly valued as an institution when properly used) more revenue, allowing it to do those things which only the state can do well. As an example, he talks about the state being able to provide free school education.”

Amartya Sen

Trading On Overconfidence

2 May

In Thinking, Fast and Slow, Kahneman recounts a time in 1984 when Thaler, Amos, and he met a senior investment manager. Kahneman asked, “When you sell a stock, who buys it?”

“[The investor] answered with a wave in the vague direction of the window, indicating that he expected the buyer to be someone else very much like him. That was odd: What made one person buy, and the other person sell? What did the sellers think they knew that the buyers did not? [gs: and vice versa.]”

“… It is not unusual for more than 100M shares of a single stock to change hands in one day. Most of the buyers and sellers know that they have the same information; they exchange the stocks primarily because they have different opinions. The buyers think the price is too low and likely to rise, while the sellers think the price is high and likely to drop. The puzzle is why buyers and sellers alike think that the current price is wrong. What makes them believe they know more about what the price should be than the market does? For most of them, that belief is an illusion.”

Thinking, Fast and Slow, Daniel Kahneman

Note: Kahneman is not just saying that buyers and sellers have the same information but that they also know they have the same information.

There is a 1982 counterpart to Kahneman’s observation in the form of Paul Milgrom and Nancy Stokey’s paper on the No-Trade Theorem. “[If] [a]ll the traders in the market are rational, and thus they know that all the prices are rational/efficient; therefore, anyone who makes an offer to them must have special knowledge, else why would they be making the offer? Accepting the offer would make them a loser. All the traders will reason the same way, and thus will not accept any offers.”

The Puzzle of Price Dispersion on Amazon

29 Mar

Price dispersion is an excellent indicator of transactional frictions. It isn’t that absent price dispersion, we can confidently say that frictions are negligible. Frictions can be substantial even when price dispersion is zero. For instance, if the search costs are high enough that it makes it irrational to search, all the sellers will price the good at the buyer’s Willingness To Pay (WTP). Third world tourist markets, which are full of hawkers selling the same thing at the same price, are good examples of that. But when price dispersion exists, we can be reasonably sure that there are frictions in transacting. This is what makes the existence of substantial price dispersion on Amazon compelling.

Amazon makes price discovery easy, controls some aspects of quality by kicking out sellers who don’t adhere to its policies and provides reasonable indicators of quality of service with its user ratings. But still, on nearly all items that I looked at, there was substantial price dispersion. Take, for instance, the market for a bottle of Nature Made B12 vitamins. Prices go from $8.40 to nearly $30! With taxes, the dispersion is yet greater. If the listing costs are non-zero, it is not immediately clear why sellers selling the product at $30 are in the market. It could be that the expected service quality for the $30 seller is higher except that between the highest price seller and the next highest price seller, the ratings of the highest price seller are lower (take a look at shipping speed as well). And I would imagine that the ratings (and the quality) of Amazon, which comes in with the lowest price, are the highest. More generally, I have a tough time thinking about aspects of service and quality that are worth so much that the range of prices goes from 1x to 4x for a branded bottle of vitamin pills.

One plausible explanation is that the lowest price seller has a non-zero probability of being out of stock. And the more expensive and worse-quality sellers are there to catch these low probability events. They set a price that is profitable for them. One way to think about it is that the marginal cost of additional supply rises in the way the listed prices show. If true, then there seems to be an opportunity to make money. And it is possible that Amazon is leaving money on the table.

p.s. Sales of the boxed set of Harry Potter show a similar pattern.

It Pays to Search

28 Mar

In Reinventing the Bazaar, John McMillan discusses how search costs affect the price the buyer pays. John writes:

“Imagine that all the merchants are quoting $1[5]. Could one of them do better by undercutting this price? There is a downside to price-cutting: a reduction in revenue from any customers who would have bought from this merchant even at the higher price. If information were freely available, the price-cutter would get a compensating boost in sales as additional customers flocked in. When search costs exist, however, such extra sales may be negligible. If you incur a search cost of 10 cents or more for each merchant you sample, and there are fifty sellers offering the urn, then even if you know there is someone out there who is willing to sell it at cost, so you would save $5, it does not pay you to look for him. You would be looking for a needle in a haystack. If you visited one more seller, you would have a chance of one in fifty of that seller being the price-cutter, so the return on average from that extra price quote would be 10 cents (or $5 multiplied by 1/50), which is the same as your cost of getting one more quote. It does not pay to search.”

Reinventing the Bazaar, John McMillan

John got it wrong. It pays to search. The cost and the expected payoff for the first quote are both 10 cents. But if the first quote is $15, the expected payoff for the second quote, (1/49)*$5, is greater than 10 cents. And so on.

Another way to solve for it is to come up with the expected number of quotes you need to reach the seller selling at $10. It is about 25 (25.5, to be exact, if you sample sellers at random without repeats). Given you need to spend about $2.55 on average to save $5, you will gladly search.

Yet another way to think is that the worst case is that you make no money—when the $10 seller is the last one you get a quote from. But in every other case, you make money.
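
The numbers can be checked with a minimal sketch. It assumes the setup in the quote (fifty sellers, 10 cents per quote, one price cutter who saves you $5) plus two assumptions of mine: the buyer samples sellers at random without repeats and keeps going until the price cutter is found. All names are hypothetical.

```python
from fractions import Fraction

N_SELLERS = 50                    # sellers in the market (from the quote)
SEARCH_COST = Fraction(10, 100)   # dollars per quote
SAVING = Fraction(5)              # dollars saved by finding the price cutter

def marginal_payoff(quotes_so_far):
    """Expected saving from one more quote, given that the first
    `quotes_so_far` sellers all quoted the high price."""
    remaining = N_SELLERS - quotes_so_far
    return SAVING / remaining

# With random sampling without repeats, the price cutter is equally likely
# to be at any position 1..N, so the expected number of quotes is (N + 1) / 2.
expected_quotes = Fraction(N_SELLERS + 1, 2)
expected_cost = expected_quotes * SEARCH_COST
expected_net_gain = SAVING - expected_cost

print(float(marginal_payoff(0)), float(marginal_payoff(1)))
print(float(expected_quotes), float(expected_cost), float(expected_net_gain))
```

The marginal payoff equals the search cost only for the very first quote and rises with every miss, which is why stopping after one quote is not the right comparison.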

For the equilibrium price, you need to make assumptions. But if buyers know that there is a price cutter, they will all buy from him. This means that the price cutter will be the only seller remaining.

There are two related fun points. First, one of the reasons markets are competitive on price when true search costs are high is likely that people price their time remarkably low. Second, when people spend a bunch of time looking for the cheapest deal, they incentivize all the sellers selling at a high rate to lower their rates and make it better for everyone else.

Faites Attention! Dealing with Inattentive and Insincere Respondents in Experiments

11 Jul

Respondents who don’t pay attention or respond insincerely are in vogue (see the second half of the note). But how do you deal with such respondents in an experiment?

To set the context, a toy example. Say that you are running an experiment. And say that 10% of the respondents in a rush to complete the survey and get the payout don’t read the survey question that measures the dependent variable and respond randomly to it. In such cases, the treatment effect among the 10% will be centered around 0. And including the 10% would attenuate the Average Treatment Effect (ATE).

More formally, in the subject pool, there is an ATE that is E[Y(1)] – E[Y(0)].  You randomly assign folks, and under usual conditions, they render a random sample of Y(1) or Y(0), which in expectation retrieves the ATE.  But when there is pure guessing, the guess by subject i is not centered around Y_i(1) in the treatment group or Y_i(0) in the control group.  Instead, it is centered on some other value that is altogether unresponsive to treatment. 
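
A small simulation makes the attenuation visible. This is a sketch under made-up assumptions: attentive respondents have a true effect of 1 on a normally distributed outcome, and the inattentive 10% answer with noise centered on 0.5 regardless of assignment. None of these numbers come from a real study.

```python
import random

random.seed(0)

N = 20000              # respondents per arm (assumed)
TRUE_EFFECT = 1.0      # assumed effect among attentive respondents
P_INATTENTIVE = 0.10   # share responding at random (from the toy example)

def response(treated):
    if random.random() < P_INATTENTIVE:
        # Inattentive: the answer is noise unrelated to treatment.
        return random.gauss(0.5, 1.0)
    # Attentive: the outcome shifts by the true effect under treatment.
    return random.gauss(TRUE_EFFECT if treated else 0.0, 1.0)

treated = [response(True) for _ in range(N)]
control = [response(False) for _ in range(N)]

estimate = sum(treated) / N - sum(control) / N
print(round(estimate, 2))
```

In expectation the difference in means is (1 - 0.10) times the true effect, so the estimate should land near 0.9 rather than 1.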

Now that we understand the consequences of inattention, how do we deal with it?

We could deal with inattentive responding under compliance, but it is useful to separate compliance with the treatment protocol, which can be just picking up the phone, from the attention or sincerity with which the respondent responds to the dependent variables. On a survey experiment, compliance plausibly adequately covers both, but in cases where treatment and measurement are de-coupled, e.g., happen at different times, it is vital to separate the two.

On survey experiments, I think it is reasonable to assume that:

  1. the proportion of people paying attention is the same across the control and treatment groups, and
  2. there is no correlation between who pays attention and assignment to the control group/treatment group, e.g., it is not the case that men are inattentive in the treatment group and women in the control group.

If the assumptions hold, then the worst we get is an estimate on the attentive subset (principal stratification). To get at ATE with the same research design (and if you measure attention pre-treatment), we can post-stratify after estimating the treatment effect on the attentive subset and then re-weight to account for the inattentive group. (One potential issue with the scheme is that variables used to stratify may have a fair bit of measurement error among inattentive respondents.)
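
One way to read the re-weighting step is sketched below. It assumes that, within a stratum defined by a pre-treatment covariate, the effect estimated among attentive respondents also applies to the inattentive ones. The strata, counts, and effects are made up for illustration.

```python
# Hypothetical strata defined by a pre-treatment covariate.
effect_among_attentive = {"young": 2.0, "old": 1.0}   # estimated per stratum
attentive_counts = {"young": 300, "old": 600}
full_sample_counts = {"young": 500, "old": 500}       # attentive + inattentive

def weighted_effect(effects, counts):
    """Average the stratum effects using stratum shares as weights."""
    total = sum(counts.values())
    return sum(effects[s] * counts[s] / total for s in effects)

# Effect on the attentive subset only.
attentive_only = weighted_effect(effect_among_attentive, attentive_counts)
# Same stratum effects, re-weighted to the full sample's composition.
reweighted = weighted_effect(effect_among_attentive, full_sample_counts)

print(round(attentive_only, 3), reweighted)
```

The two numbers differ only because attentive respondents are not spread across strata the way the full sample is. If the within-stratum assumption fails, the re-weighted number is no better than the first.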

The experimental way to get at attenuation would be to manipulate attention, e.g., via incentives, after the respondents have seen the treatment but before the DV measurement has begun. For instance, see this paper.

Attenuation is one thing, proper standard errors another. People responding randomly will also lead to fatter standard errors, not just because we have fewer respondents but because as Ed Haertel points out (in personal communication):

  1. “The variance of the random responses could be [in fact, very likely is: GS] different [from] the variances in the compliant groups.”
  2. Even “if the variance of the random responses was zero, we’d get noise because although the proportions of random responders in the T and C groups are equal in expectation, they will generally not be exactly the same in any given experiment.”

What’s Best? Comparing Model Outputs

10 Mar

Let’s assume that you have a large portfolio of messages: n messages of k types. And say that there are n models, built by different teams, that estimate how relevant each message is to the user on a particular surface at a particular time. How would you rank order the messages by relevance, understood as the probability a person will click on the relevant substance of the message?

Isn’t the answer: use the max. operator as a service? Just using the max. operator can be a problem because of:

a) Miscalibrated probabilities: the probabilities being output from non-linear models are not always calibrated. A probability of .9 doesn’t mean that there is a 90% chance that people will click it.

b) Prediction uncertainty: prediction uncertainty for an observation is a function of the uncertainty in the betas and the distance from the bulk of the points we have observed. If you were to randomly draw 1,000 samples each from the estimated distribution of p, a different ordering may dominate from the one we get when we compare the means.
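
Here is a minimal sketch of the sampling idea in (b). It swaps in a simpler setup than a regression model: each message's click probability gets a Beta posterior from made-up click counts under a uniform prior. The counts and names are hypothetical.

```python
import random

random.seed(1)

# Hypothetical (clicks, impressions) behind each message's estimate.
message_a = (9, 10)       # higher mean, but very little data
message_b = (800, 1000)   # lower mean, far more data

def posterior_mean(clicks, impressions):
    """Mean of the Beta posterior under a uniform Beta(1, 1) prior."""
    return (clicks + 1) / (impressions + 2)

def posterior_draw(clicks, impressions):
    """One draw from the same Beta posterior."""
    return random.betavariate(clicks + 1, impressions - clicks + 1)

DRAWS = 1000
b_wins = sum(
    posterior_draw(*message_b) > posterior_draw(*message_a)
    for _ in range(DRAWS)
)
share_b_wins = b_wins / DRAWS

print(posterior_mean(*message_a) > posterior_mean(*message_b), share_b_wins)
```

Comparing means ranks message A first, yet in a sizable share of draws message B comes out ahead, so the max. operator on point estimates hides how fragile the ordering is.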

This isn’t the end of the problems. It could be that the models are built on data that doesn’t match the data in the real world. (To discover that, you would need to compare the expected error rate to the actual error rate.) And the only way to fix the issue is to collect new data and build new models on it.

Comparing messages based on propensity to be clicked is unsatisfactory. A smarter comparison would optimize for profit, ideally over the long term. Moving from clicks to profits requires reframing. Profits need not only come from clicks. People don’t always need to click on a message to be influenced by a message. They may choose to follow up at a later time. And the message may influence more than the person clicking on the message. To estimate profits, thus, you cannot rely on observational data. To estimate the payoff for showing a message, which is equal to the estimated winnings minus the estimated cost, you need to learn it from an experiment. And to compare payoffs of different messages, e.g., encourage people to use a product more, encourage people to share the product with another person, etc., you need to distill the payoffs to the same currency, ideally cash.

Expertise as a Service

3 Mar

The best thing you can say about Prediction Machines, a new book by a trio of economists, is that it is not barren. Most of the growth you see is about the obvious: the big gain from ML is our ability to predict better, and better predictions will change some businesses. For instance, Amazon will be able to move from shopping-and-then-shipping to shipping-and-then-shopping—you return what you don’t want—if it can forecast what its customers want well enough. Or, airport lounges will see reduced business if we can more accurately predict the time it takes to reach the airport.

Aside from the obvious, the book has some untended shrubs. The most promising of them is that supervised algorithms can have human judgment as a label. We have long known about the point. For instance, self-driving cars use human decisions as labels—we learn braking, steering, speed as a function of road conditions. But what if we could use expert human judgment as a label for other complex cognitive tasks? There is already software that exploits that point. Grammarly, for instance, uses editorial judgments to give advice about grammar and style. But there are so many other places where we could exploit this. You could use it to build educational tools that give guidance on better ways of doing something in real-time. You could also use it to reduce the need for experts.

p.s. The point about exploiting the intellectual property of experts deserves more attention.

Experiments Without Control

4 Jan

Say that you are in the search engine business. And say that you have built a model that estimates how relevant an ad is based on the ‘context’: search query, previous few queries, kind of device, location, and such. Now let’s assume that for context X, the rank-ordered list of ads based on expected profit is: product A, product B, and product C. Now say that you want to estimate how effective an ad for product A is in driving the sales of product A. One conventional way to estimate this is to randomly assign at serve time: for context X, serve half the people an ad for product A and serve half the people no ad. But if it is true (and you can verify this) that an ad for product B doesn’t cause people to buy product A, then you can replace the ‘no ad’ control, where you are not making any money, with an ad for product B. With this, you can estimate the effectiveness of the ad for product A while sacrificing the least amount of revenue. Better yet, if it is true that the ad for product A doesn’t cause people to buy product B, you can also at the same time get an estimate of the efficacy of the ad for product B.
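
The estimate itself is a difference in purchase rates between the two randomly assigned arms. A minimal sketch with made-up counts follows. It is valid only under the assumptions stated above, that neither ad affects purchases of the other product.

```python
# Hypothetical results for context X after random assignment at serve time.
shown_ad_a = {"n": 10000, "bought_a": 300, "bought_b": 100}
shown_ad_b = {"n": 10000, "bought_a": 200, "bought_b": 250}

def rate(arm, outcome):
    """Purchase rate for `outcome` within one arm."""
    return arm[outcome] / arm["n"]

# The arm shown ad B stands in for the 'no ad' control for product A ...
effect_of_ad_a = rate(shown_ad_a, "bought_a") - rate(shown_ad_b, "bought_a")
# ... and the arm shown ad A stands in for the control for product B.
effect_of_ad_b = rate(shown_ad_b, "bought_b") - rate(shown_ad_a, "bought_b")

print(round(effect_of_ad_a, 4), round(effect_of_ad_b, 4))
```

Each arm earns ad revenue and doubles as the other's control, which is where the savings come from.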

The Benefit of Targeting

16 Dec

What is the benefit of targeting? Why (and when) do we need experiments to estimate the benefits of targeting? And what is the right baseline to compare against?

I start with a business casual explanation, using examples to illustrate some of the issues at hand. Later in the note, I present a formal explanation to precisely describe the assumptions to clarify under what conditions targeting may be a reasonable thing to do.

Business Casual

Say that you have some TVs to sell. And say that you could show an ad about the TVs to everyone in the city for free. Your goal is to sell as many TVs as possible. Does it make sense for you to build a model to pick out people who would be especially likely to buy the TV and only show an ad to them? No, it doesn’t. Unless ads make people less likely to purchase TVs, you are always better-off reaching out to everyone.

You are wise. You use common sense to sell more TVs than the guy who spent a bunch of money building the model and sold fewer. You make tons of money. And you use the money to buy Honda and Mercedes dealerships. You still retain the magical power of being able to show ads to everyone for free. Your goal is to maximize profits. And selling a Mercedes nets you more profit than selling a Honda. Should you use a model to show some people ads about Mercedes and other people ads about Honda? The answer is still no. Under assumptions that are likely to hold, the optimal strategy is to show an ad for Mercedes first and then an ad for Honda. (You can show the Honda ad first if people who want to buy a Mercedes won’t buy a cheaper car when they see an ad for the cheaper car first.)

But what if you are limited to only one ad? What would you do? In that case, a model may make sense. Let’s see how things may look with some fake data. Let’s compare the outcomes of four strategies: two model-based targeting strategies and two target-everyone with one ad strategies. To make things easier, let’s assume that selling Mercedes nets ten units of profits and selling Honda nets five units of profit. Let’s also assume that people will only buy something if they see an ad for their preferred product.
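
A minimal sketch of such a comparison, using the profit numbers and the buying rule above. The population split and the two model accuracies are made up by me and are not the fake data from the linked pdf.

```python
PROFIT = {"mercedes": 10, "honda": 5}   # per-sale profit, from the text

# Hypothetical population: how many people prefer each product.
prefers = {"mercedes": 100, "honda": 400, "neither": 500}

def profit_same_ad_for_all(ad):
    """Everyone sees `ad`; only people who prefer that product buy."""
    return prefers[ad] * PROFIT[ad]

def profit_targeted(accuracy):
    """Expected profit when each person with a preference is shown the
    right ad with probability `accuracy` (a wrong ad yields no sale)."""
    return accuracy * sum(prefers[p] * PROFIT[p] for p in PROFIT)

results = {
    "everyone sees Mercedes ad": profit_same_ad_for_all("mercedes"),
    "everyone sees Honda ad": profit_same_ad_for_all("honda"),
    "perfect model": profit_targeted(1.0),
    "60% accurate model": profit_targeted(0.6),
}

for strategy, profit in results.items():
    print(strategy, profit)
```

With these made-up numbers, the perfect model beats both single-ad strategies, but the 60% accurate model loses to simply showing everyone the Honda ad. The right baseline is the best single ad, not no targeting at all.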

Continue reading here (pdf).

The Value of Predicting Bad Things

30 Oct

Foreknowledge of bad things is useful because it gives us an opportunity to a. prevent them, and b. plan for them.

Let’s refine our intuitions with a couple of concrete examples.

Many companies work super hard to predict customer ‘churn’—which customer is not going to use a product over a specific period (which can be the entire lifetime). If you know who is going to churn in advance, you can: a. work to prevent it, b. make better investment decisions based on expected cash flow, and c. make better resource allocation decisions.

Users “churn” because they don’t think the product is worth the price, which may be because a) they haven’t figured out a way to use the product optimally, b) a better product has come on the horizon, or c) their circumstances have changed. You can deal with this by sweetening the deal. You can prevent users from abandoning your product by offering them discounts. (It is useful to experiment to learn about the precise demand elasticity at various predicted levels of churn.) You can also give discounts in the form of free access to some premium features. Among people who don’t use the product much, you can run campaigns to help them use the product more effectively.

If you can predict cash flow, you can optimally trade off risk so that you always have cash at hand to pay your obligations. Churn predictions can also help you with resource allocation. Predicted churn can mean that you need to temporarily hire more customer success managers. Or it can mean that you need to lay off some people.

The second example is from patient care. If you could predict reasonably that someone will be seriously sick in a year’s time (and you can in many cases), you can use it to prioritize patient care, and again plan investment (if you were an insurance company) and resources (if you were a health services company).

Lastly, as is obvious, the earlier you can learn, the better you can plan. But generally, you need to trade off noise in prediction against headstart: things further away are harder to predict. The noise-headstart trade-off is something that should be made thoughtfully and amended based on data.

Optimal Sequence in Which to Service Orders

27 Jul

What is the optimal order in which to service orders assuming a fixed budget?

Let’s assume that we have to service orders o_1, …, o_n, with the n orders indexed by i. Let’s also assume that for each service order, we know how the costs change over time. For simplicity, let’s assume that time is discrete and partitioned into days. If we service order o_i at time t, we expect the cost to be c_it. Each service order also has an expiration time, j, after which the order cannot be serviced. The cost at expiration time, j, is the cost of failure and denoted by c_ij.

The optimal sequence of servicing orders is determined by expected losses—service the order first where the expected loss is the greatest. This leaves us with the question of how to estimate expected loss at time t. To come up with an expectation, we need to sum over some probability distribution. For o_it, we need the probability, p_it, that we would service o_i at t+1 till j. And then, we need to multiply p_it with c_ij. So framed, the expected loss for order i at time t =
c_{it} - \sum_{s=t+1}^{j} p_{is} c_{is}, where s indexes the days from t+1 through j.

However, determining p_it is not straightforward. New items are added to the queue at t+1. On the flip side, we also get to re-prioritize at t+1. The question is whether we will get to item o_i at t+1. (Framed this way, p_it is either 0 or 1.) For that, we need to forecast the kinds of items in the queue tomorrow. One simplification is to assume that the items in the queue today are the same as those that will be in the queue tomorrow. Then, the problem reduces to estimating the cost of punting each item again tomorrow, sorting based on the costs at t+1, and checking whether we will get to clear the item. (We can forgo the simplification by forecasting our queue tomorrow, and each day after that till j for each item, and calculating the costs.)

If the data are available, we can tack on clearing time per order and get a better answer to whether we will clear o_i at time t or not.
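One crude way to sketch the simplification above in Python is to assume that today's queue is also tomorrow's queue and to rank orders by how much their cost rises if they are punted by one day. The cost table, the order names, and the daily capacity below are hypothetical, and the one-day lookahead is a stand-in for the full expected loss, not an implementation of it.

```python
def deferral_loss(costs, t):
    """Extra cost of servicing an order at t+1 rather than at t.

    `costs[s]` is the cost of servicing on day s; the last entry is the
    cost of failure at the expiration day j.
    """
    j = len(costs) - 1
    return costs[min(t + 1, j)] - costs[min(t, j)]


def orders_to_service(cost_table, t, capacity):
    """Pick the `capacity` orders that are most expensive to punt by a day."""
    ranked = sorted(
        cost_table,
        key=lambda order: deferral_loss(cost_table[order], t),
        reverse=True,
    )
    return ranked[:capacity]


# Hypothetical costs by day: day 0, day 1, and the expiration day 2.
cost_table = {
    "o_1": [1, 2, 10],
    "o_2": [3, 3, 4],
    "o_3": [2, 6, 6],
}

print(orders_to_service(cost_table, t=0, capacity=1))
print(orders_to_service(cost_table, t=1, capacity=2))
```

With these made-up costs, o_3 goes first on day 0 because its cost jumps the most overnight, and o_1 moves to the front on day 1 because it is about to expire at a high cost.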

Optimal Sequence in Which to Schedule Appointments

1 Jul

Say that you have a travel agency. Your job is to book rooms at hotels. Some hotels fill up more quickly than others, and you want to figure out which hotels to book first so that your net booking rate is as high as it could be given the staff you have.

The logic of prioritization is simple: prioritize those hotels where the expected loss if you don’t book now is the largest. The only thing we need to do is find a way to formalize the losses. Going straight to formalization is daunting. A toy example helps.

Imagine that there are two hotels, Hotel A and Hotel B. If you call 2-days and 1-day in advance, the chances of successfully booking a room are .8 and .8 at Hotel A, and .8 and .5 at Hotel B. You can only make one call a day. So it is Hotel A or Hotel B. Also, assume that failing to book a room at Hotel A and failing to book a room at Hotel B cost the same.

If you were making a decision 1-day out on which hotel to call, the smart thing would be to choose Hotel A. The probability of making a booking is larger. But ‘larger’ can be formalized in terms of losses. On day 0, the probability goes to 0. So by not calling, you lose .8 with Hotel A and .5 with Hotel B. So the potential loss from waiting is larger for Hotel A than Hotel B.

If you were asked to choose 2-days out, which one should you choose? At Hotel A, if you forgo calling 2-days out, your chances of successfully booking a room the next day are .8. At Hotel B, the chances are .5. Let’s play out the two scenarios. If we choose to book at Hotel A 2-days out and Hotel B 1-day out, our expected batting average is (.8 + .5)/2. If we choose the opposite, our batting average is (.8 + .8)/2. It makes sense to choose the latter. Framed as expected losses, waiting a day takes Hotel A from .8 to .8, an expected loss of 0, and Hotel B from .8 to .5, an expected loss of .3. So we should book Hotel B 2-days out.
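The two scenarios can be enumerated in a few lines of Python. This is only a sketch of the toy example: the probabilities are the ones given above, while the function and variable names are mine.

```python
from itertools import permutations

# Success probabilities from the toy example, keyed by days before the stay.
success = {
    "Hotel A": {2: 0.8, 1: 0.8},
    "Hotel B": {2: 0.8, 1: 0.5},
}


def batting_average(order):
    """Average booking rate when order[0] is called 2-days out
    and order[1] is called 1-day out (one call per day)."""
    days = (2, 1)
    return sum(success[hotel][day] for hotel, day in zip(order, days)) / len(order)


# Score both possible call schedules and keep the best one.
schedules = {order: batting_average(order) for order in permutations(success)}
best = max(schedules, key=schedules.get)

for order, avg in schedules.items():
    print(f"call {order[0]} 2-days out, {order[1]} 1-day out: {avg:.2f}")
print("best schedule starts with", best[0])
```

The enumeration agrees with the expected-loss framing: calling Hotel B first gives .8 and calling Hotel A first gives .65.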

Now that we have the intuition, let’s move to 3-days, 2-days, and 1-day out as that generalizes to k-days out nicely. To understand the logic, let’s first work out a 101 probability question. Say that you have two fair coins that you toss independently. What is the chance of getting at least one head? The potential outcomes are HH, HT, TH, and TT. The chance is 3/4. Equivalently, it is 1 minus the chance of getting TT (two failures), or 1 – .5*.5.

The 3-days out example is next. See below for the table. If you miss the chance of calling Hotel A 3-days out, the expected loss is the decline in the chance of success when you are left to book 2-days or 1-day out. Assume that the probabilities 2-days out and 1-day out are independent and it becomes something similar to the example about coins. The probability of successfully booking 2-days or 1-day out is thus 1 – the probability of failing on both days. Calculate expected losses for each hotel and now you have a way to decide which hotel to call 3-days out.

|       | 3-day | 2-day | 1-day |
|-------|-------|-------|-------|
| Hotel A | .9    | .9    | .4    |
| Hotel B | .9    | .9    | .9    |

In our example, the numbers for Hotel A and Hotel B come to 1 – (1/10)*(6/10) = .94 and 1 – (1/10)*(1/10) = .99 respectively. Based on that, we should call Hotel A 3-days out before we call Hotel B.
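The same calculation extends to k-days out. Here is a short Python sketch using the table above; it assumes, as the text does, that the attempts on later days are independent. The helper name `later_success` is mine.

```python
from math import prod

# Success probabilities from the table, ordered 3-day, 2-day, 1-day out.
success = {
    "Hotel A": [0.9, 0.9, 0.4],
    "Hotel B": [0.9, 0.9, 0.9],
}


def later_success(probs_on_later_days):
    """Chance that at least one later attempt succeeds,
    assuming the attempts are independent."""
    return 1 - prod(1 - p for p in probs_on_later_days)


# If we skip today's call, what are our chances on the remaining days?
fallback = {hotel: later_success(p[1:]) for hotel, p in success.items()}

# Call the hotel with the weakest fallback first.
first_call = min(fallback, key=fallback.get)

for hotel, chance in fallback.items():
    print(f"{hotel}: {chance:.2f}")
print("call first:", first_call)
```

Passing a longer list of probabilities to `later_success` handles any number of remaining days.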

Targeting 101

22 Jun

Targeting Economics

Say that there is a company that makes more than one product. And users of any one of its products don’t use all of its products. In effect, the company has a captive audience. The company can run an ad in any of its products about the one or more other products that a user doesn’t use. Should it consider targeting—showing different (number of) ads to different users? There are five things to consider:

  • Opportunity Cost: If the opportunity is limited, could the company make more profit by showing an ad about something else?
  • The Cost of Showing an Ad to an Additional User: The cost of serving an ad; it is close to zero in the digital economy.
  • The Cost of a Worse Product: As a result of seeing an irrelevant ad in the product, the user likes the product less. (The magnitude of the reduction depends on how disruptive the ad is and how irrelevant it is.) The company suffers in the end as its long-term profits are lower.
  • Poisoning the Well: Showing an irrelevant ad means that people are more likely to skip whatever ad you present next. It reduces the company’s ability to pitch other products successfully.
  • Profits: On the flip side of the ledger are expected profits. What are the expected profits from showing an ad? If you show a user an ad for a relevant product, they may not just buy and use the other product, but may also become less likely to switch from your stack. Further, they may even proselytize your product, netting you more users.
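As a rough illustration of how the five considerations combine, here is a hypothetical Python sketch of a per-user decision rule: show the ad only if expected profit exceeds the sum of the four costs. The field names, the additive form, and the numbers are all assumptions made for illustration. This is not the formalization in the linked pdf.

```python
from dataclasses import dataclass


@dataclass
class AdEconomics:
    """Hypothetical per-user inputs, all in the same profit units."""
    p_convert: float          # chance the user buys after seeing the ad
    profit_if_convert: float  # profit from a sale, incl. retention and referrals
    serving_cost: float       # cost of showing the ad to one more user
    product_harm: float       # long-term cost of a worse product experience
    well_poisoning: float     # cost of the user skipping future ads
    opportunity_cost: float   # profit forgone by not showing something else

    def net_value(self):
        """Expected profit from showing the ad minus the sum of the costs."""
        expected_profit = self.p_convert * self.profit_if_convert
        costs = (
            self.serving_cost
            + self.product_harm
            + self.well_poisoning
            + self.opportunity_cost
        )
        return expected_profit - costs


def should_show(ad):
    """Show the ad only when its expected net value is positive."""
    return ad.net_value() > 0


likely_buyer = AdEconomics(0.05, 100, 0.0, 1.0, 0.5, 2.0)
unlikely_buyer = AdEconomics(0.01, 100, 0.0, 1.0, 0.5, 2.0)
print(should_show(likely_buyer), should_show(unlikely_buyer))
```

In this sketch, targeting is worthwhile only if the costs are large enough that some users have a negative net value. When every cost is close to zero, the rule shows the ad to everyone.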

I formalize the problem here (pdf).

Learning About [the] Loss (Function)

7 Nov

One of the things we often want to learn is the actual loss function people use for discounting ideological distance between self and a legislator. Often people try to learn the loss function using actual distances. But if the aim is to learn the loss function, perceived distance is better than actual distance because perceived distance is what the voter believes to be true. People can then use the function to simulate scenarios in which perceptions match the facts.

Incentives to Care

11 Sep

A lot of people have their lives cut short because they eat too much and exercise too little. Worse, the quality of their shortened lives is typically much lower as a result of ‘avoidable’ illnesses that stem from ‘bad behavior.’ And that isn’t all. People who are not feeling great are unlikely to be as productive as those who are. Ill-health also imposes a significant psychological cost on loved ones. The net social cost is likely enormous.

One way to reduce such costly avoidable misery is to invest upfront. Teach people good habits and psychological skills early on, and they will be less likely to self-harm.

So why do we invest so little up front? Especially when we know that people are ill-informed (about the consequences of their actions) and myopic.

Part of the answer is that there are few incentives for anyone else to care. Health insurance companies don’t make their profits by caring. They make them by investing wisely. And by minimizing ‘avoidable’ short-term costs. If a member is unlikely to stick with a health plan for life, why invest in their long-term welfare? Or work to minimize negative externalities that may affect the next generation?

One way to make health insurers care is to rate them on estimated quality-adjusted years saved due to interventions they sponsored. That needs good interventions and good data science. And that is an opportunity. Another way is to get the government to invest heavily early on to address this market failure. Another version would be to get the government to subsidize care that reduces long-term costs.