Growth Funnels

24 Sep

You spend a ton of time building a product to solve a particular problem. You launch. Then, either the kinds of people whose problem you are solving never arrive or they arrive and then leave. You want to know why because if you know the reason, you can work to address the underlying causes.

Often, however, businesses only have access to observational data. And since we can’t answer the why with observational data, people have swapped the why with the where. Answering the where can give us a window into the why (though only a window). And that can be useful.

The traditional way of posing ‘where’ is called a funnel. Conventionally, funnels start at when the customer arrives on the website. We will forgo convention and start at the top.

Conditional on the product you have built, there are only three things you should work to do optimally:

1. Help people discover your product
2. Effectively convey the relevant value of the product to those who have discovered your product
3. Help people effectively use the product

p.s. When the product changes, you have to do all three all over again.

One way funnels can potentially help is triage: how big is the ‘leak’ at each ‘stage’? The funnel over the first two steps is: of the people who discovered the product, how many did we successfully communicate the relevant value of the product to? Posing the problem this way makes it seem more powerful than it is. To come up with a number, you generally only have noisy behavioral signals, for instance, the proportion of people who visit the site who sign up. But a low proportion could be the result of a large denominator: lots of people are visiting the site, but the product is not relevant for most of them. (If bringing people to the site costs nothing, there is nothing to do.) Or it could be because you have a super kludgy sign-up process. You could drill down to try to get at such clues, but the number of potential locations for drilling remains large.

That brings us to the next point. Macro-triaging is useful but only for resource allocation kinds of decisions based on some assumptions. It doesn’t provide a way to get concrete answers to real issues. For concrete answers, you need funnels around concrete workflows. For instance, for a referral ‘product’ for AirBnB, one way to define steps (based on Gustaf Alstromer) is as follows:

1. how many people saw the link asking people to refer
2. of the people who saw the link, how many clicked on it
3. of the people who clicked on it, how many invited people
4. of the people who were invited, how many signed up (as a user, guest, host)
5. of the people who signed up, how many made the first booking

Such a funnel allows you to optimize the workflow by experimenting with the user interface at each stage. It allows you to analyze what users were trying to do but failed to do, or took a long time doing. It also allows you to analyze how optimally (from the company’s perspective) users are taking an action. For instance, people may want to invite all their friends, but the UI doesn’t have a convenient way to import contacts from email.
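To make the triage concrete, here is a minimal sketch of computing step-by-step conversion for a referral funnel like the one above. The counts are made up; in practice they would come from event logs.

```python
# Step-by-step conversion for a referral funnel. Counts are hypothetical.
funnel = [
    ("saw the referral link", 100_000),
    ("clicked the link", 18_000),
    ("invited someone", 6_000),
    ("invitee signed up", 2_400),
    ("invitee made a first booking", 600),
]

# Conversion at each step is conditional on completing the previous step.
for (prev_step, prev_n), (step, n) in zip(funnel, funnel[1:]):
    print(f"{prev_step} -> {step}: {n / prev_n:.1%}")
```

The step with the lowest conditional conversion is where the biggest ‘leak’ is, and hence a candidate for UI experiments.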

Sometimes just writing out the steps in a workflow can be useful. It prompts people to think about whether some steps are actually needed. For instance, is signing in needed before we allow a user to understand the value of the product?

Conscious Uncoupling: Separating Compliance from Treatment

18 Sep

Let’s say that we want to measure the effect of a phone call encouraging people to register to vote on voting. Let’s define compliance as a person taking the call. And let’s assume that the compliance rate is low. The traditional way to estimate the effect of the phone call is via an RCT: randomly split the sample into Treatment and Control, call everyone in the Treatment Group, wait till after the election, and calculate the difference in the proportion who voted. Assuming that the treatment doesn’t affect non-compliers, etc., we can also estimate the Complier Average Treatment Effect.

But one way to think about non-compliance in the example above is as follows: “Buddy, you need to reach these people some other way.” That is a super useful thing to know, but it is an observational point. You can fit a predictive model for who picks up phone calls and who doesn’t. The experiment is useful in answering how much you can persuade the people you reach on the phone. And you can learn that by randomizing conditional on compliance.

For such cases, here’s what we can do:

1. Call a reasonably large random sample of people. Learn a model for who complies.
2. Use it to target people who are likelier to comply, and randomize after a person picks up.
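A minimal sketch of this two-stage design with simulated data. Everything here is hypothetical: the age covariate, the pickup probabilities, and the deliberately crude bucket-based compliance model (in practice you would fit a proper classifier).

```python
import random

random.seed(0)

def simulate_pickup(age):
    # Hypothetical world: older people are likelier to answer the phone.
    return random.random() < age / 120

# Stage 1: call a reasonably large random sample; record who complies.
pilot = [random.randint(18, 90) for _ in range(5000)]
buckets = {}  # age decade -> [n_called, n_answered]
for age in pilot:
    b = buckets.setdefault(age // 10, [0, 0])
    b[0] += 1
    b[1] += simulate_pickup(age)

def p_comply(age):
    # A deliberately simple model: the pickup rate in the person's age bucket.
    called, answered = buckets.get(age // 10, (1, 0))
    return answered / called

# Stage 2: target likely compliers; randomize only after a person picks up.
population = [random.randint(18, 90) for _ in range(10_000)]
treated, control = [], []
for age in population:
    if p_comply(age) < 0.5:
        continue  # reach these people some other way
    if simulate_pickup(age):  # the person answered the phone
        (treated if random.random() < 0.5 else control).append(age)
```

Because randomization happens after pickup, the treatment-control comparison is among compliers by construction.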

More generally, Average Treatment Effect is useful for global rollouts of one policy. But when is that a good counterfactual to learn? Tautologically, when that is all you can do or when it is the optimal thing to do. If we are not in that world, why not learn about—and I am using the example to be concrete—a) what is a good way to reach me, b) what message do you want to show me. For instance, for political campaigns, the optimal strategy is to estimate the cost of reaching people by phone, mail, f2f, etc., estimate the probability of reaching each using each of the media, estimate the payoff for different messages for different kinds of people, and then target using the medium and the message that delivers the greatest benefit. (For a discussion about targeting, see here.)

But technically, a message could have the greatest payoff for the person who is least likely to comply. And the optimal strategy could still be to call everyone. To learn treatment effects among people who are unlikely to comply (using a particular method), you will need to build experiments to increase compliance. More generally, if you are thinking about multi-arm bandits or some such dynamic learning system, the insight is to have treatment arms around both compliance and message. The other general point, implicit in the essay, is that rather than be fixated on calculating ATE, we should be fixated on an optimization objective, e.g., the additional number of people persuaded to turn out to vote per dollar.

Prediction Errors: Using ML For Measurement

1 Sep

Say you want to measure how often people visit pornographic domains over some period. To measure that, you build a model to predict whether or not a domain hosts pornography. And let’s assume that for the chosen classification threshold, the False Positive rate (FP) is 10% and the False Negative rate (FN) is 7%. Below, we discuss some of the concerns with using scores from such a model and ways to address the issues.

Let’s get some notation out of the way. Let’s say that we have $n$ users and that we can iterate over them using $i$. Let’s denote the total number of unique domains (domains visited by any of the $n$ users at least once during the observation window) by $k$. And let’s use $j$ to iterate over the domains. Let’s denote the number of visits to domain $j$ by user $i$ by $c_{ij} \in \{0, 1, 2, \ldots\}$. And let’s denote the total number of unique domains a person visits ($\sum_j \mathbb{1}(c_{ij} > 0)$) by $t_i$. Lastly, let’s denote the predicted label for whether or not domain $j$ hosts pornography by $p_j$, so we have $p_1, \ldots, p_j, \ldots, p_k$.

Let’s start with a simple point. Say there are 5 domains, all predicted to host pornography: $p_1 = \cdots = p_5 = 1$. Say user one visits the first three sites once each and user two visits all five sites once each. Given 10% of the positive predictions are false, the expected measurement error in user one’s score is $3 \times .10$ and the expected measurement error in user two’s score is $5 \times .10$. The general point is that the expected number of false positives increases with the number of predicted 1s. And the expected number of false negatives increases with the number of predicted 0s.
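A small sketch of the accounting above. The rates and visit counts are made up, and, as in the example, I treat the error rates as applying per predicted label.

```python
# Expected over- and under-count in a user's tally of visits to
# predicted-porn domains, given per-prediction error rates.
FP = 0.10  # share of predicted 1s that are actually 0
FN = 0.07  # share of predicted 0s that are actually 1

def expected_error(n_predicted_1_visited, n_predicted_0_visited):
    # Overcount grows with predicted 1s; undercount with predicted 0s.
    overcount = n_predicted_1_visited * FP
    undercount = n_predicted_0_visited * FN
    return overcount, undercount

user_one = expected_error(3, 0)  # visits three predicted-1 domains
user_two = expected_error(5, 0)  # visits all five predicted-1 domains
```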

Technical Leadership: Building Great Teams and Products

31 Aug

Companies bet on people. They bet that the people they have hired will figure out the problems and their solutions. To make the bet credible, the golden rule is to hire competent people. Other virtues matter, but competence generally matters the most in technical roles.

Once you have made the bet, you must double down on it. Trust the people you hire. And earn their trust by being open, empathetic, transparent, thoughtful, self-aware, a good listener, and excellent at your job. Once people are sure that you are there to enable them and grow them, increasing their productivity becomes a lot easier.

After laying the foundation of trust, you need to build culture. The important aspects of culture are:

1. inclusion
2. intellectual openness
3. technical excellence
4. focus on the business problem
5. interest in doing the right thing
6. accountability
7. hard work

To build a culture of technical excellence and intellectual openness, I use the following techniques:

• Probe everything. Ask why. Ask why again.
• Ask people to explain things as simply as possible.
• For anything complex, ask people to write things down in plain English.
• Get perspective from people with different technical competencies.
• Frontload your thinking on a project.
• Lead a meeting to ‘think together’ about projects where tough, thoughtful probing is encouraged.
• Establish guidelines for coding excellence and peer review each major PR.
• Lead a meeting where different members of the team teach each other — I call it a ‘learn together.’

I also work hard to build a shared understanding, and make sure that the team has a place to share and reflect on it. A shared understanding of the space helps drive clarity about actions. It also provides a jumping-off point for planning. Without knowing what you know, you start each planning cycle like it is Groundhog Day.

I actively use Notion to think through the broader problem and the sub-problems, and spend time organizing and grooming it. Build something that inspires others to contribute to that shared understanding.

I encourage active discussion and shared ownership. We own each page as a team, if not as a company. But eventually, one person is listed as a primary point of contact who takes the responsibility of keeping it updated.

To drive accountability, I drive transparency. For that, I use the following methods:

1. Shared folders: a shared team Google Drive folder with project-level folders nested within it. I dump most documents that are sent to me in the shared folder and organize the folder periodically. There is even a personal growth folder where I store things like ‘how to write an effective review,’ etc.
2. Be a good router: I encourage people to err toward spamming. I catch whatever falls through the cracks by 1. forwarding relevant emails, 2. posting attachments to Google Drive, and 3. posting to the relevant section on Notion.
3. Visibility on intermediate outputs: I encourage everyone to post intermediate outputs, questions, concerns, etc. to the team channel, and encourage people to share their final outputs widely.
4. Use PPP: Each member of the team shares a Progress, Plans, and Problems each week along with links to outputs.

Culture allows you to produce great things if you have the right process and product focus. There are some principles of product and process management that I follow. My tenets of product management are:

• Formalize the business problem
• Define the API before moving too far along.
• Scope and execute a V1 before executing on the full solution. Give yourself the opportunity to learn from the world.
• Do less. Do well.

The process for product development that I ask the team to follow can be split into planning, execution, and post-execution.

Planning: What to Work On?

1. Try to understand the question and the use case as best as you can. Get as much detail as possible.
2. Talk to at least one more person on the team. First ask them to convince you why you shouldn’t do this; then ask, if you are doing this, what else you might be missing.
3. For larger projects, write a document and get feedback from all the stakeholders.
4. As needed, hold a ‘think together’ with a diverse, cross-functional group. Keep the group small; I have found ~5–8 is the sweet spot.
5. Get an ROI estimate and rank against other ideas.

How to execute?

1. Own your project on whatever project management software you use.
2. Own communication on where things are on the project, new discoveries, new challenges, etc.
3. Ask people to review each commit. Treat reviewing seriously. Generally, the review process that works well is people explaining each line to another person. In the act of explaining, you catch your hidden assumptions, etc.
4. Come up with at least a few sanity checks for your code.

Communicating What You Learn, Within and Outside

1. Share discoveries aggressively. Err on the side of over-communication.
2. Contribute to Notion so that there is always a one-stop-shop around how we understand X.
3. Create effective presentations

Comparing Ad Targeting Regimes

30 Aug

Ad targeting is often useful when you have multiple things to sell (opportunity cost) or when the cost of running an ad is non-trivial or when an irrelevant ad reduces your ability to reach the user later or any combination of the above. (For a more formal treatment, see here.)

But say that you want proof—you want to estimate the benefit of targeting. How would you do it?

When there is one product to sell, some people have gone about it as follows: randomize into treatment and control, show the ad to a random subset of respondents in the control group and to an equal number of respondents picked by a model in the treatment group, and compare the outcomes of the two groups (it reduces to comparing the subsets unless there are spillovers). This experiment can be thought of as a way to estimate how to spend a fixed budget optimally. (In this case, the budget is the number of ads you can run.) But if you are interested in whether a budget allocated by a model would be more optimal than, say, random allocation, you don’t need an experiment (unless there are spillovers). All you need to do is show the ad to a random set of users. For each user, you know whether or not the model would have selected them to see the ad. And you can use this information to calculate payoffs for the respondents chosen by the model and for a randomly selected group.

Let me expand for clarity. Say that you can measure profit from ads using CTR. Say that we have built two different models for selecting people to whom we should show ads—Model A and Model B. Now say that we want to compare which model yields a higher CTR. We can have four potential scenarios for selection of respondents by the model:

model_a, model_b
0, 0
1, 0
0, 1
1, 1

For CTR, the 0–0 cell doesn’t add any information: CTR is a probability conditional on being shown the ad. To measure which of the models is better, draw a fixed-size random sample of users picked by model_a and another random sample of the same size from users picked by model_b, and compare CTRs. (The same user can be picked twice. It doesn’t matter.)
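Here is a sketch of that logic: show the ad to a random set of users, record which users each model would have picked, and compare CTRs offline. The data-generating process is simulated, and the click probabilities are made up.

```python
import random

random.seed(1)

# Show the ad to a random set of users; record whether each model would
# have picked the user and whether the user clicked. In practice, the
# picks come from scoring the exposure log with each model.
log = []
for _ in range(20_000):
    a_pick = random.random() < 0.3
    b_pick = random.random() < 0.3
    # Hypothetical world: model A's picks click more often.
    p_click = 0.05 + 0.10 * a_pick + 0.02 * b_pick
    log.append((a_pick, b_pick, random.random() < p_click))

def ctr(picked):
    shown = [click for a, b, click in log if picked(a, b)]
    return sum(shown) / len(shown)

ctr_a = ctr(lambda a, b: a)  # CTR among users model A would pick
ctr_b = ctr(lambda a, b: b)  # CTR among users model B would pick
```

No experiment was needed: because exposure was random, each model's counterfactual payoff can be read off the same log.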

Now that we know what to do, let’s understand why experiments are wasteful. The heuristic account is as follows: experiments are there to compare ‘similar people.’ When estimating allocative efficiency of picking different sets of people, we are tautologically comparing different people. That is the point of the comparison.

All this still leaves the question of how we would measure the benefit of targeting. If you had only one ad to run and wanted to choose between showing the advertisement to everyone versus fewer people, show the ad to everyone and compare the profits from the rows selected by the model to the profits from showing the ad to everyone. Generally, showing the ad to everyone will win.

If you had multiple ads, you would need to randomize. Assign each person in the treatment group to a targeted ad. In the control group, you could show an ad for a random product. Or you could show an advertisement for any one product that yields the maximum revenue. Pick whichever number is higher as the one to compare against.

What’s Relevant? Learning from Organic Growth

26 Aug

Say that we want to find people to whom a product is relevant. One way to do that is to launch a small campaign advertising the product and learn from people who click on the ad, or better yet, learn from people who not just click on the ad but go and try out the product and end up using it. But if you didn’t have the luxury of running a small campaign and waiting a while, you can learn from organic growth.

Conventionally, people learn from organic growth by posing it as a supervised problem. And they generate the labels as follows: people who have ‘never’ (mostly: in the last 6–12 months) used the product are labeled as 0 and people who “adopted” the product in the latest time period, e.g., over the last month, are labeled 1. People who have used the product in the last 6–12 months or so are filtered out.

There are three problems with generating labels this way. First, not all the people who ‘adopt’ a product continue to use it. Many of the people who try it find that it is not useful, or find the price too high, and abandon it. This means that a lot of 1s are mislabeled. Second, the cleanest 1s are the people who ‘adopted’ the product some time ago and have continued to use it since. Removing those is thus a bad idea. Third, the good 0s are those who tried the product but didn’t persist with it, not those who never tried the product. Generating the labels in this manner also mitigates one of the significant problems with learning from organic growth: people who organically find a product are different from those who don’t. Here, you are subsetting on the kinds of people who found the product, except that one group found it useful and the other did not. This empirical strategy has its problems, but it is distinctly better than the conventional approach.
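A sketch of the improved labeling scheme: 1s are adopters who persisted, 0s are triers who churned, and never-triers are excluded. The reference date, field names, and thresholds are all hypothetical.

```python
from datetime import date

# Hypothetical reference date and thresholds.
TODAY = date(2020, 8, 26)
PERSIST_DAYS = 90   # used the product at least this long -> clean 1
CHURN_DAYS = 30     # no use for this long after trying -> clean 0

def label(first_use, last_use):
    """Return 1, 0, or None (excluded) given first/last product-use dates."""
    if first_use is None:
        return None  # never tried the product: excluded, not a 0
    if (last_use - first_use).days >= PERSIST_DAYS:
        return 1  # adopted and persisted
    if (TODAY - last_use).days > CHURN_DAYS:
        return 0  # tried the product but did not persist
    return None  # adopted too recently to tell
```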

Sometimes Scientists Spread Misinformation

24 Aug

To err is human. Good scientists are aware of that, painfully so. The model scientist obsessively checks everything twice over and still keeps eyes peeled for loose ends. So it is a shock to learn that some of us are culpable for spreading misinformation.

Ken and I find that articles with serious errors, even articles based on fraudulent data, continue to be approvingly cited—cited without any mention of any concern—long after the problems have been publicized. Using a novel database of over 3,000 retracted articles and over 74,000 citations to these articles, we find that at least 31% of the citations to retracted articles happen a year after the publication of the retraction notice. And that over 90% of these citations are approving.

What gives our findings particular teeth is the role citations play in science. Many, if not most, claims in a scientific article rely on work done by others. And scientists use citations to back such claims. The readers rely on scientists to note any concerns that impinge on the underlying evidence for the claim. And when scientists cite problematic articles without noting any concerns they very plausibly misinform their readers.

Though 74,000 is a large enough number to be deeply concerning, retractions are relatively infrequent. And that may lead some people to discount these results. Retractions may be infrequent but citations to retracted articles post-retraction are extremely revealing. Retractions are a low-low bar. Retractions are often a result of convincing evidence of serious malpractice, generally fraud or serious error. Anything else, for example, a serious error in data analysis, is usually allowed to self-correct. And if scientists are approvingly citing retracted articles after they have been retracted, it means that they have failed to hurdle the low-low bar. Such failure suggests a broader malaise.

To investigate the broader malaise, Ken and I exploited data from an article published in Nature that notes a statistical error in a series of articles published in prominent journals. Once again, we find that approving citations to erroneous articles persist after the error has been publicized. After the error has been publicized, the rate of citation to erroneous articles is, if anything, higher, and 98% of the citations are approving.

In all, it seems, we are failing.

The New Unit of Scientific Production

11 Aug

One fundamental principle of science is that there is no privileged observer. You get to question what people did. But to question, you first must know what people did. So part of good scientific practice is to make it easy for people to understand how the sausage was made—how the data were collected, transformed, and analyzed—and ideally, why you chose to make the sausage that particular way. Papers are ok places for describing all this, but we now have better tools: version controlled repositories with notebooks and readme files.

The barrier to understanding is not just lack of information, but also poorly organized information. There are three different arcs of information: cross-sectional (where everything is and how it relates to each other), temporal (how the pieces evolve over time), and inter-personal (who is making the changes). To be organized cross-sectionally, you need to be macro organized (where is the data, where are the scripts, what do each of the scripts do, how do I know what the data mean, etc.), and micro organized (have logic and organization to each script; this also means following good coding style). Temporal organization in version control simply requires you to have meaningful commit messages. And inter-personal organization requires no effort at all, beyond the logic of pull requests.

The obvious benefits of this new way are known. What is less discussed is that this new way allows you to critique specific pull requests and decisions made in specific commits. This provides an entirely new way to make progress in science. The new unit of science also means that we don’t just dole out credit in crude currency like journal articles; we can also provide lower denominations. We can credit each edit, each suggestion. And why not? The third big benefit is that we can build epistemological trees where the logic of disagreement is clear.

The dead tree edition is dead. It is also time to retire the e-version of the dead tree edition.

Quality Data: Plumbing ML Data Pipelines

6 Aug

What’s the difference between a scientist and a data scientist? Scientists often collect their own data, and data scientists often use data collected by other people. That is part jest but speaks to an important point. Good scientists know their data. Good data scientists must know their data too. To help data scientists learn about the data they use, we need to build systems that give them good data about the data. But what is good data about the data? And how do we build systems that deliver that? Here’s some advice (tailored toward rectangular data for convenience):

• From Where, How Much, and Such
  • Provenance: how was each of the columns in the data created (obtained)? If the data are derivative, find out the provenance of the original data. Be as concrete as possible, linking to scripts, related teams, and such.
  • How frequently are the data updated?
  • Cost per unit of data, e.g., a cell in rectangular data.
  Both the frequency with which the data are updated and the cost per unit of data may change over time. Provenance may change as well: a new team (person) may start managing the data. So the person who ‘owns’ the data must come back to these questions every so often. Come up with a plan.
• What? To know what the data mean, you need a data dictionary. A data dictionary explains the key characteristics of the data. It includes:
1. Information about each of the columns in plain language.
2. How were the data collected? For instance, if you conducted a survey, you need the question text and the response options (if any) that were offered, along with the ‘mode,’ where in the sequence of questions it lay, whether it was alone on the screen, etc.
3. Data type
4. How (if at all) are missing values generated?
5. For integer columns, it gives the range, sd, mean, median, n_0s, and n_missing. For categorical, it gives the number of unique values, what each label means, and a frequency table that includes n_missing (if missing can be of multiple types, show a row for each).
6. The number of duplicates in the data, whether they are allowed, and why you would see them.
7. Number of rows and columns
8. Sampling
9. For supervised models, store correlation of y with key x_vars
• What If? What if you have a question? Who should you bug? Who ‘owns’ the ‘column’ of data?

Store these data in JSON so that you can validate new ingests against them. Produce the JSON with each update. You can flag when data are some s.d. above or below the last ingest.
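A sketch of what producing and checking such a JSON summary per ingest might look like. The column names, the sample data, and the drift rule (flag when the mean moves more than k s.d. from the last ingest) are illustrative.

```python
import json
import statistics

def summarize(rows, numeric_cols):
    """A minimal per-column summary to store as JSON with each ingest."""
    summary = {}
    for col in numeric_cols:
        values = [r[col] for r in rows if r.get(col) is not None]
        summary[col] = {
            "n": len(values),
            "n_missing": len(rows) - len(values),
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values),
            "min": min(values),
            "max": max(values),
        }
    return summary

def drift_flags(prev, curr, k=3):
    """Columns whose mean moved more than k s.d. (of the last ingest)."""
    flags = []
    for col, stats in curr.items():
        last = prev.get(col)
        if last and abs(stats["mean"] - last["mean"]) > k * last["sd"]:
            flags.append(col)
    return flags

rows = [{"price": 10.0}, {"price": 12.0}, {"price": None}, {"price": 11.0}]
dictionary = summarize(rows, ["price"])
print(json.dumps(dictionary, indent=2))
```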

Store all this metadata with the data. For example, you can extend the dataframe class in Scala to make it so.

Auto-generate reports in markdown with each ingest.

In many ML applications, you are also ingesting data back from the user. So you need the same as above for the data you are getting from the user (and some of it at least needs to match the stored data).

For any derived data, you need the scripts and the logic, ideally in a notebook. This is your translation function.

Where possible, follow the third normal form of databases. Only store translations when translation is expensive. Even then, think twice.

Lastly, some quality control. Periodically sit down with your team to check whether you should be seeing what you are seeing. For instance, if you are in the survey business, do the completion times make sense? If you are doing supervised learning, get a random sample of labels and assess their quality. You can also assess the quality by looking at the classification errors your supervised model makes. Are the errors because the data are mislabeled? Keep iterating. Keep improving. And keep cataloging those improvements. You should be able to ‘diff’ data collection, not just numerical summaries of data. And with the method I highlight above, you should be able to.

Optimal Sequence in Which to Service Orders

27 Jul

What is the optimal order in which to service orders assuming a fixed budget?

Let’s assume that we have to service orders o_1, …, o_n, with the n orders iterated over by i. Let’s also assume that for each service order, we know how the costs change over time. For simplicity, let’s assume that time is discrete and partitioned into units of days. If we service order o_i at time t, we expect the cost to be c_it. Each service order also has an expiration time, j, after which the order cannot be serviced. The cost at the expiration time is the cost of failure, denoted by c_ij.

The optimal sequence of servicing orders is determined by expected losses: service first the order where the expected loss is the greatest. This leaves us with the question of how to estimate the expected loss at time t. To come up with an expectation, we need to sum over some probability distribution. For each future period t' = t+1, …, j, we need the probability, p_it', that we would service o_i at t', and we need to weight the corresponding cost, c_it', by that probability. So framed, the expected loss for order i at time t =
c_it − Σ_{t'=t+1}^{j} p_it' * c_it'

However, determining p_it is not straightforward. New items are added to the queue at t+1. On the flip side, we also get to re-prioritize at t+1. The question is whether we will get to item o_i at t+1 (which means p_it is 0 or 1). For that, we need to forecast the kinds of items in the queue tomorrow. One simplification is to assume that the items in the queue today are the same ones that will be in the queue tomorrow. Then, the problem reduces to estimating the cost of punting each item again tomorrow, sorting based on the costs at t+1, and checking whether we will get to clear the item. (We can forgo the simplification by forecasting our queue tomorrow, and each day after that till j for each item, and calculating the costs.)
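Under that simplification, the re-prioritization step can be sketched as follows. The orders, costs, and daily capacity are made up.

```python
# Assume tomorrow's queue equals today's. Rank items by the cost of
# punting them (cost at t+1) and check which ones we get to clear given
# a fixed daily capacity. That yields a 0/1 p_it per item.
orders = [
    # (name, cost if serviced today, cost if serviced tomorrow)
    ("o1", 10.0, 30.0),
    ("o2", 20.0, 22.0),
    ("o3", 5.0, 50.0),
]
CAPACITY = 2  # orders we can service per day

# Sort by the cost of punting, highest first.
ranked = sorted(orders, key=lambda o: o[2], reverse=True)
cleared_tomorrow = {name for name, _, _ in ranked[:CAPACITY]}

# p_it = 1 if we will get to o_i tomorrow, else 0. Items we will not
# get to tomorrow are the ones to worry about servicing today.
p = {name: int(name in cleared_tomorrow) for name, _, _ in orders}
```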

If the data are available, we can tack on the clearing time per order and get a better answer to whether we will clear o_i at time t or not.