Canonical Insights

20 Oct

If the canonical insight of computer science is automating repetition, the canonical insight of data science is optimization. It isn’t that computer scientists haven’t thought about optimization. They have. But computer scientists weren’t the first to think about automation, just like economists weren’t the first to think that incentives matter. Automation is just the canonical, foundational purpose of computer science.

Similarly, optimization is the canonical, foundational purpose of data science. Data science aims to provide the “optimal” action at time t conditional on what you know. And it aims to do that by learning from data optimally. For instance, if the aim is to separate apples from oranges, the aim of supervised learning is to give the best estimate of whether the fruit is an apple or an orange given data.

For certain kinds of problems, the optimal way to learn from data is not to exploit found data but to learn from new data collected in an optimal way. For instance, randomized inference allows us to compare two arbitrary regimes. And if you want to optimize persuasiveness, you need to continuously experiment with different pitches (the number of dimensions along which pitches can vary can be large), some of which exploit human frailties (which vary across people) and some of which exploit the fact that people need to be pitched the relevant value, and that relevant value differs across people.

Once you know the canonical insight of a discipline, it opens up all the problems that can be “solved” by it. It also tells you what kind of platform you need to build to make optimal decisions for that problem. For some tasks, the “platform” may be supervised learning. For other tasks, like ad persuasiveness, it may be a platform that combines supervised learning (for targeting) and experimentation (for optimizing the pitch).

Don’t Expose Yourself! Discretionary Exposure to Political Information

10 Oct

As the options have grown, so have the fears. Are the politically disinterested taking advantage of the nearly limitless options to opt out of news entirely? Are the politically interested siloing themselves into “echo chambers”? In an eponymous Oxford Research Encyclopedia article, I discuss what we think we know, and some concerns about how we can know. Some key points:

  • Is the gap between how much the politically interested and politically disinterested know about politics increasing, as Post-broadcast Democracy posits? Figure 1 suggests not.

  • Quantity rather than ratio: “If the dependent variable is partisan affect, how ‘selective’ one is may not matter as much as the net imbalance in consumption—the difference between the number of congenial and uncongenial bits consumed…”

  • To measure how much political information a person is consuming, you must be able to distinguish political information from its complement. But what isn’t political information? “In this chapter, our focus is on consumption of varieties of political information. The genus is political information. And the species of this genus differ in congeniality, among other things. But what is political information? All information that influences people’s political attitudes or behaviors? If so, then limiting ourselves to news is likely too constraining. Popular television shows like The Handmaid’s Tale, Narcos, and Law and Order have clear political themes. … Shows like Will and Grace and The Cosby Show may be less clearly political, but they also have a political subtext.” (see Figure 4) … “Even if we limit ourselves to news, the domain is still not clear. Is news about a bank robbery relevant political information? What about Hillary Clinton’s haircut? To the extent that each of these affect people’s attitudes, they are arguably pertinent. “

  • One of the challenges with inferring consumption based on domain level data is that domain level data are crude. Going to http://nytimes.com is not the same as reading political news. And measurement error may vary by the kind of person. For instance, say we label http://nytimes.com as political news. For the political junkie, the measurement error may be close to zero. For teetotalers, it may be close to 100% (see more).

  • Show people a few news headlines along with the news source (you can randomize the source). What can you learn from a few such ‘trials’? You cannot learn what proportion of news they get from a particular source. You can learn their preferences, but not reliably. More from the paper: “Given the problems with self-reports, survey instruments that rely on behavioral measures are plausibly better. … We coded congeniality trichotomously: congenial, neutral, or uncongenial. The correlations between trials are alarmingly low. The polychoric correlation between any two trials range between .06 to .20. And the correlation between choosing political news in any two trials is between -.01 and .05.”

  • Following up on the previous point: preference for a source which has a mean slant != preference for slanted news. “Current measures of [selective exposure] are beset with five broad problems. First is conceptual errors. For instance, people frequently equate preference for information from partisan sources with a preference for congenial information.”

Computing Optimal Cut-Offs

7 Oct

Probabilities from classification models can have two problems:

  1. Miscalibration: A p of .9 often doesn’t mean a 90% chance of 1 (assuming a dichotomous y). (You can calibrate it using isotonic regression.)

  2. Optimal cut-offs: For multi-class classifiers, we do not know what probability value will maximize the accuracy or F1 score. Or any metric for which you need to trade off between FP and FN.

One way to solve #2 is to run the true labels (out of sample, otherwise there is concern about bias) and probabilities through a brute-force optimizer that gives you the optimal cut-off for the metric. Here’s the script for doing the same along with an illustration.
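For intuition, here is a minimal sketch of the brute-force idea for the binary case, assuming out-of-sample true labels and predicted probabilities. It is an illustration, not the linked script; the grid, metric, and made-up data are my choices.

```python
# A minimal sketch: sweep cut-offs and keep the one that maximizes the metric.
import numpy as np
from sklearn.metrics import f1_score

def optimal_cutoff(y_true, y_prob, metric=f1_score, grid=None):
    """Return the cut-off in `grid` that maximizes `metric` out of sample."""
    grid = np.linspace(0.01, 0.99, 99) if grid is None else grid
    scores = [metric(y_true, (y_prob >= c).astype(int)) for c in grid]
    return grid[int(np.argmax(scores))], max(scores)

# Example with made-up data:
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, 1000), 0, 1)
print(optimal_cutoff(y_true, y_prob))
```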

Online Learning With Biased Sampling

3 Oct

Say that you train a model to predict who will click on an ad. Say that you deploy the model to only show ads to people who are likely to click on them. (For a discussion about the optimal strategy for who to show ads to, see here.) And say you use the clicks from the people who see the ad to continue to tune the parameters. (This is a close approximation of a standard implementation of online learning in online advertising.)

In effect, once you launch the model, you only get data from a biased set of users. Such sampling bias can be a problem when the data generating process (how the 1s and the 0s are generated) changes in such a way that changes above the threshold (among the kinds of people we get data from) are uncorrelated with changes below the threshold (among the people we do not get data from). The concerning aspect is that if this happens, the model continues to “work,” in that the accuracy can continue to be high even as recall (the proportion of people for whom the ad is relevant) becomes lower over time. There is only one surefire way to diagnose the issue and address it: continue to collect some data from people below the threshold and learn whether the data generating process is changing.
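Here is a minimal sketch of what that can look like at serving time: keep serving (and labeling) a small random fraction of below-threshold users so the drift check has data to work with. The threshold, the exploration rate, and the function names are illustrative.

```python
import random

THRESHOLD = 0.5   # serve ads to users scored above this (illustrative)
EPSILON = 0.02    # share of below-threshold users to keep serving for monitoring

def serve_decision(score):
    """Return (serve, is_exploration_sample)."""
    if score >= THRESHOLD:
        return True, False
    explored = random.random() < EPSILON
    return explored, explored

# Downstream, track outcomes among the exploration samples over time: if the
# click rate below the threshold drifts, the model needs retraining even if
# accuracy above the threshold still looks fine.
```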

Some Facts About Indian Polling Stations

27 Sep

Of the 748,584 polling stations for which we have self-reported data on building conditions, nearly 24% report having Internet. A similar number report having “Landline Telephone/Fax Connection.”

97.7% report having toilets for men and women.

2.6% report being in a “dilapidated or dangerous” building.

93.2% report having ramps for the disabled. 98.3% report having “proper road connectivity.” Nearly 4% report being located at a place where the “voters have to cross river/valley/ravine or natural obstacle to reach PS.”

92% of the polling stations are located in “Govt building/Premises.” And 11.4% are reportedly located in “an institution/religious place.”

8% report having a “political party office situated within 200 meters of PS premises.”

For underlying data and scripts, see here.

Growth Funnels

24 Sep

You spend a ton of time building a product to solve a particular problem. You launch. Then, either the kinds of people whose problem you are solving never arrive or they arrive and then leave. You want to know why because if you know the reason, you can work to address the underlying causes.

Often, however, businesses only have access to observational data. And since we can’t answer the why with observational data, people have swapped the why with the where. Answering the where can give us a window into the why (though only a window). And that can be useful.

The traditional way of posing ‘where’ is called a funnel. Conventionally, funnels start when the customer arrives on the website. We will forgo convention and start at the top.

Conditional on the product you have built, there are only three things you should work to do optimally:

  1. Help people discover your product
  2. Effectively convey the relevant value of the product to those who have discovered your product
  3. Help people effectively use the product

p.s. When the product changes, you have to do all three all over again.

One way funnels can potentially help is triage. How big is the ‘leak’ at each ‘stage’? The funnel over the first two steps is: of the people who discovered the product, how many did we successfully communicate the relevant value of the product to? Posing the problem in such a way makes it seem more powerful than it is. To come up with a number, you generally only have noisy behavioral signals, for instance, the proportion of people who visit the site who sign up. But a low proportion could be due to a large denominator—lots of people are visiting the site but the product is not relevant for most of them. (If bringing people to the site costs nothing, there is nothing to do.) Or it could be because you have a super kludgy sign-up process. You could drill down to try to get at such clues, but the number of potential locations for drilling remains large.

That brings us to the next point. Macro-triaging is useful but only for resource allocation kinds of decisions based on some assumptions. It doesn’t provide a way to get concrete answers to real issues. For concrete answers, you need funnels around concrete workflows. For instance, for a referral ‘product’ for AirBnB, one way to define steps (based on Gustaf Alstromer) is as follows:

  1. how many people saw the link asking people to refer
  2. of the people who saw the link, how many clicked on it
  3. of the people who clicked on it, how many invited people
  4. of the people who were invited, how many signed up (as a user, guest, host)
  5. of the people who signed up, how many made the first booking

Such a funnel allows you to optimize the workflow by experimenting with the user interface at each stage. It allows you to analyze what the users were trying to do but failed to do, or took a long time doing. It also allows you to analyze how optimally (from the company’s perspective) users are taking an action. For instance, people may want to invite all their friends but the UI doesn’t have a convenient way to import contacts from email.
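To make the triage concrete, here is a small sketch of computing step-wise conversion for a funnel like the one above. The events table, actor ids, and step names are made up; later steps are counted over invitees rather than referrers.

```python
import pandas as pd

steps = ["saw_referral_link", "clicked_link", "sent_invite",
         "invitee_signed_up", "invitee_first_booking"]

# One row per (actor, event); ids 100+ stand in for invitees.
events = pd.DataFrame({
    "actor_id": [1, 1, 1, 2, 2, 3, 101, 101],
    "event":    ["saw_referral_link", "clicked_link", "sent_invite",
                 "saw_referral_link", "clicked_link", "saw_referral_link",
                 "invitee_signed_up", "invitee_first_booking"],
})

counts = [events.loc[events.event == s, "actor_id"].nunique() for s in steps]
for (prev, curr), (n_prev, n_curr) in zip(zip(steps, steps[1:]), zip(counts, counts[1:])):
    rate = n_curr / n_prev if n_prev else float("nan")
    print(f"{prev} -> {curr}: {n_curr}/{n_prev} = {rate:.0%}")
```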

Sometimes just writing out the steps in a workflow can be useful. It allows people to think about whether some steps are actually needed. For instance, is signing in needed before we allow a user to understand the value of the product?

Conscious Uncoupling: Separating Compliance from Treatment

18 Sep

Let’s say that we want to measure the effect of a phone call encouraging people to register to vote on voting. Let’s define compliance as a person taking the call (like they do in Gerber and Green, 2000, etc.). And let’s assume that the compliance rate is low. The traditional way to estimate the effect of the phone call is via an RCT: randomly split the sample into Treatment and Control, call everyone in the Treatment Group, wait till after the election, and calculate the difference in the proportion who voted. Assuming that the treatment doesn’t affect non-compliers, etc., we can also estimate the Complier Average Treatment Effect.

But one way to think about non-compliance in the example above is as follows: “Buddy, you need to reach these people using another way.” That is a useful thing to know, but it is an observational point. You can fit a predictive model for who picks up phone calls and who doesn’t. The experiment is useful in answering how much you can persuade the people you reach on the phone. And you can learn that by randomizing conditional on compliance.

For such cases, here’s what we can do:

  1. Call a reasonably large random sample of people. Learn a model for who complies.
  2. Use it to target people who are likelier to comply and randomize after a person picks up.

More generally, the Average Treatment Effect is useful for global rollouts of one policy. But when is that a good counterfactual to learn? Tautologically, when that is all you can do or when it is the optimal thing to do. If we are not in that world, why not learn about—and I am using an example to be concrete—a) what is a good way to reach me? b) what message most persuades me? For instance, for political campaigns, the optimal strategy is to estimate the cost of reaching people by phone, mail, f2f, etc., estimate the probability of reaching each using each of the media, estimate the payoff for different messages for different kinds of people, and then target using the medium and the message that delivers the greatest benefit. (For a discussion about targeting, see here.)

Technically, a message could have the greatest payoff for the person who is least likely to comply. And the optimal strategy could still be to call everyone. To learn treatment effects among people who are unlikely to comply (using a particular method), you will need to build experiments to increase compliance. More generally, if you are thinking about multi-arm bandits or some such dynamic learning system, the insight is to have treatment arms around both compliance and message. The other general point, implicit in the essay, is that rather than be fixated on calculating ATE, we should be fixated on an optimization objective, e.g., the additional number of people persuaded to turn out to vote per dollar.
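To make the optimization framing concrete, here is a toy sketch that picks, for each person, the (medium, message) pair with the highest expected turnout effect per dollar. All of the reach probabilities, message effects, and costs are made up for illustration.

```python
# Toy sketch: choose the medium and message maximizing expected effect per dollar.
media = {"phone": {"p_reach": 0.30, "cost": 4.0},
         "mail":  {"p_reach": 0.90, "cost": 1.5},
         "f2f":   {"p_reach": 0.15, "cost": 30.0}}

# Estimated persuasion effect (added probability of turning out) per message.
message_effects = {"register_reminder": 0.02, "social_pressure": 0.04}

def best_action(person_media=media, person_messages=message_effects):
    """Return the (medium, message) with the largest expected effect per dollar."""
    best, best_value = None, float("-inf")
    for medium, m in person_media.items():
        for message, effect in person_messages.items():
            value = (m["p_reach"] * effect) / m["cost"]
            if value > best_value:
                best, best_value = (medium, message), value
    return best, best_value

print(best_action())
```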

Sidebar

It is useful to think about the cost and benefit of an incremental voter. Let’s say you are a strategist for party p given the task of turning out voters. Here’s one way to think about the problem:

  1. The benefit of turning out a voter in an election is not limited to the election. It also increases the probability of them turning out in the next election. The benefit is pro-rated by the voter’s probability of voting for party p.

  2. The cost of turning out a voter is a sum of targeting costs and persuasion costs. The targeting costs could be the cost of identifying voters who are unlikely to vote unless contacted but who would likely vote for party p; you could also build a model for persuadability and target further based on that. The persuasion costs include the cost of contacting the voter and persuading the voter.

  3. The cost of turning out a voter is likely greater than the cost of voting. For instance, some campaigns spend $150, some others think it is useful to spend as much as $1000. If cash transfers were allowed, we should be able to get people to vote at much lower prices. But given cash transfers aren’t allowed, the only option is persuasion and that is generally expensive.

Prediction Errors: Using ML For Measurement

1 Sep

Say you want to measure how often people visit pornographic domains over some period. To measure that, you build a model to predict whether or not a domain hosts pornography. And let’s assume that for the chosen classification threshold, the False Positive rate (FP) is 10% and the False Negative rate (FN) is 7%. Below, we discuss some of the concerns with using scores from such a model and ways to address the issues.

Let’s get some notation out of the way. Let’s say that we have n users and that we can iterate over them using i. Let’s denote the total number of unique domains—domains visited by any of the n users at least once during the observation window—by k. And let’s use j to iterate over the domains. Let’s denote the number of visits to domain j by user i by c_{ij} \in {0, 1, 2, ...}. And let’s denote the total number of unique domains a person visits (\sum_j 1(c_{ij} >= 1)) by t_i. Lastly, let’s denote the predicted labels about whether or not each domain hosts pornography by p, so we have p_1, ..., p_j, ..., p_k.

Let’s start with a simple point. Say there are 5 domains with predicted labels p_1 = p_2 = p_3 = p_4 = p_5 = 1. Let’s say user one visits the first three sites once and let’s say that user two visits all five sites once. Given 10% of the predictions are false positives, the expected measurement error in user one’s score = 3 * .10 and the expected measurement error in user two’s score = 5 * .10. The general point is that the total number of false positives increases with the number of predicted 1s. And the total number of false negatives increases with the number of predicted 0s.
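One way to use the rates to de-bias per-user counts is the standard misclassification correction sketched below. Note the assumption: here FP is read as P(predicted 1 | true 0) and FN as P(predicted 0 | true 1); the function and numbers are illustrative, not the paper’s method.

```python
# A minimal sketch of a misclassification correction for per-user counts,
# assuming fp_rate = P(predicted 1 | true 0) and fn_rate = P(predicted 0 | true 1).

def corrected_count(n_predicted_1, n_total, fp_rate=0.10, fn_rate=0.07):
    """Estimate the true number of, say, porn domains a user visited.

    E[n_predicted_1] = n_true_1 * (1 - fn_rate) + (n_total - n_true_1) * fp_rate,
    so solving for n_true_1 gives the expression below.
    """
    return (n_predicted_1 - n_total * fp_rate) / (1 - fn_rate - fp_rate)

# Example: a user who visited 5 unique domains, 3 of which were predicted 1.
print(corrected_count(n_predicted_1=3, n_total=5))
```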

Read more here.

Technical Leadership: Building Great Teams and Products

31 Aug

Companies bet on people. They bet that the people they have hired will figure out the problems and their solutions. To make the bet credible, the golden rule is to hire competent people. Other virtues matter, but competence generally matters the most in technical roles.

Once you have made the bet, you must double down on it. Trust the people you hire. And earn their trust by being open, empathetic, transparent, thoughtful, self-aware, a good listener, and excellent at your job. Once people are sure that you are there to enable them and help them grow, increasing their productivity becomes a lot easier.

After laying the foundation of trust, you need to build culture. The important aspects of culture are: 

  1. inclusion
  2. intellectual openness 
  3. technical excellence 
  4. focus on the business problem
  5. interest in doing the right thing
  6. accountability
  7. hard work  

To build a culture of technical excellence and intellectual openness, I use the following techniques:

  • Probe everything. Ask why. Ask why again.
  • Ask people to explain things as simply as possible.
  • For anything complex, ask people to write things down in plain English.
  • Get perspective from people with different technical competencies.
  • Frontload your thinking on a project.
  • Lead a meeting to ‘think together’ about projects where tough, thoughtful probing is encouraged.
  • Establish guidelines for coding excellence and peer review each major PR. 
  • Lead a meeting where different members of the team teach each other — I call it a ‘learn together.’

I also work hard to build a shared understanding, and to make sure that the team has a place to share and reflect on it. Building a shared understanding of the space helps drive clarity about actions. It also provides a jumping-off point for planning. Without knowing what you know, you start each planning cycle like it is Groundhog Day.

I actively use Notion to think through the broader problem and the sub-problems and spend time organizing and grooming it. The aim is to build something that inspires others to contribute to that shared understanding.

I encourage active discussion and shared ownership. We own each page as a team, if not as a company. But eventually, one person is listed as a primary point of contact who takes the responsibility of keeping it updated.

To drive accountability, I drive transparency. For that, I use the following methods:

  1. Shared folders: A shared team Google Drive folder with project-level folders nested within it. I dump most documents that are sent to me in the shared folder and organize the folder periodically. There is even a personal growth folder where I store things like ‘how to write an effective review,’ etc.
  2. Be a Good Router: I encourage people to err toward spamming. I catch whatever falls through the cracks by 1. forwarding relevant emails, 2. posting attachments to Google Drive, and 3. posting to the relevant section on Notion.
  3. Visibility on intermediate outputs: I encourage everyone to post intermediate outputs, questions, concerns, etc. to the team channel, and encourage people to share their final outputs widely.
  4. Use PPP: Each member of the team shares a Progress, Plans, and Problems each week along with links to outputs. 

Culture allows you to produce great things if you have the right process and product focus. There are some principles of product and process management that I follow. My tenets of product management are: 

  • Formalize the business problem 
  • Define the API before moving too far out.
  • Scope and execute a V1 before executing on the full solution. Give yourself the opportunity to learn from the world.
  • Do less. Do well.

The process for product development that I ask the team to follow can be split into planning, execution, and post-execution.

Planning: What to Work On?

  1. Try to understand the question and the use case as best as you can. Get as much detail as possible.
  2. Talk to at least one more person on the team and ask them to convince you why you shouldn’t do this, followed by: if you are doing this, what else am I missing?
  3. For larger projects, write a document and get feedback from all the stakeholders.
  4. As needed, hold a think together with a diverse, cross-functional group. Keep the group small; generally, I have found ~5-8 is the sweet spot.
  5. Get an ROI estimate and rank against other ideas.

How to execute?

  1. Own your project on whatever project management software you use.
  2. Own communication on where things are on the project, new discoveries, new challenges, etc.
  3. Ask people to review each commit. Treat reviewing seriously. Generally, the review process that works well is people explaining each line to another person. In the act of explaining, you catch your hidden assumptions, etc.
  4. Come up with at least a few sanity checks for your code.

Communicating What You Learn, Within and Outside the Team

  1. Share discoveries aggressively. Err on the side of over-communication.
  2. Contribute to Notion so that there is always a one-stop-shop around how we understand X. 
  3. Create effective presentations

Comparing Ad Targeting Regimes

30 Aug

Ad targeting is often useful when you have multiple things to sell (opportunity cost) or when the cost of running an ad is non-trivial or when an irrelevant ad reduces your ability to reach the user later or any combination of the above. (For a more formal treatment, see here.)

But say that you want proof—you want to estimate the benefit of targeting. How would you do it?

When there is one product to sell, some people have gone about it as follows: randomize to treatment and control, show the ad to a random subset of respondents in the control group and an equal number of respondents picked by a model in the treatment group, and compare the outcomes of the two groups (it reduces to comparing subsets unless there are spillovers). This experiment can be thought of as a way to estimate how to spend a fixed budget optimally. (In this case, the budget is the number of ads you can run.) But if you were interested in finding out whether a budget allocated by a model would be more optimal than, say, random allocation, you don’t need an experiment (unless there are spillovers). All you need to do is show the ad to a random set of users. For each user, you know whether or not they would have been selected to see an ad by the model. And you can use this information to calculate payoffs for the respondents chosen by the model, and for the randomly selected group.

Let me expand for clarity. Say that you can measure profit from ads using CTR. Say that we have built two different models for selecting people to whom we should show ads—Model A and Model B. Now say that we want to compare which model yields a higher CTR. We can have four potential scenarios for selection of respondents by the model:

| model_a | model_b |
|---------|---------|
| 0       | 0       |
| 1       | 0       |
| 0       | 1       |
| 1       | 1       |

For CTR, the 0-0 cell doesn’t add any information: CTR is a probability conditional on being shown the ad. To measure which of the models is better, draw a fixed-size random sample of users picked by model_a and another random sample of the same size from users picked by model_b and compare CTR. (The same user can be picked by both models. It doesn’t matter.)
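Here is a minimal sketch of that comparison on a randomly served sample; the data, the fixed sample size, and the column names are illustrative.

```python
import pandas as pd

# Clicks from a randomly served sample, plus whether each model would have
# picked the user.
df = pd.DataFrame({
    "clicked": [1, 0, 0, 1, 0, 1, 0, 0],
    "model_a": [1, 1, 0, 1, 0, 0, 1, 0],
    "model_b": [0, 1, 0, 1, 1, 1, 0, 0],
})

n = 3  # fixed sample size drawn from each model's picks
ctr_a = df.loc[df.model_a == 1, "clicked"].sample(n=n, random_state=0).mean()
ctr_b = df.loc[df.model_b == 1, "clicked"].sample(n=n, random_state=0).mean()
print(f"CTR under model A: {ctr_a:.2f}; under model B: {ctr_b:.2f}")
```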

Now that we know what to do, let’s understand why experiments are wasteful. The heuristic account is as follows: experiments are there to compare ‘similar people.’ When estimating allocative efficiency of picking different sets of people, we are tautologically comparing different people. That is the point of the comparison.

All this still leaves the question of how we would measure the benefit of targeting. If you had only one ad to run and wanted to choose between showing the advertisement to everyone versus fewer people, then show the ad to everyone and compare the profits from the rows the model would have selected with the profits from showing the ad to everyone. Generally, showing the ad to everyone will win.

If you had multiple ads, you would need to randomize. Assign each person in the treatment group to a targeted ad. In the control group, you could show an ad for a random product. Or you could show an advertisement for any one product that yields the maximum revenue. Pick whichever number is higher as the one to compare against.

What’s Relevant? Learning from Organic Growth

26 Aug

Say that we want to find people to whom a product is relevant. One way to do that is to launch a small campaign advertising the product and learn from people who click on the ad, or better yet, learn from people who not just click on the ad but go and try out the product and end up using it. But if you didn’t have the luxury of running a small campaign and waiting a while, you can learn from organic growth.

Conventionally, people learn from organic growth by posing it as a supervised problem. And they generate the labels as follows: people who have ‘never’ (mostly: in the last 6–12 months) used the product are labeled as 0 and people who “adopted” the product in the latest time period, e.g., over the last month, are labeled 1. People who have used the product in the last 6–12 months or so are filtered out.

There are three problems with generating labels this way. First, not all the people who ‘adopt’ a product continue to use the product. Many of the people who try it find that it is not useful or find the price too high and abandon it. This means that a lot of 1s are mislabeled. Second, the cleanest 1s are the people who ‘adopted’ the product some time ago and have continued to use it since. Removing those is thus a bad idea. Third, the good 0s are those who tried the product but didn’t persist with it, not those who never tried the product. Generating the labels this way (1s are adopters who persisted, 0s are adopters who churned) also allows you to mitigate one of the significant problems with learning from organic growth: people who organically find a product are different from those who don’t. Here, you are subsetting on the kinds of people who found the product, except that one group found it useful and the other did not. This empirical strategy has its problems, but it is distinctly better than the conventional approach.
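Here is a small sketch of constructing labels along these lines; the column names and time windows are made up.

```python
import pandas as pd

usage = pd.DataFrame({
    "user_id":             [1, 2, 3, 4, 5],
    "first_used_days_ago": [200, 180, 90, 40, None],   # None = never tried
    "active_last_30_days": [True, False, True, False, False],
})

# Keep only people who tried the product long enough ago to judge persistence.
tried = usage[usage.first_used_days_ago >= 60].copy()
tried["label"] = tried.active_last_30_days.astype(int)   # 1 = persisted, 0 = churned
print(tried[["user_id", "label"]])
```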

Sometimes Scientists Spread Misinformation

24 Aug

To err is human. Good scientists are aware of that, painfully so. The model scientist obsessively checks everything twice over and still keeps eyes peeled for loose ends. So it is a shock to learn that some of us are culpable for spreading misinformation.

Ken and I find that articles with serious errors, even articles based on fraudulent data, continue to be approvingly cited—cited without any mention of any concern—long after the problems have been publicized. Using a novel database of over 3,000 retracted articles and over 74,000 citations to these articles, we find that at least 31% of the citations to retracted articles happen a year after the publication of the retraction notice. And that over 90% of these citations are approving.

What gives our findings particular teeth is the role citations play in science. Many, if not most, claims in a scientific article rely on work done by others. And scientists use citations to back such claims. The readers rely on scientists to note any concerns that impinge on the underlying evidence for the claim. And when scientists cite problematic articles without noting any concerns, they very plausibly misinform their readers.

Though 74,000 is a large enough number to be deeply concerning, retractions are relatively infrequent. And that may lead some people to discount these results. Retractions may be infrequent but citations to retracted articles post-retraction are extremely revealing. Retractions are a low-low bar. Retractions are often a result of convincing evidence of serious malpractice, generally fraud or serious error. Anything else, for example, a serious error in data analysis, is usually allowed to self-correct. And if scientists are approvingly citing retracted articles after they have been retracted, it means that they have failed to hurdle the low-low bar. Such failure suggests a broader malaise.

To investigate the broader malaise, Ken and I exploited data from an article published in Nature that notes a statistical error in a series of articles published in prominent journals. Once again, we find that approving citations to erroneous articles persist after the error has been publicized. After the error has been publicized, the rate of citation to erroneous articles is, if anything, higher, and 98% of the citations are approving.

In all, it seems, we are failing.

The New Unit of Scientific Production

11 Aug

One fundamental principle of science is that there is no privileged observer. You get to question what people did. But to question, you first must know what people did. So part of good scientific practice is to make it easy for people to understand how the sausage was made—how the data were collected, transformed, and analyzed—and ideally, why you chose to make the sausage that particular way. Papers are ok places for describing all this, but we now have better tools: version controlled repositories with notebooks and readme files.

The barrier to understanding is not just lack of information, but also poorly organized information. There are three different arcs of information: cross-sectional (where everything is and how it relates to each other), temporal (how the pieces evolve over time), and inter-personal (who is making the changes). To be organized cross-sectionally, you need to be macro organized (where is the data, where are the scripts, what do each of the scripts do, how do I know what the data mean, etc.), and micro organized (have logic and organization to each script; this also means following good coding style). Temporal organization in version control simply requires you to have meaningful commit messages. And inter-personal organization requires no effort at all, beyond the logic of pull requests.

The obvious benefits of this new way are known. But what is less discussed is that this new way allows you to critique specific pull requests and decisions made in certain commits. This provides an entirely new way to make progress in science. The new unit of science also means that we don’t just dole out credit in a crude currency like journal articles; we can also provide lower denominations. We can credit each edit, each suggestion. And why not? The third big benefit is that we can build epistemological trees where the logic of disagreement is clear.

The dead tree edition is dead. It is also time to retire the e-version of the dead tree edition.

Quality Data: Plumbing ML Data Pipelines

6 Aug

What’s the difference between a scientist and a data scientist? Scientists often collect their own data, and data scientists often use data collected by other people. That is part jest but speaks to an important point. Good scientists know their data. Good data scientists must know their data too. To help data scientists learn about the data they use, we need to build systems that give them good data about the data. But what is good data about the data? And how do we build systems that deliver that? Here’s some advice (tailored toward rectangular data for convenience):

  • From Where, How Much, and Such
    • Provenance: how were each of the columns in the data created (obtained)? If the data are derivative, find out the provenance of the original data. Be as concrete as possible, linking to scripts, related teams, and such.
    • How frequently are the data updated?
    • Cost per unit of data, e.g., a cell in rectangular data.
    Both the frequency with which the data are updated and the cost per unit of data may change over time. Provenance may change as well: a new team (or person) may start managing the data. So the person who ‘owns’ the data must come back to these questions every so often. Come up with a plan.
  • What? To know what the data mean, you need a data dictionary. A data dictionary explains the key characteristics of the data. It includes:
    1. Information about each of the columns in plain language.
    2. How were the data collected? For instance, if you conducted a survey, you need the question text and the response options (if any) that were offered, along with the ‘mode’, where in the sequence of questions this one lies, whether it was alone on the screen, etc.
    3. Data type
    4. How (if at all) are missing values generated?
    5. For integer columns, it gives the range, sd, mean, median, n_0s, and n_missing. For categorical, it gives the number of unique values, what each label means, and a frequency table that includes n_missing (if missing can be of multiple types, show a row for each).
    6. The number of duplicates in the data, whether they are allowed, and why you would see them.
    7. Number of rows and columns
    8. Sampling
    9. For supervised models, store correlation of y with key x_vars
  • What If? What if you have a question? Who should you bug? Who ‘owns’ the ‘column’ of data?

Store these data in JSON so that you can validate against them. Produce the JSON for each update. You can flag when data are some s.d. above or below the last ingest.
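A minimal sketch of what that can look like, with illustrative column types, thresholds, and summaries:

```python
import json
import pandas as pd

def summarize(df):
    """Produce a per-column summary suitable for dumping to JSON."""
    summary = {}
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            summary[col] = {"type": "numeric", "mean": float(s.mean()),
                            "sd": float(s.std()), "n_missing": int(s.isna().sum())}
        else:
            summary[col] = {"type": "categorical",
                            "n_unique": int(s.nunique(dropna=True)),
                            "n_missing": int(s.isna().sum())}
    return summary

def flag_drift(current, previous, n_sd=3.0):
    """Flag numeric columns whose mean moved more than n_sd previous s.d."""
    flags = []
    for col, cur in current.items():
        prev = previous.get(col)
        if prev and cur["type"] == "numeric" == prev["type"] and prev["sd"] > 0:
            if abs(cur["mean"] - prev["mean"]) > n_sd * prev["sd"]:
                flags.append(col)
    return flags

# Usage: dump a summary per ingest, then compare consecutive ingests.
print(json.dumps(summarize(pd.DataFrame({"x": [1.0, 2.0, None]})), indent=2))
```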

Store all this metadata with the data. For example, you can extend the dataframe class in Scala to make it so.

Auto-generate reports in markdown with each ingest.

In many ML applications, you are also ingesting data back from the user. So you need the same as above for the data you are getting from the user (and some of it at least needs to match the stored data). 

For any derived data, you need the scripts and the logic, ideally in a notebook. This is your translation function.

Where possible, follow the third normal form of databases. Only store translations when translation is expensive. Even then, think twice.

Lastly, some quality control. Periodically sit down with your team to see if you should see what you are seeing. For instance, if you are in the survey business, do the completion times make sense? If you are doing supervised learning, get a random sample of labels. Assess their quality. You can also assess the quality by looking at errors in classification that your supervised model makes. Are the errors because the data are mislabeled? Keep iterating. Keep improving. And keep cataloging those improvements. You should be able to ‘diff’ data collection, not just numerical summaries of data. And with the method I highlight above, you should be able to.

Operating Efficiently: Thumb Rules for Increasing Operational Efficiency

5 Aug

ABC ships cereal to people. ABC has a large operations team that handles customer complaints, e.g., “I got the wrong kind of cereal,” “the cereal was too old,” “the cereal arrived too late,” etc., and custom requests, e.g., “I would like seventy custom boxes shipped to a company retreat”, “I would like the delivery date to be changed,” etc. ABC is interested in providing customer service at a lower cost. What are its options? Here are some thumb rules:

  1. Prevent Work: 
    1. Prevent complaints from arising. Prevention will cost money so it is tempting to think of it as a trade-off. In the long term, prevention is generally financially beneficial.  
    2. Self-Serve: Build tools that allow customers to self-serve. It can be a win-win.
  2. Convert Externalities to Internalities: What special favors are customers asking that are not part of the price? For instance, are customers contacting you for changing delivery dates? Are you charging them for such changes? Bottom line: do not provide services that people are not willing to pay for.
  3. Staff Appropriately
    1. Forecast different kinds of work (by different work we mean work for which you pay different amounts of money and need to hire different people or train differently), come up with ideal shifts, and set incentives for staying longer or going home sooner when reality doesn’t match the forecast. If you can forecast months in advance, it can inform your hiring or ‘right-sizing’ plans.
    2. Reduce Specialization: One thing that gets in the way of reducing staffing is having a lot of specialization.
    3. Smooth Work by Separating Urgent from Non-Urgent Work: Say that a lot of work arrives in a narrow window. Not all of it is urgent. Build tools like ‘call me back’ to deal with non-essential work.  
    4. Simplify Work: Make sure that you don’t need to train people a lot to do the work.
  4. Make People More Efficient
    1. Train: Train people so that they can get more done per unit of time.
    2. Incentivize: Make sure workers and managers are optimally incentivized.
    3. Better Tools and Processes: Invest in tools and processes that help people do the job quicker. For instance, building tools that allow you to seamlessly transfer work between shifts by conveying all the relevant info. 
    4. Prioritize Work: For the same resources, one way to provide better quality is to prioritize work correctly.
  5. Hire more efficient people and fire inefficient people.
  6. Reduce Work: Automate work that can be automated. It includes semi-automation: automating portions of work.

Optimal Sequence in Which to Service Orders

27 Jul

What is the optimal order in which to service orders assuming a fixed budget?

Let’s assume that we have to service orders o_1, ..., o_n, with the n orders iterated by i. Let’s also assume that for each service order, we know how the costs change over time. For simplicity, let’s assume that time is discrete and partitioned into units of days. If we service order o_i at time t, we expect the cost to be c_it. Each service order also has an expiration time, j, after which the order cannot be serviced. The cost at the expiration time, j, is the cost of failure and is denoted by c_ij.

The optimal sequence of servicing orders is determined by expected losses—service first the order where the expected loss is the greatest. This leaves us with the question of how to estimate the expected loss at time t. To come up with an expectation, we need to sum over some probability distribution. For order o_i at time t, we need the probability, p_is, that we would service o_i at each later time s, from t+1 until the expiration time j, and we need to multiply each p_is by the corresponding cost c_is. So framed, the expected loss from deferring order i at time t =
\sum_{s=t+1}^{j} p_is * c_is – c_it

However, determining these probabilities is not straightforward. New items are added to the queue at t+1. On the flip side, we also get to re-prioritize at t+1. The question is whether we will get to item o_i at t+1. (Under the simplification below, the probability is 0 or 1.) For that, we need to forecast the kinds of items in the queue tomorrow. One simplification is to assume that the items in the queue today are the same ones that will be in the queue tomorrow. Then, the problem reduces to estimating the cost of punting each item again tomorrow, sorting based on the costs at t+1, and checking whether we will get to clear each item. (We can forgo the simplification by forecasting our queue tomorrow, and each day after that till j, for each item, and calculating the costs.)

If the data are available, we can tack on the clearing time per order and get a better answer to whether we will clear o_i at time t or not.
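Under the simplification above (today’s queue is tomorrow’s queue and we will get to whatever we defer), the prioritization reduces to sorting by the one-day deferral loss. A toy sketch with made-up costs:

```python
# cost_by_day[order] = expected servicing cost today, tomorrow, ...,
# ending with the cost of failure at expiration. Numbers are illustrative.
cost_by_day = {
    "order_1": [10, 12, 30],
    "order_2": [10, 25, 40],
    "order_3": [5, 6, 7],
}

def deferral_loss(costs):
    """Expected loss of pushing an order from today to tomorrow,
    assuming we will get to it tomorrow (probability 1)."""
    return costs[1] - costs[0]

priority = sorted(cost_by_day, key=lambda o: deferral_loss(cost_by_day[o]), reverse=True)
print(priority)   # ['order_2', 'order_1', 'order_3']
```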

Pride and Prejudice

14 Jul

It is ‘so obvious’ that policy A >> policy B that only those who don’t want to know or who want inferior things would support policy B. Does this conversation remind you of any that you have had? We don’t just have such conversations about policies. We also have them about people. Way too often, we are being too harsh.

We overestimate how much we know. We ‘know’ that we are right. We ‘know’ that there isn’t enough information in the world that would make us switch to policy B. Often, the arrogance of this belief is lost on us. As Kahneman puts it, we are ‘ignorant of our own ignorance.’ How could it be anything else? Remember the aphorism, “the more you know, the more you know you don’t know”? The aphorism may not be true, but it gets the broad point right. The ignorant are overconfident. And we are ignorant. The human condition is such that it doesn’t leave much room for being anything else (see the top of this page).

Here’s one way to judge your ignorance (see here for some other ideas). Start by recounting what you know. Sit in front of a computer and type it up. Go for it. And then add a sentence about how you know. Do you recall reading any detailed information about this person or issue? From where? Would you have bought a car if that were all the information you had about it?

We don’t just overestimate what we know, we also underestimate what other people know. Anybody with different opinions must know less than I do. It couldn’t be that they know more, could it?

Both being overconfident about what we know and underestimating what other people know lead to the same thing: being too confident about the rightness of our cause and mistaking our prejudices for obvious truths.

George Carlin got it right. “Have you ever noticed that anybody driving slower than you is an idiot, and anyone going faster than you is a maniac?” It seems the way we judge drivers is how we judge everything else. Anyone who knows less than you is either being willfully obtuse or an idiot. And those who know more than you just look like ‘maniacs.’

Optimal Sequence in Which to Schedule Appointments

1 Jul

Say that you have a travel agency. Your job is to book rooms at hotels. Some hotels fill up more quickly than others, and you want to figure out which hotels to book first so that your net booking rate is as high as it can be given the staff you have.

The logic of prioritization is simple: prioritize those hotels where the expected loss if you don’t book now is the largest. The only thing we need to do is find a way to formalize the losses. Going straight to formalization is daunting. A toy example helps.

Imagine that there are two hotels, Hotel A and Hotel B. If you call 2 days in advance, the chances of successfully booking a room are .8 at both hotels. If you call 1 day in advance, the chances are .8 at Hotel A and .5 at Hotel B. You can only make one call a day, so it is Hotel A or Hotel B. Also, assume that failing to book a room at Hotel A and at Hotel B costs the same.

If you were making a decision 1-day out on which hotel to call to book, the smart thing would be to choose Hotel A. The probability of making a booking is larger. But ‘larger’ can be formalized in terms of losses. Day 0, the probability goes to 0. So you make .8 units of loss with Hotel A and .5 with Hotel B. So the potential loss from waiting is larger for Hotel A than Hotel B.

If you were asked to choose 2-days out, which one should you choose? In Hotel A, if you forgo 2-days out, your chances of successfully booking a room next day are .8. At Hotel B, the chances are .5. Let’s play out the two scenarios. If we choose to book at Hotel A 2-days out and Hotel B 1-day out, our expected batting average is (.8 + .5)/2. If we choose the opposite, our batting average is (.8 + .8)/2. It makes sense to choose the latter. Framed as expected losses, we go from .8 to .8 or 0 expected loss for Hotel A and .3 expected loss for Hotel B. So we should book Hotel B 2-days out.

Now that we have the intuition, let’s move to 3-days, 2-days, and 1-day out as that generalizes to k-days out nicely. To understand the logic, let’s first work out a 101 probability question. Say that you have two fair coins that you toss independently. What is the chance of getting at least one head? The potential options are HH, HT, TH, and TT. The chance is 3/4. Or 1 minus the chance of getting a TT (or two failures): 1 – .5*.5 = .75.

The 3-days out example is next; see the table below. If you miss the chance of calling Hotel A 3-days out, the expected loss is the decline in the chance of successfully booking 2-days or 1-day out. Assume that the probabilities 2-days out and 1-day out are independent and it becomes something similar to the example about coins. The probability of successfully booking 2-days or 1-day out is thus 1 – the probability of failing on both days. Calculate the expected losses for each hotel and now you have a way to decide which hotel to call on Day 3.

|       | 3-day | 2-day | 1-day |
|-------|-------|-------|-------|
| Hotel A | .9    | .9    | .4    |
| Hotel B | .9    | .9    | .9    |

In our example, the numbers for Hotel A and Hotel B come to 1 – (1/10)*(6/10) = .94 and 1 – (1/10)*(1/10) = .99 respectively. The fallback chances are worse for Hotel A, so we should call Hotel A 3-days out before we call Hotel B.
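A tiny sketch of the same calculation, using the probabilities from the table above:

```python
# For each hotel, the chance of still getting a room if we skip today is
# 1 minus the product of failure probabilities on the remaining days
# (assuming independence); call the hotel with the lowest fallback chance first.
success = {
    "Hotel A": [0.9, 0.9, 0.4],   # 3-day, 2-day, 1-day out
    "Hotel B": [0.9, 0.9, 0.9],
}

def fallback_success(probs_remaining):
    """P(at least one success on the remaining days)."""
    fail = 1.0
    for p in probs_remaining:
        fail *= (1 - p)
    return 1 - fail

for hotel, probs in success.items():
    print(hotel, round(fallback_success(probs[1:]), 2))
# Hotel A: 1 - .1*.6 = 0.94; Hotel B: 1 - .1*.1 = 0.99 -> call Hotel A first.
```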

Code 44: How to Read Ahler and Sood

27 Jun

This is a follow-up to the hilarious Twitter thread about the sequence of 44s. Numbers in Perry’s 538 piece come from this paper.

First, yes, the 44s are indeed correct. (Better yet, look for yourself.) But what do the 44s refer to? 44 is the average of all the responses. When Perry writes “Republicans estimated the share at 46 percent” (we have similar language in the paper, which is regrettable as it can be easily misunderstood), it doesn’t mean that every Republican thinks so. It may not even mean that the median Republican thinks so. See OA 1.7 for medians, OA 1.8 for distributions, but see also OA 2.8.1, Table OA 2.18, OA 2.8.2, OA 2.11, and Table OA 2.23.

Key points:

1. Large majorities overestimate the share of party-stereotypical groups in the party, except for Evangelicals and Southerners.

2. Compared to what people think is the share of a group in the population, people still think the share of the group in the stereotyped party is greater. (But how much more varies a fair bit.)

3. People also generally underestimate the share of counter-stereotypical groups in the party.

Automating Understanding, Not Just ML

27 Jun

Some of the most complex parts of Machine Learning are largely automated. The modal ML person types in simple commands for very complex operations and voila! Some companies, like Microsoft (Azure) and DataRobot, also provide a UI for this. And this has generally not turned out well. Why? Because this kind of system does too little for the modal ML person and expects too much from the rest. So the modal ML person doesn’t use it. And the people who do use it, generally use it badly. The black box remains the black box. But not much is needed to place a lamp in this black box. Really, just two things are needed:

1. A data summarization and visualization engine, preferably with some chatbot feature that guides people smartly through the key points, including the problems. For instance, start with univariate summaries, highlighting ranges, missing data, sparsity, and such. Then, if it is a supervised problem, give people a bunch of loess plots or explain the ‘best fitting’ parametric approximations with y in plain English, such as, “people who eat 1 more cookie live 5 minutes shorter on average.”

2. An explanation engine, including what the explanations of observational predictions mean. We already have reasonable implementations of this.

When you have both, you have automated complexity thoughtfully, in a way that empowers people, rather than creating a system that enables people to do fancy things badly.
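To make the first piece concrete, here is a minimal sketch of a univariate summary pass plus a plain-English line for a simple fit; the dataframe and variable names are made up (they echo the cookie example above).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cookies_per_day": np.random.default_rng(0).poisson(2, 500)})
df["minutes_lived_less"] = 5 * df.cookies_per_day + np.random.default_rng(1).normal(0, 2, 500)

# Univariate summaries: ranges, missingness, sparsity.
for col in df.columns:
    s = df[col]
    print(f"{col}: range [{s.min():.1f}, {s.max():.1f}], "
          f"missing {s.isna().mean():.0%}, zeros {(s == 0).mean():.0%}")

# A 'best fitting' linear approximation, explained in plain English.
slope = np.polyfit(df.cookies_per_day, df.minutes_lived_less, 1)[0]
print(f"People who eat 1 more cookie live {slope:.1f} minutes shorter, on average.")
```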