Deliberation as Tautology

18 Jun

We take deliberation to be elevated discussion, meaning at minimum, discussion that is (1) substantive, (2) inclusive, (3) responsive, and (4) open-minded. That is, (1) the participants exchange relevant arguments and information. (2) The arguments and information are wide-ranging in nature and policy implications—not all of one kind, not all on one side. (3) The participants react to each other’s arguments and information. And (4) they seriously (re)consider, in light of the discussion, what their own policy attitudes should be.

Deliberative Distortions?

One way to define deliberation would be: “the extent to which the discussion is substantive, inclusive, responsive, and open-minded.” But here, we state the top end of each dimension as the minimum criterion. So defined, deliberation runs into two issues:

1. Its posited beneficent effects become a near tautology. If the discussion meets that high bar, how could it not refine preferences?

2. The bar for what counts as deliberation is high enough that I doubt that most deliberative mini-publics come anywhere close to meeting the ideal.

The Value of Bad Models

18 Jun

This is not a note about George Box’s quote about models. Neither is it about explainability. The first is trite. And the second is a mug’s game.

Imagine the following: you get hundreds of emails a day, and someone must manually sort which emails are urgent and which are not. The process is time-consuming. So you want to build a model. You estimate that a model with an error rate of 5% or less will save time—the additional work from addressing the erroneous five will be outweighed by the “free” correct classification of the other 95.

Say that you build a model. And if you dichotomize at p = .5, the model accurately classifies 70% of all emails. Even though the accuracy is less than 95%, should we put the model in production?

Often, the answer is yes. When you put such a model in production, it generally saves effort right away. Here’s how. If you get people to (continue to) manually classify the emails that the model is uncertain about, say those with predicted probabilities between .3 and .7, the accuracy of the model on the rest of the rows is generally vastly higher. More generally, you can choose the cut-offs for which emails humans need to code in a way that reduces the error to an acceptable level. And then use a hybrid approach to capitalize on the savings and, like Matthew 22:21, render to the model the region where the model does well, and to humans the rest.
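Here is a minimal sketch of that hybrid routing, assuming a scikit-learn-style classifier with a predict_proba method; the .3 and .7 cut-offs are the hypothetical ones from above and should be tuned on a validation set so that the auto-classified band clears the 95% bar:

```python
import numpy as np

def route_emails(model, X, lower=0.3, upper=0.7):
    """Send the model's uncertain middle band to humans; auto-classify the rest.

    Returns the model's labels for the confident rows and the row indices
    that need manual review.
    """
    p_urgent = model.predict_proba(X)[:, 1]          # P(urgent) for each email
    uncertain = (p_urgent >= lower) & (p_urgent <= upper)
    auto_labels = (p_urgent > upper).astype(int)     # confident calls only
    return auto_labels[~uncertain], np.where(uncertain)[0]
```

Widening or narrowing the [lower, upper] band trades off how much work humans do against the error rate on the auto-classified rows.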

Snakes on Ladders: Encouraging People to Climb the Engagement Ladder

3 Jun

Marketers love engagement ladders. To increase engagement with a product, many companies segment their users based on usage, for instance, into heavy (super), medium (average), and light, and prod their users to climb the ladder by suggesting they do things that people in the segment above them are doing and which they aren’t doing (as frequently).

At first blush, it sounds reasonable, even obvious. The trouble with the seemingly obvious, however, is that a) it gives the illusion of understanding, which prevents us from thinking carefully (because there is nothing more to understand!), and b) it doesn’t always make sense.

Let’s start by assuming that the ladder metaphor makes sense. The only thing that we need to do is to implement it correctly.

The ladder metaphor is built on the idea of stable rungs. If the classification into “light”, “medium”, and “heavy” is not durable—for instance, if someone classified as “heavy” can move to “light” next month of their own accord—what we learn by comparing “heavy” users to “medium” users may prove deleterious for the “medium” users.

Thus, it is useful to have stable rungs. Start by assessing the stability of the rungs by building transition matrices over time (see the sketch below). If the rungs are not durable over the time frame over which you want to see an effect, bolster them by extending the window over which usage is measured or by using multiple measures. For instance, if usage over the last month does not produce durable rungs, it may be because usage is heavily seasonal. To fix that, switch to usage over multiple months or a seasonally adjusted number.
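Here is a minimal sketch of that stability check, assuming a pandas DataFrame with hypothetical columns user_id, month, and segment:

```python
import pandas as pd

def transition_matrix(df, id_col="user_id", time_col="month", seg_col="segment"):
    """Cross-tabulate each user's segment in one period against the next period."""
    df = df.sort_values([id_col, time_col]).copy()
    df["next_segment"] = df.groupby(id_col)[seg_col].shift(-1)
    pairs = df.dropna(subset=["next_segment"])
    return pd.crosstab(pairs[seg_col], pairs["next_segment"], normalize="index")

# Large off-diagonal mass, e.g., many "heavy" users drifting to "light" on their
# own, signals that the rungs are not stable enough to compare across.
```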

Once you have stable rungs, the next task is to come up with a set of actions that marketers can encourage users to take. The popular method to arbitrate between potential actions is to regress adjacent rungs on the set of potential actions and find the ones that are most highly correlated or have the highest beta. The popular method may seem reasonable but it isn’t. Assume away causality and you still care about how useful, actionable, and easy a recommended action is. The highest beta doesn’t mean the lowest cost per incremental improvement (again, assuming away causal concerns and taking betas at face value). And there is no way to address such concerns without experimenting and finding out what works best. (The message that works the best is a combination of the action being recommended and how that action is being encouraged.)

There is one minor nuance to the above. It pays to have ‘no action’ as an action if ‘no action’ isn’t your control group. Usage-based sorting merely sorts the users by kinds of people—by people who don’t need to use the product more often than thrice a month versus those who do. Who are we to say that they need to use the product more? The fact is that, often enough, the correlation between usage and retention is small. And doing nothing may prove better than annoying people with unwanted emails.

Lastly, the ladder metaphor leads some to believe that we need to stand up the same ladder for everyone. Using the highest beta or the most effective treatment means recommending the same (best) action to everyone. This is what I call the ‘mail merge’ heuristic. Mail merge is plausibly very highly correlated with the usage of MS-Word. But it would be an utter disaster if MSFT recommended it to me—I plan to quit the MSFT ecosystem if it comes to pass. Ideally, we want to encourage people to cross rungs by using more things in the software that are useful for them. (In fact, it isn’t clear how else we can induce a user to use the software more.) You can learn different ladders by modeling heterogeneity in treatment effects and then use simple algebra to find the best one for each person.
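Here is a minimal sketch of that last step, assuming you have run an experiment that randomly assigned users to the candidate actions (including ‘no action’) and recorded an outcome; the T-learner-style setup and the GradientBoostingRegressor are illustrative choices, not a prescription:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def best_action_per_user(X, treatment, outcome, actions):
    """Fit one outcome model per action, then pick, for each user,
    the action with the highest predicted outcome."""
    models = {a: GradientBoostingRegressor().fit(X[treatment == a], outcome[treatment == a])
              for a in actions}
    preds = np.column_stack([models[a].predict(X) for a in actions])
    return np.asarray(actions)[preds.argmax(axis=1)]
```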

Why do We Fail? And What to do About It?

28 May

I recently read Gawande’s The Checklist Manifesto. (You can read my review of the book here and my notes on the book here.) The book made me think harder about failure and how to prevent it. Here’s a result of that thinking.

We fail because we don’t know or because we don’t execute on what we know (Gorovitz and MacIntyre). Among the things that we don’t know are things that no one else knows either—they are beyond humanity’s reach for now. Ignore those for now. This leaves us with things that “we” know but the practitioner doesn’t.

Practitioners do not know because the education system has failed them, because they don’t care to learn, or because the production of new knowledge outpaces their capacity to learn. Given that, you can reduce ignorance by a) increasing the length of training, b) improving the quality of training, c) setting up continued education, d) incentivizing knowledge acquisition, and e) reducing the burden of how much there is to know by creating specializations, etc. On creating specialties, Gawande has a great example: “there are pediatric anesthesiologists, cardiac anesthesiologists, obstetric anesthesiologists, neurosurgical anesthesiologists, …”

Ignorance, however, ought not to damn the practitioner to error. If you know that you don’t know, you can learn. Ignorance, thus, is not a sufficient condition for failure. But ignorance of ignorance is. To fix overconfidence, leading people through provocative, personalized examples may prove useful.

Ignorance and ignorance about ignorance are but two of the three reasons why we fail. We also fail because we don’t execute on what we know. Practitioners fail to apply what they know because they are distracted, lazy, have limited attention and memory, etc. To solve these issues, we can a) reduce distractions, b) provide memory aids, c) automate tasks, d) train people on the importance of thoroughness, e) incentivize thoroughness, etc.

Checklists are one way to work toward two inter-related aims: educating people about the steps needed to make a decision and aiding memory. But awareness of steps is not enough. To incentivize people to follow the steps, you need to develop processes to hold people accountable. Audits are one way to do that. Meetings set up at appropriate times during which people go through the list are another.

Wanted: Effects That Support My Hypothesis

8 May

How much do survey respondents adjust their responses to the hypothesis that they think the people fielding the survey have? The answer, according to Mummolo and Peterson, is: not much.

Their paper also very likely provides the reason why—people don’t pay much attention. Figure 3 provides data on manipulation checks—the proportion guessing the hypothesis being tested correctly. The change in proportion between control and treatment ranges from -.05 to .25, with the bulk of the changes in Qualtrics between 0 and .1. (In one condition, the authors even offer an additional 25 cents to give a result consistent with the hypothesis. And presumably, people need to know the hypothesis before they can answer in line with it.) The faint increase is especially noteworthy given that, on average, the proportion of people in the control group who guess the hypothesis correctly—without the guessing correction—is between .25 and .35 (see Appendix B; pdf).

So, the big thing we may have learned from the data is how little attention survey respondents pay. The numbers obtained here are similar to those in Appendix D of Jonathan Woon’s paper (pdf). The point is humbling and suggests that we need to: a) invest more in measurement, and b) have yet larger samples, which is an expensive way to overcome measurement error—a point Gelman has made before.

There is also the point about the worthiness of including ‘manipulation checks.’ Experiments tell us the ATE of what we manipulate. The role of manipulation checks is to shed light on ‘compliance.’ If conveying experimenter demand clearly and loudly is a goal, then the experiments included probably failed. If the purpose was to know whether clear but not very loud cues about ‘demand’ matter—and for what it’s worth, I think it is a very reasonable goal; pushing further, in my mind, would have reduced the experiment to a tautology—the paper provides the answer.

Interview with InfoQ

26 Apr

I recently gave an interview to InfoQ about my paper (and associated open source software) on predicting the race and ethnicity of a person using the sequence of characters in a name.

Here’s a relevant excerpt:

InfoQ: Can you discuss how we can learn from names? What ML/DL algorithms can we use?

Gaurav Sood:  Learning more about a person from their name is no different from tackling any other supervised ML problem. It all starts with getting (or creating) a large labeled corpus. For instance, one key innovation in ethnicolr is the training data—we use voting registration files to get a large labeled corpus. In another project on learning from names, I scraped Google Image Search results to build the training data for inferring the gender from a name.

Once you have the data, find ways to exploit patterns in the data to learn a model. Some early ventures exploited the fact that names of different kinds of people began/ended differently. For instance, female names in India often end with an ‘a,’ and you can exploit that pattern to infer gender from Indian names. In ethnicolr, we generalize this intuition and use patterns in sequences of characters. (I am also working on exploiting sequences of sounds.) Like Ye et al., you could also rely on the fact that we correspond more frequently with co-ethnics and exploit email networks for building your models.

To exploit the patterns in the data, the full range of DL/ML tools is available to you. Use what works best.
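To make the intuition concrete, here is a minimal sketch of a character n-gram model, assuming a labeled corpus of names; this is an illustration with scikit-learn, not the ethnicolr implementation (which learns over character sequences with a neural network), and the names and labels below are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled corpus: last names and a race/ethnicity label.
names = ["sood", "smith", "garcia", "nguyen"]
labels = ["asian", "white", "hispanic", "asian"]

# Character bigrams/trigrams capture how names of different groups begin and end.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(names, labels)
print(model.predict(["patel"]))
```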

Estimating the Trend at a Point in a Noisy Time Series

17 Apr

Trends in time series are valuable. If the cost of a product rises suddenly, it likely indicates a sudden shortfall in supply or a sudden rise in demand. If the cost of claims filed by a patient rises sharply, it plausibly suggests rapidly worsening health.

But how do we estimate the trend at a particular time in a noisy time series? The answer is simple: smooth the time series using any one of the many available methods, e.g., local polynomials or GAMs, and then estimate the derivative(s) of the function at the chosen point in time. Smoothing out the noise is essential. If you don’t smooth and instead go with a naive estimate of the derivative, it can be heavily negatively correlated with the derivative obtained from the smoothed series. For instance, in an example we present, the correlation is –.47.

Clarification

Sometimes we want to know what the “trend” was over a particular time window. But what that means is not 100% clear. For a synopsis of the issues, see here.

Python Package

incline provides a couple of ways of approximating the underlying function for the time series:

  • fitting a local higher order polynomial via Savitzky-Golay over a window of choice
  • fitting a smoothing spline

The package provides a way to estimate the first and second derivative at any given time using either of those methods. Beyond these smarter methods, the package also provides a naive estimator of the slope: the average change when you move one step forward and one step backward (step = observed time units). Users can also calculate the average or maximum slope over a time window (over observed time steps).
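incline’s own API aside, here is a minimal sketch of the underlying idea using scipy, assuming an evenly spaced, noisy series; the window length and polynomial order are illustrative:

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
t = np.arange(100)
y = np.sin(t / 10) + rng.normal(scale=0.3, size=t.size)   # noisy time series

# Smooth with a local cubic polynomial (Savitzky-Golay) and take the first
# derivative in one step.
slope = savgol_filter(y, window_length=15, polyorder=3, deriv=1, delta=1.0)

naive_slope = np.gradient(y)    # point-to-point estimate, for comparison
print(slope[50], naive_slope[50])
```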

Rocks and Scissors for Papers

17 Apr

Zach and Jack* write:

What sort of papers best serve their readers? We can enumerate desirable characteristics: these papers should

(i) provide intuition to aid the reader’s understanding, but clearly distinguish it from stronger conclusions supported by evidence;

(ii) describe empirical investigations that consider and rule out alternative hypotheses [62];

(iii) make clear the relationship between theoretical analysis and intuitive or empirical claims [64]; and

(iv) use language to empower the reader, choosing terminology to avoid misleading or unproven connotations, collisions with other definitions, or conflation with other related but distinct concepts [56].

Recent progress in machine learning comes despite frequent departures from these ideals. In this paper, we focus on the following four patterns that appear to us to be trending in ML scholarship:

1. Failure to distinguish between explanation and speculation.

2. Failure to identify the sources of empirical gains, e.g. emphasizing unnecessary modifications to neural architectures when gains actually stem from hyper-parameter tuning.

3. Mathiness: the use of mathematics that obfuscates or impresses rather than clarifies, e.g. by confusing technical and non-technical concepts.

4. Misuse of language, e.g. by choosing terms of art with colloquial connotations or by overloading established technical terms.

Funnily, Zach and Jack fail to take their own advice, forgetting to distinguish between anecdotal and systematic evidence (they claim a ‘troubling trend’ without presenting systematic evidence for it). But the points they make are compelling. The second and third points are especially applicable to economics, though they apply to a lot of scientific production.


* It is Zachary and Jacob.

What Clicks With the Users? Maximizing CTR

17 Apr

Given a pool of messages, how can you maximize CTR?

The problem of maximizing CTR reduces to the problem of estimating the probability that a person in a specific context will click on each of the messages. Once you have the probabilities, all you need to do is apply the max operator and show the message with the highest probability. Technically, you don’t need to get the point estimates right—you just need to get the ranking right.
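A minimal sketch of that last step, assuming a hypothetical score function that returns the predicted click probability for a given user, context, and message:

```python
def pick_message(messages, user, context, score):
    """Return the message with the highest predicted click probability.

    Only the ranking matters, so any monotone transform of the scores works too.
    """
    return max(messages, key=lambda m: score(user, context, m))
```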

Abstracting out, there are four levers for increasing CTR:

  1. Better models and data: Posed as a supervised problem, we are aiming to learn clicks as a function of a) the kind of content, b) the kind of context, and c) the kinds of people. (And, of course, interactions between all three are included.) To learn preferences well, we need to improve our understanding of the content, the context, and the kinds of people. For instance, to understand content more finely, you may need to code font size, font color, etc.
  2. Modeling externalities (user learning): It sounds funny to say that the CTR of a system that shows no messages to some people some of the time can be better than that of a system that shows at least some message to everyone every time they log in. But it can be true. If you want to increase CTR over longer horizons, you need to be able to model the impact of showing one message on a person opening another message. If you do that, you may realize that the best option is to not show a message this time at all. (The other way you could ‘improve’ CTR is by losing people—you may lose the people you bombard with irrelevant messages, and the only people who ‘survive’ are those who like what you send.)
  3. Experimenting With How to Present a Message: Location on the webpage, the font, etc. all may matter. Experiment to learn.
  4. Portfolio: This lets go of the assumption of a fixed portfolio. Increase your portfolio of messages so that you have a reasonable set of things for everyone. It is easy enough to mistake people dismissing a message for disinterest in receiving messages. Don’t make that mistake. If you want to learn where you are failing, find out which kinds of people you have the lowest (calibrated) probability scores for and think hard about what kinds of messages will appeal to them.

Quitting at 40

6 Apr

Recently, I had the pleasure of interviewing Walter Guillioli. Walter is one of those few brave people who has had the courage to take the reins of his life. Walter carefully and smartly worked to save enough to live off the savings and then quit a well-paying job at 40 to live his life.

GS: Tell us a bit more about yourself.

I grew up in a middle-income family in Guatemala. I am the youngest of five. Growing up, I enjoyed getting into trouble.

From a young age, I was taught that education is important. I studied Computer Science in college. And later, I was fortunate to get a full scholarship from the Dutch government to get an MBA.

I worked in Marketing for 10+ years until I got bored and decided to switch careers to data science. I just finished a Master of Science in Data Science from Northwestern while working full-time.

I love animals, the outdoors and the simple things in life like camping and good scenery. I also like to push myself in sports because it humbles me and helps me build character. I got a black belt in Tae Kwon Do at 38, and I am currently training for ultra-running trail races.

GS: Why did you decide to quit working full-time at 40?

WG: It is a combination of factors, but it is mostly a result of intellectual boredom and a desire to spend my time on earth doing things I love, and to not just “survive” life.

I have always questioned the purpose of (my) life and never liked the cycle most follow: study > work > get married & have kids > consume > be “busy” > (maybe get free time at old age) > die.

Professionally, I have done relatively well. Searching for “success,” I have found my dream job three times. However, each time I found my “dream job,” the excitement faded away quickly as I spent most of my time surviving meetings and going through the grind of corporate overhead. I never understood all the stress over work that I didn’t think added much value. I love intellectual challenges and good work, but it was hard to find them in a big corporation.

One of my favorite quotes in Spanish translates roughly to “the richest person is not the one that has the most but the one that knows how to desire less.” And between spending my time in a cubicle working on stuff that didn’t matter to me and buying things I didn’t need, I decided to buy my time and freedom to do what I want.

I decided with my wife to live a simpler life and to move closer to nature and the mountains. I decided to spend more time with my family and raise my 2-year old. I decided that each day I will pick what to do – whether it is going for a trail run (I am training for a 52-mile run) or riding my mountain bike or dirt bike or simply walking my dogs for a few hours or playing with my son and wife in a park or just reading a book.

I will work on projects. I will just work on stuff that matters to me. I want to occasionally freelance on data science projects and contribute to the world. I am also considering personal finance advising to help people.

GS: Tell us a bit more about how you planned your retirement.

WG: I never had a master plan. It has been a learning process with mistakes along the way.

The most important thing for me was changing the mindset about money. I never paid much attention to money. I spent it relatively mindlessly. However, after reading articles like this one, I realized that money is a tool to buy my time and freedom. I can’t think of anything better that money can buy.

So, we focused on understanding our expenses and figuring out ways to reduce them. It’s not about being cheap but about spending intentionally. We also started saving and investing as much as possible in index funds. The end goal became having enough money invested that we could cover our annual expenses from its interest.

GS: What’s your advice for people looking to do the same?

WG:

  1. Track and understand your annual expenses with a tool like Quicken or Mint.
  2. Save as much as you can and invest in index funds. Don’t worry about timing the market (it doesn’t work) or about having the perfect portfolio. Start investing in a broad index fund like Vanguard’s VTSAX and get a bit more sophisticated later. Learn more here.
  3. Make a list of things that truly bring you happiness and contrast that with your spending.
  4. Avoid “lifestyle inflation.” And don’t try to keep up with your neighbors. Nothing will ever be enough.
  5. Read these books: Little Book of Common Sense Investing, Simple Path to Wealth, Your Money or Your Life, Four Pillars of Investing.
  6. Read these blogs: Mr. Money Mustache, Mad Fientist
  7. Listen to the ChooseFI podcast.
  8. If you are married, make sure that everyone is onboard.
  9. Have savings targets and automate everything around them so that you pay yourself first.

Citing Working Papers

2 Apr

Public versions of working papers are increasingly the norm. So are citations to them. But there are three concerns with citing working papers:

  1. Peer review: Peer review improves the quality of papers, but often enough it doesn’t catch serious, basic issues. Thus, a lack of peer review is not as serious a problem as is often claimed.
  2. Versioning: Which version did you cite? Often, there is no canonical versioning system. The best we have is tracking which conference the paper was presented at. This is not good enough.
  3. Availability: Can I check the paper, code, and data for a version? Often enough, the answer is no.

The solution to the latter two is to increase transparency through the entire pipeline. For instance, people can check how my paper with Ken has evolved on Github, including any coding errors that have been fixed between versions. (Admittedly, the commit messages can be improved. Better commit messages—plus descriptions—can make it easier to track changes across versions.)

The first point doesn’t quite deserve addressing, in that the current system draws too optimistic a line around the quality of published papers. Peer review ought not to end when a paper is published in a journal. If we accept that, then all concerns flagged by peers and non-peers can be addressed in various commits or responses to issues and appropriately credited.

A/B Testing Recommendation Systems

1 Apr

Say that you are building a news recommender that decides which news items to list in each person’s news feed. Say that your first version of the news recommender is a rules-based system that uses signals like how many people in your network have seen the news item, how many people in total have read it, the freshness of the item, etc., and sums up the signals in an arbitrary way to rank news items. Your second version uses the same signals but uses a supervised model to decide on the optimal weights.

Say that you find that the recommendations vary a fair bit between the two systems. But which one is better? To suss that out, you conduct an A/B test. But a naive experiment will produce biased estimates of the effect and the s.e. because:

  1. The signals on which your control-group ranking system is based are influenced by the kinds of news articles that people in the treatment group see. And vice versa.
  2. There is an additional source of stochasticity in recommendations that people see: the order in which people arrive matters.

The effect of the first concern is that our estimates are likely attenuated. To resolve it, show people in the control group news articles ranked using views predicted from historical data or using views pro-rated from the people assigned to the control group alone. (This adds a bit of noise to the control-group estimates.) And keep a separate table of input data for the treatment group and apply the ML model to the pro-rated data from that table. A sketch of this bookkeeping follows.
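A minimal sketch of that bookkeeping, with hypothetical per-arm view counters keyed by article id:

```python
from collections import defaultdict

# Each experimental arm accumulates its own popularity signals so that views
# generated under the treatment ranker never leak into the control ranker's inputs.
views = {"control": defaultdict(int), "treatment": defaultdict(int)}

def record_view(arm, article_id):
    views[arm][article_id] += 1

def popularity_signal(arm, article_id):
    # Each arm's ranker reads only its own table (or pro-rated/historical counts).
    return views[arm][article_id]
```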

The consequence of the second issue is that our s.e. is very plausibly much larger than what we will get with split-world testing (each condition gets its own table of counts for views, etc.). The sequence in which people arrive matters because it interacts with social influence. To resolve the second issue, you need to estimate how the sequence of arrival affects outcomes. But given the number of pathways, the best we can probably do is bound the effect, for instance, by estimating the effect of ranking the least downloaded item first.

p.s. The social influence paper doesn’t report s.e., but this paper, based on the Salganik/Watts paper, reports incorrect ones, as it implicitly assumes that the sequence of arrival doesn’t matter.

Advice that works

31 Mar

Writing habits of some writers:

“Early in the morning. A good writing day starts at 4 AM. By 11 AM the rest of the world is fully awake and so the day goes downhill from there.”

Daniel Gilbert

“Usually, by the time my kids get off to school and I get the dogs walked, I finally sit down at my desk around 9:00. I try to check my email, take care of business-related things, and then turn it off by 10:30—I have to turn off my email to get any writing done.”

Juli Berwald

“When it comes to writing, my production function is to write every day. Sundays, absolutely. Christmas, too. Whatever. A few days a year I am tied up in meetings all day and that is a kind of torture. Write even when you have nothing to say, because that is every day.”

Tyler Cowen

“I don’t write everyday. Probably 1-2 times per week.”

Benjamin Hardy

“I’ve taught myself to write anywhere. Sometimes I find myself juggling two things at a time and I can’t be too precious with a routine. I wrote Name of the Devil sitting on a bed in a rented out room in Hollywood while I was working on a television series for A&E. My latest book, Murder Theory, was written while I was in production for a shark documentary and doing rebreather training in Catalina. I’ve written in casinos, waiting in line at Disneyland, basically wherever I have to.”

Andrew Mayne

Should we wake up at 4 am and be done by 11 am as Dan Gilbert does or should we get started at 10:30 am like Juli, near the time Dan is getting done for the day? Should we write every day like Tyler or should we do it once or twice a week like Benjamin? Or like Andrew, should we just work on teaching ourselves to “write anywhere”?

There is a certain tautological aspect to good advice. It is advice that works for you. Do what works for you. But don’t assume that you have been given advice that is right for you or that it is the only piece of advice on that topic. Advice givers rarely point out that the complete set of reasonable things that could work for you is often pretty large and contradictory and that the evidence behind the advice they are giving you is no more than anecdotal evidence with a dash of motivated reasoning.

None of this is to say that you should not try hard to follow advice that you think is good. But once you see the larger point, you won’t fret as much when you can’t follow a piece of advice or when the advice doesn’t work for you. As long as you keep trying to get to where you want to be (and of course, even the merit of some wished-for end states is debatable), it is ok to abandon some paths, safe in the knowledge that there are generally more paths to get there.

Stemming Link Rot

23 Mar

The Internet gives many things. But none that are permanent. That is about to change. Librarians got together and recently launched https://perma.cc/, which provides permanent links to stuff.

Why is link rot important?

Here’s an excerpt from a paper by Gertler and Bullock:

“more than one-fourth of links published in the APSR in 2013 were broken by the end of 2014”

If what you are citing evaporates, there is no way to check the veracity of the claim. Journal editors: pay attention!

countpy: Incentivizing more and better software

22 Mar

Developers of Python packages sometimes envy R developers for the simple perks they enjoy, like a reliable web service that gives a reasonable fill-in for the total number of times an R package has been downloaded. To get the same number, Python developers need to run a Google BigQuery query (which costs money) and wait for 30 or so seconds.

Then there are sore spots that are shared by all developers. Downloads are a shallow metric. Developers often want to know how often other people writing software use their package. Without such a number, it is hard to defend against accusations like, “the total number of downloads is unreliable because it can be padded by numerous small releases,” “the total number of downloads doesn’t reflect how often people use the software,” etc. We partly solve this problem for Python developers by providing a website that tallies how often a package is used in repositories on Github, the largest open-source software hosting platform. http://countpy.com provides the total number of times a package appears in requirements files and in import statements in Python-language repositories. (At the time of writing, the crawl is incomplete.)

The net benefit (loss) of a piece of software is, of course, greater than mere counts of how many people use it directly in the software they build. We don’t yet count indirect use: software that uses software that uses the software of interest. Ideally, we would like to tally the total time saved, the increase in the number of new projects started, projects which wouldn’t have started had the software not been there, impact on style in which other code is written, and such. We may also need to tally the cost of errors in the original software. To the extent that people don’t produce software because they can’t be credited reasonably for it, better metrics about the impact of software can increase the production of software and increase the quality of the software that is being provided.

Searching for Great Conversations

21 Mar

“When was the last time you had a great conversation? A conversation that wasn’t just two intersecting monologues, but when you overheard yourself saying things you never knew you knew, that you heard yourself receiving from somebody words that found places within you that you thought you had lost, and the sense of an eventive conversation that brought the two of you into a different plane and then fourthly, a conversation that continued to sing afterward for weeks in your mind? Conversations like that are food and drink for the soul.”


John O’Donohue (h/t David Perell)

Siamese Networks for Record Linkage

20 Mar

For the uninitiated:

A siamese neural network consists of twin networks which accept distinct inputs but are joined by an energy function at the top. This function computes some metric between the highest level feature representation on each side. The parameters between the twin networks are tied. Weight tying guarantees that two extremely similar images could not possibly be mapped by their respective networks to very different locations in feature space because each network computes the same function.

One Shot

Replace the word “images” with “two representations of the same record across any two tables” and you have an algorithm for producing good distance functions for efficient record linkage. Triplet loss is a natural extension of this. I am looking forward to seeing some bottom-line results comparing it to generic supervised approaches, which reminds me that I am unaware of any large benchmark datasets for the fundamental problem of statistical record linkage.
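Here is a minimal sketch of the idea in PyTorch, assuming each record has already been turned into a numeric feature vector (say, character n-gram counts of its fields); the architecture and the contrastive loss below are illustrative, not a reference implementation:

```python
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    """Twin encoder with tied weights; distance in the learned space scores record pairs."""
    def __init__(self, n_features, dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, dim)
        )

    def forward(self, rec_a, rec_b):
        za, zb = self.encoder(rec_a), self.encoder(rec_b)   # same weights on both sides
        return torch.norm(za - zb, dim=1)                   # small distance => likely a match

def contrastive_loss(dist, is_match, margin=1.0):
    # Pull matching pairs together; push non-matches apart by at least the margin.
    return (is_match * dist.pow(2) +
            (1 - is_match) * torch.clamp(margin - dist, min=0).pow(2)).mean()
```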

The Risk of Misunderstanding Risk

20 Mar

Women who participate in breast cancer screening from age 50 to 69 live, on average, 12 more days. This is the best-case scenario. Gerd has more such compelling numbers in his book, Calculated Risks. Gerd shares such numbers to launch a frontal assault on the misunderstanding of risk. His key point is:

“Overcoming innumeracy is like completing a three-step program to statistical literacy. The first step is to defeat the illusion of certainty. The second step is to learn about the actual risks of relevant events and actions. The third step is to communicate the risks in an understandable way and to draw inferences without falling prey to clouded thinking.”

Gerd’s key contributions are on the third point. Gerd identifies three problems with risk communication:

  1. using relative risk rather than Numbers Needed to Treat (NNT) or absolute risk,
  2. using single-event probabilities, and
  3. using conditional probabilities rather than ‘natural frequencies.’

Gerd doesn’t explain what he means by natural frequencies in the book but some of his other work does. Here’s a clarifying example that illustrates how the same information can be given in two different ways, the second of which is in the form of natural frequencies:

“The probability that a woman of age 40 has breast cancer is about 1 percent. If she has breast cancer, the probability that she tests positive on a screening mammogram is 90 percent. If she does not have breast cancer, the probability that she nevertheless tests positive is 9 percent. What are the chances that a woman who tests positive actually has breast cancer?”

vs.

“Think of 100 women. One has breast cancer, and she will probably test positive. Of the 99 who do not have breast cancer, 9 will also test positive. Thus, a total of 10 women will test positive. How many of those who test positive actually have breast cancer?”
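Both framings describe the same computation, which a few lines confirm; the natural-frequency version simply rounds 9.2 percent to roughly 1 in 10:

```python
p_cancer = 0.01                 # prevalence among women of age 40
p_pos_given_cancer = 0.90       # sensitivity of the mammogram
p_pos_given_healthy = 0.09      # false-positive rate

p_pos = p_cancer * p_pos_given_cancer + (1 - p_cancer) * p_pos_given_healthy
p_cancer_given_pos = p_cancer * p_pos_given_cancer / p_pos
print(round(p_cancer_given_pos, 3))   # ~0.092, i.e., roughly 1 in 10 who test positive
```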

For those in a hurry, here are my notes on the book.

What’s Best? Comparing Model Outputs

10 Mar

Let’s assume that you have a large portfolio of messages: n messages of k types. And say that there are m models, built by different teams, that estimate how relevant each message is to the user on a particular surface at a particular time. How would you rank order the messages by relevance, understood as the probability that a person will click on the relevant substance of the message?

Isn’t the answer to simply use the max operator as a service? Just using the max operator can be a problem because of:

a) Miscalibrated probabilities: the probabilities being output from non-linear models are not always calibrated. A probability of .9 doesn’t mean that there is a 90% chance that people will click it. (A sketch of a quick calibration check follows the next point.)

b) Prediction uncertainty: prediction uncertainty for an observation is a function of the uncertainty in the betas and the distance from the bulk of the points we have observed. If you were to randomly draw 1,000 samples each from the estimated distribution of p, a different ordering may dominate than the one we get when we compare the means.
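Here is a minimal sketch of checking (and fixing) the first problem with scikit-learn; the synthetic data and the gradient-boosting model stand in for one team’s click model:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for one team's click model.
X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)

# Reliability curve: predicted probabilities vs. observed click rates, in 10 bins.
frac_clicked, mean_predicted = calibration_curve(
    y_test, model.predict_proba(X_test)[:, 1], n_bins=10
)
print(list(zip(mean_predicted.round(2), frac_clicked.round(2))))

# If the curve strays from the diagonal, recalibrate before applying the max operator.
calibrated = CalibratedClassifierCV(model, method="isotonic", cv="prefit").fit(X_test, y_test)
```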

This isn’t the end of the problems. It could be that the models are built on data that doesn’t match the data in the real world. (To discover that, you would need to compare the expected error rate to the actual error rate.) And the only way to fix the issue is to collect new data and build new models on it.

Comparing messages based on the propensity to be clicked is unsatisfactory. A smarter comparison would optimize for profit, ideally over the long term. Moving from clicks to profits requires reframing. Profits need not only come from clicks. People don’t always need to click on a message to be influenced by it. They may choose to follow up at a later time. And the message may influence more people than the person clicking on it. To estimate profits, thus, you cannot rely on observational data. To estimate the payoff for showing a message, which is equal to the estimated winnings minus the estimated cost, you need to learn it from an experiment. And to compare payoffs of different messages, e.g., encouraging people to use a product more, encouraging people to share the product with another person, etc., you need to distill the payoffs to the same currency—ideally, cash.

Expertise as a Service

3 Mar

The best thing you can say about Prediction Machines, a new book by a trio of economists, is that it is not barren. Most of the growth you see is about the obvious: the big gain from ML is our ability to predict better, and better predictions will change some businesses. For instance, Amazon will be able to move from shopping-and-then-shipping to shipping-and-then-shopping—you return what you don’t want—if it can forecast what its customers want well enough. Or, airport lounges will see reduced business if we can more accurately predict the time it takes to reach the airport.

Aside from the obvious, the book has some untended shrubs. The most promising of them is the point that supervised algorithms can have human judgment as a label. We have long known about the point. For instance, self-driving cars use human decisions as labels—we learn braking, steering, and speed as a function of road conditions. But what if we could use expert human judgment as a label for other complex cognitive tasks? There is already software that exploits that point. Grammarly, for instance, uses editorial judgments to give advice about grammar and style. But there are so many other places where we could exploit this. You could use it to build educational tools that give guidance on better ways of doing something in real time. You could also use it to reduce the need for experts.

p.s. The point about exploiting the intellectual property of experts deserves more attention.