“When was the last time you had a great conversation? A conversation that wasn’t just two intersecting monologues, but when you overheard yourself saying things you never knew you knew, that you heard yourself receiving from somebody words that found places within you that you thought you had lost, and the sense of an eventive conversation that brought the two of you into a different plane and then fourthly, a conversation that continued to sing afterward for weeks in your mind? Conversations like that are food and drink for the soul.”
John O’Donohue, h/t David Perell
For the uninitiated:
A siamese neural network consists of twin networks which accept distinct inputs but are joined by an energy function at the top. This function computes some metric between the highest level feature representation on each side. The parameters between the twin networks are tied. Weight tying guarantees that two extremely similar images could not possibly be mapped by their respective networks to very different locations in feature space because each network computes the same function. One Shot
Replace the word images with two representations of the same record across any two tables and you have an algorithm for producing good distance functions for efficient record linkage. Triplet loss is a natural extension to this. Looking forward to seeing some bottom line results comparing it to generic supervised results, which reminds me of the fact that I am unaware of any large benchmark datasets for the fundamental problem of statistical record linkage.
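To make the idea concrete, here is a minimal sketch of a triplet loss over records, assuming a toy linear embedding in place of a real network. The names `embed`, `distance`, `triplet_loss`, the weights, and the example records are all made up for illustration. The only thing carried over from the description above is weight tying: every record goes through the same `weights`.

```python
import math

def embed(record, weights):
    """Map a record to feature space with one shared ("tied") linear map."""
    return [sum(w * x for w, x in zip(row, record)) for row in weights]

def distance(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, weights, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    a, p, n = (embed(r, weights) for r in (anchor, positive, negative))
    return max(0.0, distance(a, p) - distance(a, n) + margin)

# Hypothetical weights that ignore the third (noisy) field of each record.
weights = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]
anchor = [1.0, 2.0, 0.5]
positive = [1.0, 2.0, 0.9]   # same record as it appears in another table
negative = [4.0, 6.0, 0.5]   # a different record

loss_easy = triplet_loss(anchor, positive, negative, weights)
loss_hard = triplet_loss(anchor, negative, positive, weights)
print(loss_easy, loss_hard)
```

In a real setup the weights would be learned by minimizing this loss over many labeled triplets; here they are fixed by hand only to show what the loss rewards.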
Women who participate in breast cancer screening from ages 50 to 69 live on average 12 more days. This is the best case scenario. Gerd has more such compelling numbers in his book, Calculated Risks. Gerd shares such numbers to launch a frontal assault on the misunderstanding of risk. His key point is:
“Overcoming innumeracy is like completing a three-step program to statistical literacy. The first step is to defeat the illusion of certainty. The second step is to learn about the actual risks of relevant events and actions. The third step is to communicate the risks in an understandable way and to draw inferences without falling prey to clouded thinking.”
Gerd’s key contributions are on the third point. Gerd identifies three problems with risk communication:
- Using relative risk rather than Numbers Needed to Treat (NNT) or absolute risk,
- Using single-event probabilities, and
- Using conditional probabilities rather than ‘natural frequencies.’
Gerd doesn’t explain what he means by natural frequencies in the book, but some of his other work does. Here’s a clarifying example that shows how the same information can be given in two different ways, the second of which is in the form of natural frequencies:
“The probability that a woman of age 40 has breast cancer is about 1 percent. If she has breast cancer, the probability that she tests positive on a screening mammogram is 90 percent. If she does not have breast cancer, the probability that she nevertheless tests positive is 9 percent. What are the chances that a woman who tests positive actually has breast cancer?”
“Think of 100 women. One has breast cancer, and she will probably test positive. Of the 99 who do not have breast cancer, 9 will also test positive. Thus, a total of 10 women will test positive. How many of those who test positive actually have breast cancer?”
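A quick check that the two framings lead to roughly the same answer. This is a sketch using only the numbers in the two passages above; the function names are mine. The first function applies Bayes' rule to the conditional probabilities, the second simply counts women.

```python
def posterior_from_probabilities(prior, sensitivity, false_positive_rate):
    """Bayes' rule: P(cancer | positive test)."""
    true_positives = prior * sensitivity
    false_positives = (1 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)

def posterior_from_counts(sick_and_positive, healthy_and_positive):
    """Natural frequencies: share of positive tests that come from sick women."""
    return sick_and_positive / (sick_and_positive + healthy_and_positive)

p_bayes = posterior_from_probabilities(prior=0.01, sensitivity=0.90,
                                       false_positive_rate=0.09)
p_counts = posterior_from_counts(sick_and_positive=1, healthy_and_positive=9)

print(round(p_bayes, 3), p_counts)
```

The small gap between the two results comes from the rounding in the natural-frequency version (one woman rather than 0.9 of a woman tests positive), which is part of why it is easier to reason with.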
For those in a hurry, here are my notes on the book.
Let’s assume that you have a large portfolio of messages: n messages of k types. And say that there are n models, built by different teams, that estimate how relevant each message is to the user on a particular surface at a particular time. How would you rank order the messages by relevance, understood as the probability a person will click on the relevant substance of the message?
Isn’t the answer: use the max. operator as a service? Just using the max.
b) Prediction uncertainty: prediction uncertainty for an observation is a function of the uncertainty in the betas and the distance from the bulk of the points we have observed. If you were to randomly draw 1,000 samples each from the estimated distribution of p, a different ordering may dominate than the one we get when we compare the means.
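A small simulation illustrates the point about prediction uncertainty. It is a sketch under assumptions of my own: two hypothetical messages whose click probabilities are described by Beta distributions, one with a slightly higher mean but far more uncertainty. The parameters are made up for illustration.

```python
import random

def share_of_wins(beta_params, n_draws=1000, seed=0):
    """Draw p for every message n_draws times; return how often each one is the max."""
    rng = random.Random(seed)
    wins = [0] * len(beta_params)
    for _ in range(n_draws):
        draws = [rng.betavariate(a, b) for a, b in beta_params]
        wins[draws.index(max(draws))] += 1
    return [w / n_draws for w in wins]

# Hypothetical estimates: message 0 has mean 0.20 but is very uncertain,
# message 1 has mean 0.18 and is tightly estimated.
beta_params = [(2, 8), (180, 820)]
means = [a / (a + b) for a, b in beta_params]
best_by_mean = means.index(max(means))
win_shares = share_of_wins(beta_params)

print(best_by_mean, win_shares)
```

Taking the max of the means always picks message 0, yet in a large share of the draws message 1 comes out on top.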
This isn’t the end of the problems. It could be that the models are built on data that doesn’t match the data in the real world. (To discover that, you would need to compare the expected error rate to the actual error rate.) And the only way to fix the issue is to collect new data and build new models on it.
Comparing messages based on propensity to be clicked is unsatisfactory. A smarter comparison would optimize for profit, ideally over the long term. Moving from clicks to profits requires reframing. Profits need not only come from clicks. People don’t always need to click on a message to be influenced by a message. They may choose to follow up at a later time. And the message may influence more than the person clicking on the message. To estimate profits, thus, you cannot rely on observational data. To estimate the payoff for showing a message, which is equal to the estimated winnings minus the estimated cost, you need to learn it from an experiment. And to compare payoffs of different messages, e.g., encourage people to use a product more, encourage people to share the product with another person, etc., you need to distill the payoffs to the same currency, ideally cash.
The best thing you can say about Prediction Machines, a new book by a trio of economists, is that it is not barren. Most of the growth you see is about the obvious: the big gain from ML is our ability to predict better, and better predictions will change some businesses. For instance, Amazon will be able to move from shopping-and-then-shipping to shipping-and-then-shopping—you return what you don’t want—if it can forecast what its customers want well enough. Or, airport lounges will see reduced business if we can more accurately predict the time it takes to reach the airport.
Aside from the obvious, the book has some untended shrubs. The most promising of them is that supervised algorithms can have human judgment as a label. We have long known about the point. For instance, self-driving cars use human decisions as labels: we learn braking, steering, and speed as a function of road conditions. But what if we could use expert human judgment as a label for other complex cognitive tasks? There is already software that exploits that point. Grammarly, for instance, uses editorial judgments to give advice about grammar and style. But there are so many other places where we could exploit this. You could use it to build educational tools that give guidance on better ways of doing something in
p.s. The point about exploiting the intellectual property of experts deserves more attention.
“In the late 1990s, the leading methods caught about 80 percent of fraudulent transactions. These rates improved to 90–95 percent in 2000 and to 98–99.9 percent today. That last jump is a result of machine learning; the change from 98 percent to 99.9 percent has been transformational.
An improvement from 85 percent to 90 percent accuracy means that mistakes fall by one-third. An improvement from 98 percent to 99.9 percent means mistakes fall by a factor of twenty. An improvement of twenty no longer seems incremental.”
From Prediction Machines by Agrawal, Gans, and Goldfarb.
One way to compare the improvements is to compare the differences in percentage points: 5 and 1.9. That is what I would have done, because conditional on the same difference in percentage points, the lower the base, the greater the multiplicative factor, which makes ratios a cheap way of making small improvements look better. Even then, for consistency, the comparison would have been between percentage increases in accuracy, between (90 – 85)/85 and (99.9 – 98)/98. But AGG had to flip the estimand to percentage errors to make the latter relative change look better.
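The three ways of scoring the same two improvements can be laid side by side. This is a small sketch; the function and key names are mine, and the inputs are the accuracy figures from the quote.

```python
def compare(acc_before, acc_after):
    """Score one accuracy improvement three ways (inputs in percent)."""
    err_before = 100 - acc_before
    err_after = 100 - acc_after
    return {
        "point_gain": acc_after - acc_before,                          # percentage points
        "relative_accuracy_gain": (acc_after - acc_before) / acc_before,
        "error_reduction_factor": err_before / err_after,              # the quote's framing
    }

early = compare(85, 90)
late = compare(98, 99.9)

for name, result in (("85 -> 90", early), ("98 -> 99.9", late)):
    print(name, {k: round(v, 4) for k, v in result.items()})
```

By percentage points and by relative accuracy gain, the first improvement is the larger one. Only the error-ratio framing makes the second look dramatically bigger.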
Vegetarians turn at the thought of eating the meat of a cow that has died from a heart attack. The disgust that vegetarians experience is not principled. Nor is the greater opposition to homosexuality that people espouse when they are exposed to a foul smell. Haidt uses similarly provocative examples to expose chinks in how we think about what is moral and what is not.
Knowing that what we find disgusting may not always be “disgusting,” that our moral reasoning can be flawed, is a superpower. Because thinking that you are in the right makes you self-righteous. It makes you think that you know all the facts, that you are somehow better. Often, we are not. If we stop conflating disgust with being in the right or indeed, with being right, we shall all get along a lot better.
Faced with mass murder, it is hard to escape the conclusion that life has no meaning. For how could it be that life has meaning when lives matter so little? As an Austrian Jew in a concentration camp, Viktor Frankl had to confront that question.
In Man’s Search for Meaning, Frankl gives two answers to the question. His first answer is a reflexive rejection of the meaninglessness of life. Frankl claims that life is “unconditional[ly] meaningful.” There is something to that, but not enough to hang on to for too long. It is also not his big point.
Instead, Frankl has a more nuanced point: “If there is … meaning in life …, then there must be … meaning in suffering.” (Because suffering is an inescapable part of life.) The meaning of suffering, according to him, lies in how we respond to it. Do we suffer
Not only that, the extent of human achievement is: responsibly answering the questions that life asks of us. This means two things. First, that questions about human achievement can only be answered within the context of one’s life. And second, in responsibly answering questions that life asks of us, we attain what humans can ever attain. In a limited life, circumscribed by unavoidable suffering, for instance, the peak of human achievement is keeping dignity. If your life offers you more, then, by all means, do more—derive meaning from
Information on tap is a boon. But if it means that the only thing we will end up knowing (having in our heads) is where to go to find the information, it may also be a bane.
Accessible stored cognitions are vital. They allow us to verify and contextualize new information. If we need to look things up, because of laziness or forgetfulness, we will end up accepting some false statements, which we would have easily refuted had we had the relevant information in our
Information on tap also produces another malaise. It changes the topography of what we know. As search costs go down, people move from learning about a topic systematically to narrowly searching for whatever they need to know, now. And knowledge built on narrow searches looks like Swiss cheese.
Worse, many a time when people find the narrow thing they are looking for, they think that that is all there is to know. For instance, in Computer Science and Machine Learning, people can increasingly execute sophisticated things without knowing much. (And that is mostly a good thing.) But getting something to work, by copying the code from StackOverflow, gives people the sense that they “know.” And when we think we know, we also know that there is not much more to know. Thus, information on tap reduces the horizons of our knowledge about our ignorance.
In becoming better at fulfilling our narrower needs,
Say that you are in the search engine business. And say that you have built a model that estimates how relevant an ad is based on the ‘context’: search query, previous few queries, kind of device, location, and such. Now let’s assume that for context X, the rank-ordered list of ads based on expected profit is: product A, product B, and product C. Now say that you want to estimate how effective an ad for product A is in driving the sales of product A. One conventional way to estimate this is to randomly assign during serve time: for context X, serve half the people an ad for product A and serve half the people no ad. But if it is true (and you can verify this) that an ad for product B doesn’t cause people to buy product A, then you can replace the ‘no ad’ control, where you are not making any money, with an ad for product B. With this, you can estimate the effectiveness of the ad for product A while sacrificing the least amount of revenue. Better yet, if it is true that the ad for product A doesn’t cause people to buy product B, you can also at the same time get an estimate of the efficacy of the ad for product B.
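Here is a toy simulation of that design, a sketch with made-up purchase rates and lifts. It assumes what the argument requires: the ad for A has no effect on purchases of B, and vice versa. Each user in context X is randomly shown ad A or ad B, and each arm serves as the control for the other.

```python
import random

def simulate(n_users, base_a, lift_a, base_b, lift_b, seed=0):
    """Randomize users between ad A and ad B; return estimated lifts for both ads."""
    rng = random.Random(seed)
    users = {"A": 0, "B": 0}
    bought_a = {"A": 0, "B": 0}
    bought_b = {"A": 0, "B": 0}
    for _ in range(n_users):
        arm = "A" if rng.random() < 0.5 else "B"
        users[arm] += 1
        # Assumption: an ad only moves purchases of its own product.
        p_a = base_a + (lift_a if arm == "A" else 0.0)
        p_b = base_b + (lift_b if arm == "B" else 0.0)
        bought_a[arm] += rng.random() < p_a
        bought_b[arm] += rng.random() < p_b
    # The ad-B arm is the control for ad A, and the ad-A arm is the control for ad B.
    effect_a = bought_a["A"] / users["A"] - bought_a["B"] / users["B"]
    effect_b = bought_b["B"] / users["B"] - bought_b["A"] / users["A"]
    return effect_a, effect_b

# Hypothetical rates: ad A lifts purchases of A by 1 point, ad B lifts B by 2 points.
effect_a, effect_b = simulate(200_000, base_a=0.02, lift_a=0.01,
                              base_b=0.03, lift_b=0.02)
print(round(effect_a, 4), round(effect_b, 4))
```

Both estimates land near the true lifts, and no impression was spent on a blank slot. If the no-spillover assumption fails, the estimates are biased by the size of the spillover, which is why it needs to be verified first.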