Superhuman: Can ML Beat Human-Level Performance in Supervised Models?

20 Dec

A supervised model cannot do better than its labels. (I revisit this point later.) So the trick is to make labels as good as you can. The errors in labels stem from three sources: 

  1. Lack of Effort: More effort people spend labeling something, presumably the more accurate it will be.
  2. Unclear Directions: Unclear directions can result from a. poorly written directions, b. conceptual issues, c. poor understanding. Let’s tackle conceptual issues first. Say you are labeling the topic of news articles. Say you come across an article about how Hillary Clinton’s hairstyle has evolved over the years. Should it be labeled as politics, or should it labeled as entertainment (or my preferred label: worthless)? It depends on taste and the use case. Whatever the decision, it needs to be codified (and clarified) in the directions given to labelers. Poor writing is generally a result of inadequate effort.  
  3. Hardness: Is that a 4 or a 7? We have all suffered at the hands of CAPTCHA to know that some tasks are harder than others.   

The fix for the first problem is obvious. To increase effort, incentivize. Incentivize by paying for correctness—measured over known-knowns—or by penalizing mistakes. And by providing feedback to people on the money they lost or how much more others with a better record made.

Solutions for unclear directions vary by the underlying problem. To address conceptual issues, incentivize people to flag (and comment on) cases where the directions are unclear and build a system to collect and review prediction errors. To figure out if the directions are unclear, quiz people on comprehension and archetypal cases. 

Can ML Performance Be Better Than Humans?

If humans label the dataset, can ML be better than humans? The first sentence of the article suggests not. Of course, we have yet to define what humans are doing. If the benchmark is labels provided by a poorly motivated and trained workforce and the model is trained on labels provided by motivated and trained people, ML can do better. The consensus label provided by a group of people will also generally be less noisy than one provided by a single person.    

Andrew Ng brings up another funny way ML can beat humans—by not learning from human labels very well. 

When training examples are labeled inconsistently, an A.I. that beats HLP on the test set might not actually perform better than humans in practice. Take speech recognition. If humans transcribing an audio clip were to label the same speech disfluency “um” (a U.S. version) 70 percent of the time and “erm” (a U.K. variation) 30 percent of the time, then HLP would be low. Two randomly chosen labelers would agree only 58 percent of the time (0.72 + 0.33). An A.I. model could gain a statistical advantage by picking “um” all of the time, which would be consistent with 70 percent of the time with the human-supplied label. Thus, the A.I. would beat HLP without being more accurate in a way that matters.

The scenario that Andrew draws out doesn’t seem very plausible. But the broader point about thinking hard about cases which humans are not able to label consistently is an important one and worth building systems around.

No Shit! Open Defecation in India

20 Dec

On Oct. 2nd, 2019, on Mahatma Gandhi’s 150th birthday, and just five years after the launch of the Swachh Bharat Campaign, Prime Minister Narendra Modi declared India ODF.

From https://sbm.gov.in/sbmdashboard/IHHL.aspx
Note the legend at the bottom. The same legend applies to the graphs in the gallery below.

The 2018-2019 Annual Sanitation Survey corroborates the progress:

From the 2018-19 National Annual Rural Sanitation Survey
From the 2018-19 National Annual Rural Sanitation Survey

Reducing open defecation matters because it can reduce child mortality and stunting. For instance, reducing open defecation to the levels among Muslims can increase the number of children surviving till the age of 5 by 1.7 percentage points. Coffey and Spears make the case that open defecation is the key reason why India is home to nearly a third of stunted children in the world. (See this paper as well.) (You can read my notes on Coffey and Spears’ book here. )

If the data are right, it is a commendable achievement, except that the data are not. As the National Statistical Office 2019 report, published just a month after the PM’s announcement, finds, only “71.3% of (rural) households [have] access to a toilet” (BBC). 

The situation in some states is considerably grimmer.

Like the infomercial where the deal only gets better, the news here only gets worse. For India to be ODF, people not only need to have access to the toilets but also need to use them. It is a key point that Coffey and Spears go to great lengths to explain. They report results from the SQUAT survey, which finds that of the households with latrines, 40% of the households have at least one person who defecates outside.

The government numbers stink. But don’t let the brazen number fudging take away from the actual accomplishment of building millions of toilets and a 20+ percentage point decline in open defecation in rural areas between 2009 and 2017 (based on WHO and Unicef data). (The WHO and Unicef data are corroborated by other sources including the 2018 r.i.c.e survey, which finds: “44% of rural people over two years old in rural Bihar, Madhya Pradesh, Rajasthan, and Uttar Pradesh defecate in the open. This is an improvement: 70% of rural people in the 2014 survey defecated in the open.”)

No Props for Prop 13

14 Dec

Proposition 13 enacted two key changes: 1. it limited property tax to 1% of the cash value, and 2. limited annual increase of assessed value to 2%. The only way the assessed value can change by more than 2% is if the property changes hands (a loophole allows you to change hands without officially changing hands). 

One impressive result of the tax is the inequality in taxes. Sample this neighborhood in San Mateo where taxes range from $67 to nearly $300k.

Take out the extremes, and the variation is still hefty. Property taxes of neighboring lots often vary by well over $20k. ) My back of the envelope estimate of standard deviation based on ten properties chosen at random is $23k.)

Sample another from Stanford where the range is from -$2k (not clear what causes negative taxes) to nearly $59k.

Prop. 13 has a variety of more material perverse consequences. Property taxes are one reason by people move from their suburban houses near the city to other more remote, cheaper places. But Prop. 13 reduces the need to move out. This likely increases property prices, which in turn likely lowers economic growth as employers choose other places. And as Chaste, a long-time contributor to the blog points out, it also means that the currently employed often have to commute longer distances, which harms the environment in addition to harming the families of those who commute.

p.s. Looking at the property tax data, you see some very small amounts. For instance, $19 property tax. When Chaste dug in, he found that the property was last sold in 1990 for $220K but was assessed at $0 in 2009 when it passed on to the government. The property tax on government-owned properties and affordable housing in California is zero. And Chaste draws out the implication: “poor cities like Richmond, which are packed with affordable housing, not only are disproportionately burdened because these populations require more services, they also receive 0 in property taxes from which to provide those services.”

p.p.s. My hunch is that a political campaign that uses property taxes in CA as a targeting variable will be very successful.

p.p.p.s. Chaste adds: “Prop 13 also applies to commercial properties. Thus, big corps also get their property tax increases capped at 2%. As a result, the sales are often structured in ways that nominally preserves existing ownership.

There was a ballot proposition on the November 2020 ballot, which would have removed Prop 13 protections for commercial properties worth more than $3M. Residential properties over $3M would continue to enjoy the protection. Even this prop failed 52%-48%. People were perhaps scared that this would be the first step in removing Prop 13 protections for their own homes.”

Sense and Selection

11 Dec

The following essay is by Chaste. The article was written in early 2018.

———

I will discuss the confounding selection strategies of England, India, and South Africa in the recently finished series. I won’t talk about minutiae like whether Vince’s technique is suited to Australian conditions or whether Rohit Sharma with his current form or Rahane with his overseas quality should have started the series. This is about basic common sense and basic cricketing sense, which a sharp 10-year-old has, and which the selectors appear to lack. Part 1 talks about England’s Ashes selection; Part 2 about India and South Africa selections in the recent Test series.

Part 1

In the recent Ashes, were it not for Cook’s 244 in Melbourne, England would have lived up to their billing as 5-nil candidates. The 5-nil billing was unusual since England was 3rd in the ICC rankings on 105, and Australia was 5th on 97. So how did we get to the expectation of a whitewash?

The English team selection appeared almost geared to maximize the chances of a whitewash. The basics of selection are to identify certain spots and to select enough good options for the uncertain spots. The certain spots were clear: 1 wicketkeeper in Bairstow, two batsmen in Root and Cook, and four bowler/all-rounders in Anderson, Broad, Stokes, and Ali. In addition, Stoneman and Woakes were half-certain spots—sure to play at least 2-3 matches.

The selectors’ job was clear: make enough good selections to address the remaining 2.5 batting spots and the 0.5 bowling spot. And what did they do? They selected three batsmen (Ballance, Vince, and Malan) for the 2.5 batting spots and three bowlers (Ball, Overton, and Crane) for the 0.5 bowling spots.

Brilliant! This left England’s batting no margin for error. There was no backup opener, in effect locking in Stoneman for all five matches. Vince had a county average last season of 33, not much higher than Kyle Abbott, a tail-ender and Vince’s mate at Hampshire, who averaged 30. Let us also not forget that England’s primary innovation in the last couple of years is to become a very attractive batting side that can’t play swing, spin, pace, or bounce. True, the fragility of the English batting is hardly the selectors’ fault. It’s due primarily to England’s ground rating system, where the groundsmen get perfect scores for preparing perfect roads. But it is still the selectors’ job to address this fragility in their selections. Given that Australian wickets don’t turn much and that the open positions were 2, 3, and 5, you would have expected England to take a couple of spare openers (Robson and Roy, for example) who could have batted in any of those positions. Instead, they took only Ballance.

And what were the bowling selections for which England’s batting options were sacrificed? Neither of the two pace backups provided any variety to the attack. There is simply nothing that Ball and Overton can do that is better or different than Woakes. Plunkett, suited to Australian conditions, was ignored. Wood was ignored for the bizarre reason that he might not last the entire series. But wait, there was no chance that Ball or Overton (let alone both) would have played all five matches. Crane was selected on the chance that he might play in one match. Besides, Wood would not have been a good replacement for Woakes in more than 2–3 matches, so demanding his fitness for all five matches was pointless. As if all this absurdity wasn’t enough, when Stokes was ruled out, they replaced a batting all-rounder with another quick bowler/drinks carrier (Finn).

And what were the bowling selections for which England’s batting options were sacrificed? Neither of the two pace backups provided any variety to the attack. There is simply nothing that Ball and Overton can do that is better or different than Woakes. Plunkett, suited to Australian conditions, was ignored. Wood was ignored for the bizarre reason that he might not last the entire series. But wait, there was no chance that Ball or Overton (let alone both) would have played all five matches. Crane was selected on the chance that he might play in one match. Besides, Wood would not have been a good replacement for Woakes in more than 2–3 matches, so demanding his fitness for all five matches was pointless. As if all this absurdity wasn’t enough, when Stokes was ruled out, they replaced a batting all-rounder with another quick bowler/drinks carrier (Finn).

So what made the English selectors adopt strategies that maximized the chances of a whitewash? In recent years, England have adopted a policy of giving every batsman at least a 5–7 test run before the drop: plenty of chances to shine/rope to hang yourself. While the policy makes sense for experienced players, its merits for new batsmen are dubious. I don’t know that an excruciatingly prolonged examination of Roy’s form or Keaton Jennings’ technique during last summer helped those players. To say nothing of burdening the rest of the team with passengers. It is the kind of policy that only world-beating sides can afford. But England stuck to it even though they were looking at a 5-nil drubbing. Since each batsman had at least five tests left in their allotted “chance to fail or shine quota,” England didn’t pick alternate batsmen.

Part 2

There is a basic difference between batsmen and bowlers. Batsmen must stop batting as soon as they get out. Hence, when you increase the number of batsmen in your side, you are likely to get a higher score. Bowlers, on the other hand, can bowl until they drop down dead. Thus, in theory, bowling only Marshall and Garner would help you bowl the opposition out most cheaply. You add bowlers (Holding and Croft, for example) only to provide:

  • Adequate rest so that all bowlers can function properly.
  • Necessary variations: types of pace, bound, swing, and spin, etc.

Thus, your best combination is always the minimum number of bowlers (4) and the maximum number of batsmen (6 + keeper). Even if your side is blessed with a great all-rounder like Imran Khan or Keith Miller, you still go with six specialist batsmen. If you are looking to your 5th bowler for wickets, you have selected your top 4 bowlers poorly. It’s very helpful to have a batting all-rounder who can bowl well enough to rest the four main bowlers without releasing pressure. A great example is Mitchell Marsh in the recent Ashes, even though he didn’t take a single wicket all series.

There are a few cases where a 5th bowler/bowling all-rounder can be useful:

  • There is simply no chance of your team losing on a wicket full of runs. The only possibilities are a draining draw or going for a win on the 5th day.
  • The specialist batsmen on your bench don’t bat any better than your all-rounders. Recent England sides are a good example.

Far from having one or both of the above, this series … 

  • Was the first in test history with three or more matches in which every match saw the fall of 40 wickets.
  • Saw an average innings total of 218: South Africa’s average was 230, India’s was 206.
  • saw fewer than 350 overs (less than four full days of play) in its longest match.

Predictably then, the 5th bowlers were largely a waste. Ashwin and Maharaj bowled 18.1 overs in match 1, Phehlukwayo and Pandya bowled 18 overs in match 3. That’s right: they averaged less than five overs per innings over these two matches: a few balls more than the T20 quota. And it is for this reason that India dropped Rahane / Rohit Sharma, and South Africa dropped Bavuma.

Of course, we know that the 5th bowler is meant to signal aggression, positive intent, and other such buzzwords. But to an intelligent opponent, it only signals that you are clueless about test cricket. It is akin to Kohli repeatedly getting out to a 6th stump line in England, which shows a lack of understanding of the basics of test cricket. It is understandable that with an unrelenting diet of different forms of cricket, young cricketers like Kohli may not understand the basics specific to each form. But we have a right to expect better from the selectors and coach.

About Chaste

Chaste is a consumer in the addiction economy. He spends half his time on Cricinfo, and the other half hating himself for spending half his time on Cricinfo.

Subscribing To Unpopular Opinion

11 Dec

How does the move from advertising-supported content to a subscription model, e.g., NY Times, Substack luminaries, etc., change the content being produced? Ben Thompson mulls over the question in a new column. One of the changes he foresees is that the content will be increasingly geared toward subscribers—elites who are generally interested in “unique and provocative” content. The focus on unique and provocative can be problematic in at least three ways: 

  1. “Unique and provocative” doesn’t mean correct. And since people often confuse original, counterintuitive points as deep, correct, and widely true insights about the world, it is worrying. The other danger is that journalism will devolve into English literature.
  2. As soon as you are in the idea generation business, you pay less attention to “obvious” things, which are generally the things that deserve our most careful attention.
  3. There is a greater danger of people falling into silos. Ben quotes Johan Peretti: “A subscription business model leads towards being a paper for a particular group and a particular audience and not for the broadest public.” Ben summarizes Peretti’s point as: “He’s alluding, in part, to the theory that the Times’s subscriber base wants to read a certain kind of news and opinion — middle/left of center, critical of Donald Trump, etc. — and that straying from that can cost it subscribers.”

There are other changes that a subscriber driven model will wreak. The production of news will favor the concerns of the elites even more. The demise of “newspaper of record” will mean that a common understanding of what is important and how we see things will continue to decline.

p.s. It is not lost on me that Ben’s newsletter is one such subscriber driven outlet.

Too Much Churn: Estimating Customer Churn

18 Nov

A new paper uses financial transaction data to estimate customer churn in consumer-facing companies. The paper defines churn as follows:

There are three concerns with the definition:

  1. The definition doesn’t make clear what is the normalizing constant for calculating the share. Given that the value “can vary between zero and one,” presumably the normalizing constant is either a) total revenue in the same year in which customer buys products, b) total revenue in the year in which the firm revenue was greater.
  2. If the denominator when calculating s_fit is the total revenue in the same year in which the customer buys products from the company, it can create a problem. Consider a case where there is a customer that spends $10 in both year t and year t-k. And assume that the firm’s revenue in the same years is $10 and $20 respectively. In this case, the customer hasn’t changed his/her behavior but their share has gone from 1 to .5.
  3. Beyond this, there is a semantic point. Churn is generally used to refer to attrition. In this case, it covers both customer acquisition and attrition. It also covers both a reduction and an increase in customer spending.

A Fun Aside

“Netflix similarly was not in one of our focused consumer-facing industries according to our SIC classification (it is found with two-digit SIC of 78, which mostly contains movie producers)” — this tracks with my judgment of Netflix.

94.5% Certain That Covid Vaccine Will Be Less Than 94.5% Effective

16 Nov

“On Sunday, an independent monitoring board broke the code to examine 95 infections that were recorded starting two weeks after volunteers’ second dose — and discovered all but five illnesses occurred in participants who got the placebo.”

Moderna Says Its COVID-19 Vaccine Is 94.5% Effective In Early Tests

The data = control group is 5 out of 15k and the treatment group is 90 out of 15k. The base rate (control group) is .6%. When the base rate is so low, it is generally hard to be confident about the ratio (1 – (5/95)). But noise is not the same as bias. One reason to think why 94.5% is an overestimate is simply because 94.5% is pretty close to the maximum point on the scale.

The other reason to worry about 94.5% is that the efficacy of a Flu vaccine is dramatically lower. (There is a difference in the time horizons over which effectiveness is measured for Flu for Covid, with Covid being much shorter, but useful to take that as a caveat when trying to project the effectiveness of Covid vaccine.)

Fat Or Not: Toward ‘Proper Training of DL Models’

16 Nov

A new paper introduces a DL model to enable ‘computer aided diagnosis of obesity.’ Some concerns:

  1. Better baselines: BMI is easy to calculate and it would be useful to compare the results to BMI.
  2. Incorrect statement: The authors write: “the data partition in all the image sets are balanced with 50 % normal classes and 50 % obese classes for proper training of the deep learning models.” (This ought not to affect the results reported in the paper.)
  3. Ignoring Within Person Correlation: The paper uses data from 100 people (50 fat, 50 healthy) and takes 647 images of them (310 obese). It then uses data augmentation to expand the dataset to 2.7k images. But in doing the train/test split, there is no mention of splitting by people, which is the right thing to do.

    Start with the fact that you won’t see the people in your training data again when you put the model in production. If you don’t split train/test by people, it means that the images of the people in the training set are also in the test set. This means that the test set accuracy is likely higher than if you would run it on a fresh sample.

Not So Robust: The Limitations of “Doubly Robust” ATE Estimators

16 Nov

Doubly Robust (DR) estimators of ATE are all the rage. One popular DR estimator is Robins’ Augmented IPW (AIPW). The reason why Robins’ AIPW estimator is called doubly robust is that if either your IPW model or your y ~ x model is correctly specified, you get ATE. Great!

Calling something “doubly robust” (DR) makes you think that the estimator is robust to (common) violations of commonly made assumptions. But DR replaces one strong assumption with one marginally less strong assumption. It is common to assume that IPW or Y ~ X are right. But DR replaces either of these with the OR clause. So how common is it to get either of the models right? Basically never. If neither model is right, you multiply the bias terms. And that ought to blow up the bias.

(There is one more reason to worry about the use of word ‘robust.’ In statistics, it is used to convey robustness of to violations of distributional assumptions.)

Given the small advance in assumptions, it turns out that the results aren’t better either (and can be substantially worse):

  1. “None of the DR methods we tried … improved upon the performance of simple regression-based prediction of the missing values. (see here.)
  2. “The methods with by far the worst performance with regard to RSMSE are the Doubly Robust (DR) approaches, whose RSMSE is two or three times as large as the RSMSE for the other estimators.” (see here and the relevant table is included below.)
From Kern et al. 2016

Some people prefer DR for efficiency. But the claim for efficiency is based on strong assumptions being met: “The local semiparametric efficiency property, which guarantees that the solution to (9) is the best estimator within its class, was derived under the assumption that both models are correct. This estimate is indeed highly efficient when the π-model is true and the y-model is highly predictive.”

p.s. When I went through some of the lecture notes posted online, I was surprised that the lecture notes explain DR as “if A or B hold, we get ATE” but do not discuss the modal case.

Instrumental Music: When It Rains, It Pours

23 Oct

In a new paper, Jon Mellon reviews 185 papers that use weather as an instrument and finds that researchers have linked 137 variables to weather. You can read it as each paper needing to contend with 136 violations of the exclusion restriction, but the situation is likely less dire. For one, weather as an instrument has many varietals. Some papers use local (both in time and space) fluctuations in the weather for identification. At the other end, some use long-range (both in time and space) variations in weather, e.g., those wrought upon by climate. And the variables affected by each are very different. For instance, we don’t expect long-term ‘dietary diversity’ to be affected by short-term fluctuations in the local weather. A lot of the other variables are like that. For two, the weather’s potential pathways to the dependent variable of interest are often limited. For instance, as Jon notes, it is hard to imagine how rain on election day would affect government spending any other way except its effect on the election outcome. 

There are, however, some potential general mechanisms through which exclusion restriction could be violated. The first that Jon identifies is also among the oldest conjecture in social science research—weather’s effect on mood. Except that studies that purport to show the effect of weather on mood are themselves subject to selective response, e.g., when the weather is bad, more people are likely to be home, etc. 

There are some other more fundamental concerns with using weather as an instrument. First, when there are no clear answers on how an instrument should be (ahem!) instrumented, the first stage of IV is ripe for specification search. In such cases, people probably pick up the formulation that gives the largest F-stat. Weather falls firmly in this camp. For instance, there is a measurement issue about how to measure rain. Should it be the amount of rain or the duration of rain, or something else? And then there is a crudeness issue of the instrument as ideally, we would like to measure rain over every small geographic unit (of time and space). To create a summary measure from crude observations, we often need to make judgments, and it is plausible that judgments that lead to a larger F-stat. are seen as ‘better.’

Second, for instruments that are correlated in time, we need to often make judgments to regress out longer-term correlations. For instance, as Jon points out, studies that estimate the effect of rain on voting on election day may control long-term weather but not ‘medium term.’ “However, even short-term studies will be vulnerable to other mechanisms acting at time periods not controlled for. For instance, many turnout IV studies control for the average weather on that day of the year over the previous decade. However, this does not account for the fact that the weather on election day will be correlated with the weather over the past week or month in that area. This means that medium-term weather effects will still potentially confound short-term studies.”

The concern is wider and includes some of the RD designs that measure the effect of ad exposure on voting, etc.

What’s the Next Best Thing to Learn?

10 Oct

With Gaurav Gandhi

Recommendation engines are everywhere. These systems recommend what shows to watch on Netflix and what products to buy on Amazon. Since at least the Netflix Prize, the conventional wisdom is that recommendation engines have become very good. Except that they are not. Some of the deficiencies are deliberate. Netflix has made a huge bet on its shows, and it makes every effort to highlight its Originals over everything else. Some other efficiencies are a result of a lack of content. The fact is easily proved. How often have you done a futile extended search for something “good” to watch?

Take the above two explanations out, and still, the quality of recommendations is poor. For instance, Youtube struggles to recommend high-quality, relevant videos on machine learning. It fails on relevance because it either recommends videos that are too difficult or too easy. And it fails on quality—the opaqueness of explanation makes most points hard to understand. When I look back, most of the high-quality content on machine learning that I have come across is a result of subscribing to the right channels—human curation. Take another painful aspect of most recommendations: a narrow understanding of our interests. You watch a few food travel shows, and YouTube will recommend twenty others.

Problems

What is the next best thing to learn? It is an important question to ask. To answer it, we need to know the objective function. But the objective function is too hard to formalize and yet harder to estimate. Is it money we want, or is it knowledge, or is it happiness? Say we decide its money. For most people, after accounting for risk, the answer may be: learn to program. But what would the equilibrium effects be if everyone did that? Not great. So we ask a simpler question: what is the next reasonable unit of information to learn?

Meno’s paradox states that we cannot be curious about something that we know. Pair it with a Rumsfeld-ism: we cannot be curious about things we don’t know we don’t know. The domain of things we can be curious about hence are things we know that we don’t know. For instance, I know that I don’t know enough about dark matter. But the complete set of things we can be curious about includes things we don’t know we don’t know.

The set of things that we don’t know is very large. But that is not the set of information we can know. The set of relevant information is described by the frontier of our knowledge. The unit of information we are ready to consume is not a random unit from the set of things we don’t know but the set of things we can learn given what we know.  As I note above, a bunch of ML lectures that YouTube recommends are beyond me. 

There is a further constraint on ‘relevance.’ Of the relevant set, we are only curious about things we are interested in. But it is an open question about how we entice people to learn about things that they will find interesting. It is the same challenge Netflix faces when trying to educate people about movies people haven’t heard or seen.

Conditional on knowing the next best substantive unit, we care about quality.  People want content that will help them learn what they are interested in most efficiently. So we need to solve for the best source to learn the information.

Solutions

Known-Known

For things we know, the big question is how do we optimally retain things we know. It may be through Flashcards or what have you.

Exploring the Unknown

  1. Learn What a Person Knows (Doesn’t Know): The first step is in learning the set of information that the person doesn’t know is to learn what a person knows. The best way to learn what a person knows is to build a system that monitors all the information we consume on the Internet. 
  1. Classify. Next, use ML to split the information into broad areas. 
  1. Estimate The Frontier of Knowledge. To establish the frontier of knowledge, we need to put what people know on a scale. We can derive that scale by exploiting syllabi and class structure (101, 102, etc.) and associated content and then scaling all the content (Youtube video, books, etc.) by estimating similarity with the relevant level of content. (We can plausibly also establish a scale by following the paths people follow–videos that they start but don’t finish are good indications of being too easy or too hard, for instance.)

    We can also use tools like quizzes to establish the frontier, but the quizzes will need to be built from a system that understands how information is stacked.
  1. Estimate Quality of Content. Rank content within each topic and each level by quality. Infer quality through both explicit and implicit measures. Use this to build the relevant set.
  1. Recommend from the relevant set. Recommend a wide variety of content from the relevant set.

Distance Function For Matched Randomization

6 Oct

What is the right distance function for creating greater balance before we randomize? One way to do it is not to think too much about the distance function at all. For instance, this paper takes all the variables, treats them as the same (you can do normalization if you want to) and you can use Mahalanobis distance, or what have you. There are two decisions here: about the subspace and about the weights.

Surprisingly, these ad hoc choices don’t have serious pitfalls except that the balance we get finally in the Y_0 (which is the quantity of interest) may not be great. There is also one case where the method will fail. The point is best illustrated with a contrived example. Imagine there is just one observed X and let’s say it is pure noise. If we were to match on noise and then randomize, it will be the case that it will increase the imbalance in the Y_0 half the time and decrease it another half the time. In all, the benefit of spending a lot of energy on improving balance in an ad hoc space, which may or may not help the true objective function, is likely overstated.

If we have a baseline survey and baseline Ys and we assume that Y_lagged predicts Y_0, then the optimal strategy would be to match on lagged Y. If we have multiple time periods for which we have surveys, we can build a supervised learning model to predict Y in the next time period and match on the Y_hat. The same logic applies when we don’t have lagged_Y for all the rows. We can impute them with supervised learning.

Unmatched: The Problem With Comparing Matching Methods

5 Oct

In many matching papers, the key claim proceeds as follows: our matching method is better than others because on this set of contrived data, treatment effect estimates are closest to those from the ‘gold standard’ (experimental evidence).

Let’s side-step concerns related to an important point: evidence that a method works better than other methods on some data is hard to interpret as we do not know if the fact generalizes. Ideally, we want to understand the circumstances in which the method works better than other methods. If the claim is that the method always works better, then prove it.

There is a more fundamental concern here. Matching changes the estimand by pruning some of the data as it takes out regions with low support. But the regions that are taken out vary by the matching method. So, technically the estimands that rely on different matching methods are different—treatment effect over different sets of rows. And if the estimate from method X comes closer to the gold standard than the estimate from method Y, it may be because the set of rows method X selects produce a treatment effect that is closer to the gold standard. It doesn’t however mean that method X’s inference on the set of rows it selects is the best. (And we do not know how the estimate technically relates to the ATE.)

Optimal Recruitment For Experiments: Using Pair-Wise Matching Distance to Guide Recruitment

4 Oct

Pairwise matching before randomization reduces s.e. (see here, for instance). Generally, the strategy is used to create balanced control and treatment groups from available observations. But we can use the insight for optimal sample recruitment especially in cases where we have a large panel of respondents with baseline data, like YouGov. The algorithm is similar to what YouGov already uses, except it is tailored to experiments:

  1. Start with a random sample.
  2. Come up with optimal pairs based on whatever criteria you have chosen.
  3. Reverse sort pairs by distance with the pairs with the largest distance at the top.
  4. Find the best match in the rest of the panel file for one of the randomly chosen points in the pair. (If you have multiple equivalent matches, pick one at random.)
  5. Proceed as far down the list as needed.

Technically, we can go from step 1 to step 4 if we choose a random sample that is half the size we want for the experiment. We just need to find the best matching pair for each respondent.

Rent-seeking: Why It Is Better to Rent than Buy Books

4 Oct

It has taken me a long time to realize that renting books is the way to go for most books. The frequency with which I go back to a book is so low that I don’t really see any returns on permanent possession that accrue from the ability to go back.

Renting also has the virtue of disciplining me: I rent when I am ready to read and it incentives me to finish the book (or graze and assess whether the book is worth finishing) before the rental period expires.

For e-books, my format of choice, buying a book is even less attractive. One reason why people buy a book is for the social returns from displaying the book on a bookshelf. E-books don’t provide that, though in time people may devise mechanisms to do just that. Another reason why people prefer buying books is that they want something ‘new.’ Once again, the concern doesn’t apply to e-books.

From a seller’s perspective, renting has the advantage of expanding the market. Sellers get money from people who would otherwise not buy the book. These people may, instead, substitute it by copying the book or borrowing it from a friend or a library or getting similar content elsewhere, e.g., Youtube or other (cheaper) books, or they may simply forgo reading the book.

STEMing the Rot: Does Relative Deprivation Explain Low STEM Graduation Rates at Top Schools?

26 Sep

The following few paragraphs are from Sociation Today:


Using the work of Elliot (et al. 1996), Gladwell compares the proportion of each class which gets a STEM degree compared to the math SAT at Hartwick College and Harvard University.  Here is what he presents for Hartwick:

Students at Hartwick College

STEM MajorsTop ThirdMiddle ThirdBottom Third
Math SAT569472407
STEM degrees55.0%27.1%17.8

So the top third of students with the Math SAT as the measure earn over half the science degrees. 

    What about Harvard?   It would be expected that Harvard students would have much higher Math SAT scores and thus the distribution would be quite different.  Here are the data for Harvard:

Students at Harvard University

STEM MajorsTop ThirdMiddle ThirdBottom Third
Math SAT753674581
STEM degrees53.4%31.2%15.4%

     Gladwell states the obvious, in italics, “Harvard has the same distribution of science degrees as Hartwick,” p. 83. 

    Using his reference theory of being a big fish in a small pond, Gladwell asked Ms. Sacks what would have happened if she had gone to the University of Maryland and not Brown. She replied, “I’d still be in science,” p. 94.


Gladwell focuses on the fact that the bottom-third at Harvard is the same as the top third at Hartwick. And points to the fact that they graduate at very different rates. It is a fine point. But there is more to the data. The top-third at Harvard have much higher SAT scores than the top-third at Hartwick. Why is it the case that they graduate with a STEM degree at the same rate as the top-third at Hartwick? One answer to that is that STEM degrees at Harvard are harder. So harder coursework at Harvard (vis-a-vis Hartwick) is another explanation for the pattern we see in the data and, in fact, fits the data better as it explains the performance of the top-third at Harvard.

Here’s another way to put the point: If preferences for graduating in STEM are solely and almost deterministically explained by Math SAT scores, like Gladwell implicitly assumes, and the major headwinds are because of relative standing, then we should see a much higher STEM graduation rate for the top-third at Harvard. We should ideally see an intercept shift across schools, which we don’t see, but a common differential between the top and the bottom third.

Campaigns, Construction, and Moviemaking

25 Sep

American presidential political campaigns, big construction projects, and big-budget moviemaking have a lot in common. They are all complex enterprises with lots of moving parts, they all bring together lots of people for a short period, and they all need people to hit the ground running and execute in lockstep to succeed. Success in these activities relies a lot on great software and the ability to hire competent people quickly. It remains an open opportunity to build great software for these industries, software that allows people to plan and execute together.

Dismissed Without Prejudice: Evaluating Prejudice Reduction Research

25 Sep

Prejudice is a blight on humanity. How to reduce prejudice, thus, is among the most important social scientific questions. In the latest assessment of research in the area, a follow-up to the 2009 Annual Review article, Betsy Paluck et al., however, paint a dim picture. In particular, they note three dismaying things:

Publication Bias

Table 1 (see below) makes for grim reading. While one could argue that the pattern is explained by the fact that lab research tends to have smaller samples and has especially powerful treatments, the numbers suggest—see the average s.e. of the first two rows (it may have been useful to produce a $sqrt(1/n)$ adjusted s.e.)—that publication bias very likely plays a large role. It is also shocking to know that just a fifth of the studies have treatment groups with 78 or more people.

Light Touch Interventions

The article is remarkably measured when talking about the rise of ‘light touch’ interventions—short exposure treatments. I would have described them as ‘magical thinking’ for they seem to be founded in the belief that we can make profound changes in people’s thinking on the cheap. This isn’t to say light-touch interventions can’t be worked into a regime that affects profound change—repeated light touches may work. However, as far as I could tell, no study tried multiple touches to see how the effect cumulates.

Near Contemporaneous Measurement of Dependent Variables

Very few papers judged the efficacy of the intervention a day or more after the intervention. Given the primary estimate of interest is longer-term effects, it is hard to judge the efficacy of the treatments in moving the needle on the actual quantity of interest.   

Beyond what the paper notes, here are a couple more things to consider:

  1. Perspective getting works better than perspective-taking. It would be good to explore this further in inter-group settings.
  2. One way to categorize ‘basic research interventions’ is by decomposing the treatment into its primary aspects and then slowly building back up bundles based on data:
    1. channel: f2f, audio (radio, etc.), visual (photos, etc.), audio-visual (tv, web, etc.), VR, etc.
    2. respondent action: talk, listen, see, imagine, reflect, play with a computer program, work together with someone, play together with someone, receive a public scolding, etc.
    3. source: peers, strangers, family, people who look like you, attractive people, researchers, authorities, etc.
    4. message type: parable, allegory, story, graph, table, drama, etc.
    5. message content: facts, personal stories, examples, Jonathan Haidt style studies that show some of the roots of our morality are based on poor logic, etc.

everywhere: meeting consumers where they are

1 Sep

Content delivery is not optimized for the technical stack used by an overwhelming majority of people. The technical stack of people who aren’t particularly tech-savvy, especially those who are old (over ~60 years), is often a messaging application like FB Messenger or WhatsApp. They currently do not have a way to ‘subscribe’ to Substack newsletters or podcasts or Youtube videos in the messaging application that they use (see below for an illustration of how this may look on the iPhone messaging app.) They miss content. And content producers have an audience hole.

Credit: Gaurav Gandhi

A lot of the content is distributed only via email or distributed within a specific application. There are good strategic reasons for that—you get to monitor consumption, recommend accordingly, control monetization, etc. But the reason why platforms like Substack, which enable independent content producers, limit distribution to email is not as immediately clear. It is unlikely a deliberate decision. It is likely a decision based on a lack of infrastructure that connects publishing to various messaging platforms. The future of messaging platforms is Slack—a platform that integrates as many applications as possible. As Whatsapp rolls out its business API, there is a potential to build an integration that allows producers to deliver premium content, leverage other kinds of monetization, like ads, and even build a recommendation stack. Eventually, it would be great to build that kind of integration for each of the messaging platforms, including iMessage, FB Messenger, etc.

Let me end by noting that there is something special about WhatsApp. No one has replicated the mobile phone-based messaging platform. And the idea of enabling a larger stack based on phone numbers remains unplumbed. Duo and FaceTime are great examples but there is potential for so much more. For instance, a calendar app. that runs on the mobile phone ID architecture.

The (Mis)Information Age: Provenance is Not Enough

31 Aug

The information age has bought both bounty and pestilence. Today, we are deluged with both correct and incorrect information. If we knew how to tell apart correct claims from incorrect, we would have inched that much closer to utopia. But the lack of nous in telling apart generally ‘obvious’ incorrect claims from correct claims has brought us close to the precipice of disarray. Thus, improving people’s ability to identify untrustworthy claims as such takes on urgency.

http://gojiberries.io/2020/08/31/the-misinformation-age-measuring-and-improving-digital-literacy/

Inferring the Quality of Evidence Behind the Claims: Fact Check and Beyond

One way around misinformation is to rely on an expert army that assesses the truth value of claims. However, assessing the truth value of a claim is hard. It needs expert knowledge and careful research. When validating, we have to identify with which parts are wrong, which parts are right but misleading, and which parts are debatable. All in all, it is a noisy and time-consuming process to vet a few claims. Fact check operations, hence, cull a small number of claims and try to validate those claims. As the rate of production of information increases, thwarting misinformation by checking all the claims seems implausibly expensive.

Rather than assess the claims directly, we can assess the process. Or, in particular, the residue of one part of the process for making the claim—sources. Except for claims based on private experience, e.g., religious experience, claims are based on sources. We can use the features of these sources to infer credibility. The first feature is the number of sources cited to make a claim. All else equal, the more number of sources saying the same thing, the greater the chances that the claim is true. None of this is to undercut a common observation: lots of people can be wrong about something. A harder test for veracity if a diverse set of people say the same thing. The third test is checking the credibility of the sources.

Relying on the residue is not a panacea. People can simply lie about the source. We want the source to verify what they have been quoted as saying. And in the era of cheap data, this can be easily enabled. Quotes can be linked to video interviews or automatic transcriptions electronically signed by the interviewee. The same system can be scaled to institutions. The downside is that the system may prove onerous. On the other hand, commonly, the same source is cited by many people so a public repository of verified claims and evidence can mitigate much of the burden.

But will this solve the problem? Likely not. For one, people can still commit sins of omission. For two, they can still draft things in misleading ways. For three, trust in sources may not be tied to correctness. All we have done is built a system for establishing provenance. And establishing the provenance is not enough. Instead, we need a system that incentivizes both correctness and presentation that makes correct interpretation highly likely. It is a high bar. But it is the bar—correct and liable to correctly interpreted.

To create incentives for publishing correct claims, we need to either 1. educate the population, which brings me to the previous post, or 2. find ways to build products and recommendations that incentivize correct claims. We likely need both.