ML (O)Ops: What Data To Collect? (part 3)

16 Jun

The first part of the series, “Improving and Deploying On-Device Models With Confidence,” is posted here. The second part, “Keeping Track of Changes,” is posted here.

With Atul Dhingra

For a broad class of machine learning problems, nitpicking over the neural net architecture is over (see, for instance, here). Instead, the focus has shifted to data. In the note below, we articulate some ways of thinking about what data to collect. In our discussion, we focus on supervised learning. 

The answer to “What data to collect?” varies by where you are in the product life cycle. If you are building a new ML product and the aim is to deploy something (basic) that delivers value and then iterate on it, one answer to the question is to label easy-to-predict cases—cases that allow you to build models where the precision is high but the recall is low. The bar is whether the model can do as well as business as usual for a small set of cases. The good thing is that you can hurdle that bar another way—by coding a random sample, building a model, and choosing a threshold where the precision is greater than business as usual (read more here). For producing POCs, models built on cheap data, e.g., open-source data, which plausibly do not produce value, can also “work” though they need to be managed against the threat of poor performance reducing faith in the system. 
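The threshold idea can be sketched in a few lines. This is a minimal illustration, assuming you have model scores and hand-coded labels for a random sample; the function names, the scores, the labels, and the 0.85 business-as-usual precision are all hypothetical.

```python
def precision_at(threshold, scores, labels):
    """Precision among cases scored at or above the threshold (None if nothing is flagged)."""
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    if not flagged:
        return None
    return sum(flagged) / len(flagged)


def lowest_threshold_beating(baseline_precision, scores, labels):
    """Lowest observed score threshold whose precision exceeds the baseline.

    Returns None when no threshold clears the bar.
    """
    best = None
    for t in sorted(set(scores), reverse=True):
        prec = precision_at(t, scores, labels)
        if prec is not None and prec > baseline_precision:
            best = t  # keep scanning: a lower threshold flags more cases (higher recall)
    return best


# Hypothetical coded random sample: model scores and true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

# Hypothetical business-as-usual precision of 0.85.
threshold = lowest_threshold_beating(0.85, scores, labels)
print(threshold)
```

Picking the lowest qualifying threshold is one design choice among several: it keeps as much recall as the precision bar allows. With a small coded sample, the precision estimates are noisy, so some margin over the bar is probably wise.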

The more conventional case is where you have a deployed model, and you want to improve its performance. There the answer to what data to collect is data that yields the highest ROI. (The answer to what data provides the highest ROI will vary over time, so we need a system that continuously answers it.) If we assume that the labeling costs for points are the same, the prioritization function reduces to ranking data by returns. To begin with, let’s assume that returns are measured by the model’s cost function. So, for instance, if we are looking for a model that lowers the RMSE, we would like to rank by how much reduction in RMSE we get from labeling an additional point. And naturally, we care about the test set RMSE. (You can generalize this intuition to any loss function.) So far, so good. The rub is that there is no trivial answer to the problem.
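To make the ranking concrete, here is a brute-force sketch: refit the model with each candidate point added and record the drop in test RMSE. It assumes the candidates' labels are already known (say, in a retrospective exercise), uses a one-variable linear model, and runs on made-up data. It is a refit-everything illustration, not an influence-function computation.

```python
import math


def fit_line(points):
    """Ordinary least squares fit of y = a + b*x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(model, points):
    """Root mean squared error of a fitted (intercept, slope) model on (x, y) pairs."""
    intercept, slope = model
    return math.sqrt(sum((y - (intercept + slope * x)) ** 2 for x, y in points) / len(points))


def rank_by_rmse_reduction(train, candidates, test):
    """Sort candidates by how much adding each one alone lowers test RMSE."""
    base = rmse(fit_line(train), test)
    gains = [(base - rmse(fit_line(train + [c]), test), c) for c in candidates]
    return sorted(gains, reverse=True)


# Hypothetical data: the truth is y = 2x, and one training label is noisy.
train = [(0.0, 0.0), (1.0, 2.0), (2.0, 6.0)]
test = [(3.0, 6.0), (4.0, 8.0)]
candidates = [(1.0, 2.0), (5.0, 10.0)]

ranked = rank_by_rmse_reduction(train, candidates, test)
for gain, point in ranked:
    print(point, round(gain, 3))
```

In this toy case, the point far from the existing data helps much more than a near-duplicate of a training point. In practice, the labels are unknown before you pay for them, which is exactly why the ranking is hard to get.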

One way to answer the question is to run experiments, sampling across Xs, or plausibly use bandits and navigate the explore-exploit tradeoff smartly. Rather than do experiments, you can also exploit the data you have to figure out the kinds of points that make the most impact on RMSE. One way to get at that is using influence functions. There are, however, a couple of challenges in using these methods. The first is that the covariate space is large and the marginal impact is small, and that means inference is noisy. The second is a more general problem. Say you find that X_1, X_2, X_3, … are the points that lead to the largest reduction in RMSE. But how do you use that knowledge to convert it into a data collection problem? Is it that we should collect replicas of X_1? Probably not. We need to generalize from these examples and come up with a statement about the “type of data” that needs to be collected, e.g., more images where the traffic sign is covered by trees. To come up with the ‘type’, we need to specify what the example is not—how does it differ from the rest of the data we have? There are a couple of ways to answer the question. The first is to use clustering (using embeddings) and then assigning someone to label the clusters. Another is to use supervised learning to classify the X_1, X_2, X_3 from the rest of the data and figure out the “important predictors.” 

There are other answers to the question, “What data to collect?” For instance, we could look to label points where we are least certain or where we make the largest error. The intuition in the classification setting is that these points are closest to the hyperplane that separates the classes, and if you can learn to classify near the boundary, you are set. In using this method, you can also sometimes discover mislabeling. (The RMSE method we talk about above doesn’t interrogate the Y, taking the labels as given.) 
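For a binary classifier, "least certain" can be operationalized as predicted probabilities closest to 0.5. A minimal sketch, with hypothetical probabilities and a hypothetical labeling budget of three points:

```python
def least_certain(probabilities, k):
    """Indices of the k points whose predicted P(y = 1) is closest to 0.5."""
    order = sorted(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))
    return order[:k]


# Hypothetical predicted probabilities for six unlabeled points.
probs = [0.97, 0.52, 0.10, 0.45, 0.80, 0.63]
to_label = least_certain(probs, 3)
print(to_label)
```

The same ordering trick works for the "largest error" variant: sort already-labeled points by the absolute gap between the label and the predicted probability, largest first.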

Another way to answer the question is to use model interpretation tools to figure out “why” the models are making errors. For instance, you could find that the model is making errors because of confounding. Famously, for instance, a cat vs. dog classifier can merely be an outdoor vs. indoor classifier. And if we see the model relying on confounded features like the background, we could a) label the data better to segment dogs and cats out from the background, or b) introduce paired examples such that the only difference between any two images is the presence or absence of a dog/cat.

Partisan Morality

11 Jun

Sinn Féin and Fianna Fáil have said that activists posed as members of a polling company and went door-to-door to canvass the opinions of voters.

The rationale is simple. If you pose as an SF worker, you are likely to be met with shut doors or with opinions in favor of SF given under slight duress. Is it a bridge too far or is it a harmless lie? More generally, do we use the same moral reasoning paradigm for violations by co-partisans and opposing partisans? My hunch is that for such kinds of violations we use a deontological framework for opposing partisans and a consequentialist one for co-partisans. The framework we use may switch depending on the circumstance. One way to test it would be to do a survey experiment with the above news article, switching parties. To get a better baseline, it may be useful to do three conditions: party_a, party_b, consumer_brand, e.g., Coke, etc.

Market Welfare: Why Are Covid-19 Vaccines Still Underfunded?

11 Jun

“To get roughly 70% of the planet’s population inoculated by April, the IMF calculates, would cost just $50bn. The cumulative economic benefit by 2025, in terms of increased global output, would be $9trn, to say nothing of the many lives that would be saved.”

The Economist frames this as an opportunity for G7. And it is. But it is also an opportunity for third-world countries, which plausibly can borrow $50bn given the return on investment. The fact that money hasn’t already been allocated poses a puzzle. Is it because governments think about borrowing decisions based on whether or not a policy is tax revenue positive (which a 180x return ought to be even with low tax collection and assessment rates)? Or is it because we don’t have a marketplace where we can transact on this information? If so, it seems like an important hole.

Here’s another way to look at this point. Among countries where the profits mostly go to a few, why do the people at the top not come to invest together so that they can harvest profits later? Brunei is probably an ok example.

The Story of Science: Storytelling Bias in Science

7 Jun

Often enough, scientists are left with the unenviable task of conducting an orchestra with out-of-tune instruments. They are charged with telling a coherent story about noisy results. Scientists defer to the demand partly because there is a widespread belief that a journal article is the appropriate grouping variable at which results should ‘make sense.’

To tell coherent stories with noisy data, scientists resort to a variety of underhanded methods. The first is simply squashing the inconvenient results—never reporting them or leaving them to the appendix or couching the results in the language of the trade, e.g., “the result is only marginally significant” or “the result is marginally significant” or “tight confidence bounds” (without ever talking about the expected effect size). Secondly, if good statistics show uncongenial results, drown the data in bad statistics, e.g., report the difference between a significant and an insignificant effect as significant. The third trick is overfitting. A sin in machine learning is a virtue in scientific storytelling. Come up with fanciful theories that could explain the result and make that the explanation. The fourth is to practice the “have your cake and eat it too” method of writing. Proclaim big results at the top and offer a thick word soup in the main text. The fifth is to practice abstinence—abstain from interpreting ‘inconsistent’ results as coming from a lack of power, bad theorizing, or heterogeneous effects.

The worst outcome of all of this malaise is that many (expectedly) become better at what they practice—bad science and nimble storytelling.

The Hateful ATE: The Effect of Affective Polarization

7 Jun

In a new paper, Broockman et al. use a clever manipulation to induce “three decades of change in affective polarization”:

In typical trust games, there are two players. Player 1 receives a cash allocation and is instructed to give “some, all, or none” of the money to Player 2. The player is also told that the researchers will triple any amount Player [1] gives to Player 2 and that Player 2 can return some, all, or none of the money back to Player 1. Therefore, the more Player 1 expects reciprocity from Player 2, the more money they should allocate to Player 2 in anticipation they will receive a larger sum in return, and the better off Player 2 will be. For example, if Player 1 gives all her money to Player 2, this sum would be tripled, and Player 2 could return half of the tripled amount to Player 1—leaving both players with 50% more than Player 1’s initial allocation. But if Player 1 gives no money to Player 2, Player 1 leaves with only her initial allocation and Player 2 leaves with nothing.

First, we always make participants take the role of Player 2. This means they always first observe an allocation another player makes to them. Second, across three consecutive rounds of game play, participants are told they are interacting with three other respondents of the opposite political party who have each been allocated $10. However, they are in fact interacting with computerized opponents who offer allocations based on a pre-determined script. Participants randomized to the Positive Experience condition receive allocations from Player 1 of $8, $7 and $8 (tripled to $24, $21 and $24) respectively across the three rounds of the game. However, those in the Negative Experience condition receive $0 allocations in all three rounds.

Broockman et al. 2021

Next comes the punchline. “Player 1’s reason for their allocation to you: your partisanship (all rounds), your income (Round 2)”. See Page 65.

Being told that a co- or opposing- partisan gave $0 versus being told that they gave $8, $7, and $8 because of your partisanship across three rounds has a dramatic effect on partisans’ feelings: partisans’ feelings toward opposing partisans become ‘cooler,’ it doesn’t affect their feelings towards co-partisans (impressive), and (strangely) polarizes their feelings toward elites (see the figure below).

Three comments are in order.

First, the manipulation is unrealistic given previous effect sizes (see here). “The average amount allocated to copartisans in the trust game was $4.58 (95% confidence interval [4.33, 4.83]), representing a “bonus” of some 10% over the average allocation of $4.17.”

Second, the manipulation principally ought to change perceptions of how trusting people are and not how trustworthy they are. We don’t manipulate how deceitful the other person is but how fearful they are of not having their actions reciprocated. Disliking less trusting people is slightly weird and plausibly points to how the underlying antipathy can be exacerbated by treatments that do not present a clear reason for judging another person more harshly. Or it could be that not being seen as being trustworthy and losing out on money as a result of it is insulting and aggravating.

Whatever the reason, generalizing from a bad personal interaction to all other members of a group is disturbing. (The fact that treatment cools people’s feelings toward opposing partisans suggests people expect better from them, which is interesting.) Ascribing feelings from a bad personal experience to elites seems odder (and more disturbing) still.

The absence of commensurate co- and opposing- partisan feeling panels for elites feels odd.

The paper finds that having a “bad” personal experience (vis-a-vis a better one) with an opposing partisan increases interpersonal animus (plus polarization of feelings toward partisan elites) but doesn’t cause partisans to like opposing partisan MCs less or co-partisan MCs more (though see above. Note that the pooled estimate for the opposing party is 1.5% or so—which is about what I would expect; it likely deserves another run at the bank). (I didn’t understand the change from co-partisan and opposing-partisan MCs to “own MCs” in the next analysis, so I am omitting that.) The paper discusses other DVs: 

  1. Interest in expressing party-consistent issue preferences (no effect)
  2. Support for bi-partisan legislation (~ more in favor)
  3. Opposition to democratic norms (pooled index seems to move by d = .09 and is nearly sig. at conventional levels). (I make a special reference to the index because presumably it has the least measurement error and is least likely to show an idiosyncratic pattern given sample size. There is also a small point about how multiple comparison adjustments are made—plausibly they should account for measurement error.)
  4. Endorsement of partisan-congenial claims (Ds yes; Rs no)

The theorized path from bad personal experience with a co- (or opposing) partisan to opposition to democratic norms, etc., seems convoluted to me. So let’s unpack the theoretical underpinnings of the expectations. Interpersonal animus among partisans is an indicator of affective polarization. And the experiment successfully manipulates interpersonal animus. So what’s the issue? One escape hatch is that the concept is not uni-dimensional. Another is that any increase in interpersonal affect manifests in political consequences only over long periods as it causes people to watch different media, trust different things, etc.

The True Ones: Best Guess of True Proportion of 1s

30 May

ML models are generally used to make predictions about individual observations. Sometimes, however, the business decision is based on aggregate data. For example, say a company sells pants and wants to know how many will be returned over a certain period. Say the company has an ML model that predicts the chance a customer will return a pair of pants. A natural thing to do would be to use the individual predictions to get an expected return count.

One way to get an expected return count, if the model produces calibrated probabilities, is to simply sum the predicted probabilities (or take the mean to get the expected proportion). But say that you built an ML model to predict a dichotomous variable and you only have access to categorized outputs (1s and 0s). Say for model X, for cat == 1, the OOS recall is r and the OOS precision is p. Let’s say we use the model to predict labels for another dataset. Let’s say we observe 100 1s and 200 0s. What is the best estimate of the true proportion of 1s in the new dataset?

The quantity of interest = TP + FN

Since r = TP/(TP + FN), TP + FN = TP/r

Since p = TP/(TP + FP), TP = (TP + FP)*p

TP + FN = ((TP + FP)*p)/r = 100*p/r

With n = 100 + 200 = 300, (TP + FN)/n = 100p/(300r) = p/(3r)
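The same arithmetic as a small function. This is a sketch that assumes the out-of-sample precision and recall carry over to the new dataset; the function names and the example values p = 0.9 and r = 0.6 are made up for illustration.

```python
def estimated_true_ones(predicted_ones, precision, recall):
    """Estimate TP + FN from the number of predicted 1s, precision p, and recall r."""
    if not 0 < recall <= 1:
        raise ValueError("recall must be in (0, 1]")
    if not 0 <= precision <= 1:
        raise ValueError("precision must be in [0, 1]")
    true_positives = predicted_ones * precision  # TP = (TP + FP) * p
    return true_positives / recall               # TP + FN = TP / r


def estimated_true_proportion(predicted_ones, predicted_zeros, precision, recall):
    """Estimated share of true 1s among all n = predicted 1s + predicted 0s."""
    n = predicted_ones + predicted_zeros
    return estimated_true_ones(predicted_ones, precision, recall) / n


# 100 predicted 1s and 200 predicted 0s, with hypothetical p = 0.9 and r = 0.6.
proportion = estimated_true_proportion(100, 200, precision=0.9, recall=0.6)
print(round(proportion, 3))
```

Nothing in the formula stops the estimate from exceeding 1 when p/r is large relative to the share of predicted 1s, which would be a sign that the stored precision and recall do not fit the new data.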

ML (O)Ops! Keeping Track of Changes (Part 2)

22 Mar

The first part of the series, “Improving and Deploying On-Device Models With Confidence”, is posted here.

With Atul Dhingra

One way to automate classification is to compare new instances to a known list and plug in the majority class of the exact match. For such instance-based learning, you often don’t need to version data; you just need a hash table. When you are not relying on an exact match—most machine learning—you often need to version data to reproduce the behavior.

Reproducibility is the bedrock of mature software engineering. It is fundamental because it allows you to diagnose issues. You can reproduce the behavior of a ‘version.’ With that power, you can correlate changes in inputs with changes in outputs. Systems that enable reproducibility, like version control, have another vital purpose: they reduce the risk stemming from changes and allow regression testing in systems that depend on data, such as ML. They reduce risk by allowing changes to be rolled back.

To reproduce outputs from machine learning models, we need to do more than store data. We also need to store hyper-parameters, details about the OS, programming language, and packages, among other things. But given that the primary value of reproducibility is instrumental (diagnosis), we want not just the ability to reproduce but also the ability to understand changes and correlate them. Current solutions miss the mark.

Current Solutions and Problems

One way to version data is to treat it as a binary blob. Store all the data you learned a model on to a server and store a reference to the data in your repository. If the data changes, store the new version and create a new pointer. One downside of using a git lfs-like mechanism is that your storage blows up. Another is that build times can be large if the local cache is small or, more generally, if access costs are large. Yet another problem is the lack of a neat interface that allows you to track more than source data.

DVC purports to solve all three problems. It solves the first by providing a way to not treat the data as a blob. For instance, in a computer vision workflow, the source data is image files with some elementary tags—labels, assignments to train and test, etc. The differences between data versions are 1) changes in images (additions mostly) and 2) changes in mapping to labels and assignments. DVC allows you to store the differences in corpora of images as a list of additional hashes to files. DVC is silent on the second point—efficient storage of changes in mappings. We come to it later. DVC purports to solve the second problem by allowing you to save to local cloud storage. But it can still be time-consuming to download data from the cloud storage buckets. The reason is as follows. Each time you want to work on an experiment, you need to clone the entire cache to check out the appropriate files. And if not handled properly, the cloning time often significantly exceeds typical training times. Worse, it locks you into a cloud provider for any optimizations you may want to alleviate these time-bound cache downloads. DVC purports to solve the last problem by using yaml, tags, etc. But anarchy prevails. 

Future Solutions

Interpretable Changes

One of the big problems with data versioning is that the diffs are not human-readable, much less comprehensible. The diffs are usually very long, and the changes in the diff are hashes, which means that to review an MR/PR/Diff, the reviewer has to check out the change and pull the data with the updated hashes. The process can be easily improved by adding an extra layer that auto-summarizes the changes into a human-readable form. We can, of course, easily do more. We can provide ways to understand how changes to inputs correlate with changes in outputs.
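A sketch of what the summarizing layer could look like, assuming each data version is described by a manifest that maps file paths to content hashes. The manifest format, the paths, and the hashes here are hypothetical and not taken from any particular tool.

```python
def summarize_manifest_diff(old, new):
    """Summarize the changes between two {path: content_hash} manifests in plain language."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    lines = [f"{len(added)} added, {len(removed)} removed, {len(changed)} changed"]
    lines += [f"  + {p}" for p in added]
    lines += [f"  - {p}" for p in removed]
    lines += [f"  ~ {p}" for p in changed]
    return "\n".join(lines)


# Hypothetical manifests for two versions of an image dataset.
v1 = {"train/cat_01.jpg": "a1f3", "train/dog_01.jpg": "9c2e", "labels.csv": "77b0"}
v2 = {"train/cat_01.jpg": "a1f3", "train/dog_02.jpg": "4d10", "labels.csv": "e5a9"}

summary = summarize_manifest_diff(v1, v2)
print(summary)
```

A reviewer reading this summary in an MR can see at a glance that one image was swapped and the label file changed, without pulling any data.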

Diff. Tables

The standard method of understanding data as a blob seems uniquely bad. For conventional rectangular databases, changes can be understood as changes in functional transformations of core data lake tables. For instance, say we store the label assignments of images in a table. And say we revise the labels of 100 images. (The core data lake tables are immutable, so the changes are executed in the downstream tables.) One conventional way of storing the changes is to use a separate table for recording changes. Another is to write an update statement that is run whenever “the v2” table is generated. This means the differences across data are now tied to a data transformation computation graph. When data transformation is inexpensive, we can delay running the transformations till the table is requested. In other cases, we can cache the tables.
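The change-table idea can be sketched as follows: the core table stays immutable, each version records only its changes, and a versioned table is materialized (and cached) when it is requested. The table contents, version names, and function name are hypothetical.

```python
from functools import lru_cache
from types import MappingProxyType

# Immutable core table: image id -> label (hypothetical data).
BASE_LABELS = MappingProxyType({"img_001": "cat", "img_002": "dog", "img_003": "dog"})

# One small change table per version, applied in order on top of the core table.
LABEL_CHANGES = {
    "v2": {"img_003": "cat"},
    "v3": {"img_001": "dog", "img_004": "cat"},
}
VERSION_ORDER = ["v1", "v2", "v3"]


@lru_cache(maxsize=None)
def labels_at(version):
    """Build the label table for a version on request; repeat requests hit the cache."""
    if version not in VERSION_ORDER:
        raise KeyError(f"unknown version: {version}")
    table = dict(BASE_LABELS)
    for v in VERSION_ORDER[1:VERSION_ORDER.index(version) + 1]:
        table.update(LABEL_CHANGES[v])
    return MappingProxyType(table)  # read-only view so cached tables cannot be edited


print(dict(labels_at("v2")))
print(dict(labels_at("v3")))
```

Because each version is a base table plus a list of transformations, the difference between v1 and v2 is simply the v2 change table, which is small and readable.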

ML (O)Ops! Improving and Deploying On-Device Models With Confidence (Part 1)

21 Feb

With Atul Dhingra.

Part 1 of a multi-part series.

It is well known that ML Engineers today spend most of their time doing things that do not have a lot to do with machine learning. They spend time working on technically unsophisticated but important things like deployment of models, keeping track of experiments, etc.—operations. Atul and I dive into the reasons behind the status quo and propose solutions, starting with issues to do with on-device deployments. 

Performance on Device

The deployment of on-device models is complicated by the fact that the infrastructure used for training is different from what is used for production. This leads to many tedious rollbacks. 

The underlying problem is missing data. We are missing data on the latency in prediction, which is a function of i/o latency and the time taken to compute. One way to impute the missing data is to build a model that predicts latency based on various features of the deployed model. Given many companies have gone through thousands of deployments and rollbacks, there is rich data to learn from. Another is to directly measure the time with ‘shadow deployments’: measure performance on redundant chips that are colocated with the production chip and that get exactly the same data at about the same time (a small lag in passing on the data to the redundant chips is just fine as we can start the clock at a different time).

Predicting latency given a model and deployment architecture solves the problem of deploying reliably. It doesn’t solve the problem of how to improve the performance of the system given a model. To improve the production performance of ML systems, companies need to analyze the data, e.g., compute the correlation between load on the edge server and latency, and generate additional data by experimenting with various easily modifiable parts of the system, e.g., increasing capacity of the edge server, etc. (If you are a cloud service provider like AWS, you can learn from all the combinations of infrastructure that exist to predict latency for various architectures given a model and propose solutions to the customer.)

There is plausibly also a need for a service that helps companies decide which chip is optimal for deployment. One solution is a service that provides data on the latency of a model on different chips.

To the Better End: How the Middle Can Improve the End

18 Feb

Neil deGrasse Tyson: “…[generational spaceships produce] interesting ethical questions … to bring an entire generation of humans into the world whose only mission is to bring another generation into the world with a goal that they will never see.”

Chuck Nice: “In a way, Neil, that is [the] kind of the spaceship that we’re on right now.”

Neil: “So you’re saying we already have a generation that we birth … and we train them to try to figure stuff out, and then we die off, and we will never know where that ends.”

Chuck: “…Absolutely! And we are all just doing that on a giant rock that’s floating through space on a destination to who knows where.”

Neil: “Actually, it’s not even …. [it is] just going around…”

Chuck: “…just going around in circles. We are the NASCAR of space travel right now!”

From a StarTalk episode on generational spaceships

Chuck nails it. We are the middle generations on a “spaceship.” We likely won’t get to answer the deepest questions like how something came from nothing. Our value lies in how well we provide three things to the next generation. 1. Nurturing a deeper inclination and greater ability to explore the deepest questions. 2. Leaving the next generation with better tools and more time to explore. 3. Giving them better skills to improve the world on all those fronts for the generation that comes after them.  

Based on the criteria above, we haven’t made enough progress. We have given people leisure time but also addictions to fill their leisure and not enough tools to choose wisely. We have also probably failed to instill a greater appreciation of the pleasures of answering the deepest questions. And we continue to leave the next generation with the burden of solving complex problems like climate change. We must rectify these failures if our lives are to matter, if we are to be more than the NASCAR going around and around the track.

Build Software for the Lay User

14 Feb

Most word processing software helpfully points out grammatical errors and spelling mistakes. Some even autocorrect. And some, like Grammarly, even give style advice.

Now consider software used for business statistics. Say you want to compute the correlation between two vectors: [100, 200, 300, 400, 500, 600] and [1, 2, 3, 4, 5, 17000]. Most (all?) software will output .65. (The software assumes you want Pearson’s correlation.) Experts know that the relatively large value in the second vector has a large influence on the correlation. For instance, switching it to -17000 will reverse the correlation coefficient to -.65. And if you remove the last observation, the correlation is 1. But a lay user would be none the wiser. Common software, e.g., Excel, R, Stata, Google Sheets, etc., does not warn the user about the outlier and its potential impact on the result. It should.
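Here is a sketch of what such a warning could look like: compute Pearson's correlation, recompute it with each observation left out, and flag the result when a single observation moves it a lot. The first vector is taken to be [100, 200, ..., 600], the values that reproduce the .65 quoted above. The 0.2 cutoff and the function names are arbitrary choices for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (assumes neither is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_with_warning(x, y, max_shift=0.2):
    """Return (r, warning); warning is set when dropping one point shifts r by more than max_shift."""
    r = pearson(x, y)
    loo = [pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(len(x))]
    worst = max(range(len(x)), key=lambda i: abs(r - loo[i]))
    warning = None
    if abs(r - loo[worst]) > max_shift:
        warning = (f"dropping observation {worst} changes r "
                   f"from {r:.2f} to {loo[worst]:.2f}")
    return r, warning


x = [100, 200, 300, 400, 500, 600]
y = [1, 2, 3, 4, 5, 17000]

r, warning = correlation_with_warning(x, y)
r_flipped, _ = correlation_with_warning(x, y[:-1] + [-17000])
print(round(r, 2), round(r_flipped, 2))
print(warning)
```

Leave-one-out recomputation is only one way to detect influential points; it is cheap for small vectors and easy to explain to a lay user, which is the point here.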

Take another example: the interpretation of AUC is fickle when you have binary predictors (see here), as much depends on how you treat ties. It is an obvious but subtle point. Commonly used statistical software, however, does not warn people about the issue.
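A small sketch of the ties issue, computing AUC directly as the share of (positive, negative) pairs that are ranked correctly, with the credit given to tied pairs left as a parameter. The data and the `tie_credit` parameter name are hypothetical.

```python
def auc(labels, scores, tie_credit=0.5):
    """AUC as the share of (positive, negative) pairs where the positive scores higher.

    tie_credit is the credit a tied pair receives (0.5 is the usual convention).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += tie_credit
    return total / (len(pos) * len(neg))


# Hypothetical data with a binary predictor, so most pairs are tied.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [1, 1, 0, 0, 0, 1, 0]

for credit in (0.0, 0.5, 1.0):
    print(credit, round(auc(labels, scores, tie_credit=credit), 3))
```

With 5 of the 12 pairs tied, the same predictions produce an AUC anywhere from 0.5 to about 0.92 depending on the convention, which is the kind of thing software could flag.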

Given the rate of increase in the production of knowledge, increasingly everyone is a lay user. For instance, in 2013, Lin showed that estimating the ATE using OLS with a full set of interactions improves the precision of the estimate. But such analyses are uncommon in economics papers. The analysis could be absent for a variety of reasons: 1. ignorance, 2. difficulty in estimating the model, 3. disbelief in the result, etc. However, only ignorance withstands scrutiny. The model is easy to estimate, so the second explanation is unlikely to explain much. The last explanation also seems unlikely, given the result was published in a prominent statistical journal and experts use it.
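To show how little code the point estimate takes, here is a sketch for a single covariate. It relies on an algebraic equivalence: in OLS with the covariate centered at its overall mean and interacted with treatment, the coefficient on treatment equals the gap between the two arm-specific regression lines evaluated at the overall covariate mean. The data are made up, the function names are hypothetical, and standard errors are omitted.

```python
from statistics import mean


def fit_line(xs, ys):
    """OLS fit of y = a + b*x; returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def interacted_ate(y, t, x):
    """ATE from the fully interacted regression with one covariate.

    Fits a separate line in each arm and compares the fitted values at the
    overall covariate mean, which matches the treatment coefficient in
    y ~ t + (x - mean(x)) + t * (x - mean(x)).
    """
    x_bar = mean(x)
    fitted = {}
    for arm in (0, 1):
        xs = [xi for xi, ti in zip(x, t) if ti == arm]
        ys = [yi for yi, ti in zip(y, t) if ti == arm]
        intercept, slope = fit_line(xs, ys)
        fitted[arm] = intercept + slope * x_bar
    return fitted[1] - fitted[0]


# Hypothetical data generated as y = 1 + 2x + 3t, so the true effect is 3.
x = [1, 2, 3, 4, 5, 6]
t = [1, 0, 1, 0, 1, 0]
y = [6, 5, 10, 9, 14, 13]

ate = interacted_ate(y, t, x)
diff_in_means = mean(yi for yi, ti in zip(y, t) if ti == 1) - mean(yi for yi, ti in zip(y, t) if ti == 0)
print(round(ate, 3), diff_in_means)
```

In this toy sample, the treated units happen to have lower covariate values, so the raw difference in means is far from 3 while the adjusted estimate recovers it.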

If ignorance is the primary explanation, should the onus of being well informed about the latest useful discoveries in methods fall on researchers working in a substantive area? Plausibly. But that is clearly not working very well. One way to accelerate the dissemination of useful discoveries is via software, where you can provide such guidance as ‘warnings.’ 

The guidance can be put in manually. Or we can use machine learning, exploiting the strategy used by Grammarly, which uses expert editors to edit lay user sentences and uses that as training data.

We can improve science by building software that provides better guidance. The worst case for such software is probably business-as-usual, where some researchers get bad advice, and many get no advice.

This Time It’s Different: Polarization of the American Polity

10 Jan

In a new paper, Pierson and Schickler contend that this era of polarization is different. They fear that polarization this time will continue to intensify because the three “meso-institutions” (interest groups, state parties, and the media) that were the bulwark against polarization in earlier eras are themselves polarized or have changed in ways such that they offer much less resistance:

  1. State Parties
    • State Parties Have Polarized “state party platforms are more similar across states and more distinctive across parties than in earlier eras (Paddock 2005, 2014; Hopkins & Schickler 2016).”
    • Federal Government is Much Bigger. This means state concerns matter less — which brought cross-cutting cleavages into play. “Although it has received less discussion in the analysis of polarization, a second development in the 1960s and early 1970s—what Skocpol (2003, p. 135) has termed the “long 1960s”—was also critical: a dramatic expansion and centralization of public policy (Melnick 1994, Pierson 2007, Jones et al. 2019). Civil rights legislation was only the entering wedge. During the long 1960s, liberal Congresses enacted, often on a bipartisan basis, major new domestic spending programs (especially Medicaid and Medicare, which now account for roughly a quarter of federal spending as well as, in the case of Medicaid, a big share of state spending). They greatly enlarged the regulatory state, creating powerful new federal agencies (such as the Environmental Protection Agency) and enacting extensive rules covering environmental and consumer protection as well as workplace safety.”
  2. Interest Groups Have Polarized
    • “The powerful US Chamber of Commerce provides a striking illustration of the broader trend. Traditionally conservative but studiously nonaligned, it now carefully coordinates its extensive electoral activities with the Republican Party, and its political director (a former GOP operative) can refer unselfconsciously to Republican Senate candidates as “our ticket” (Hacker & Pierson 2016).”
  3. Media: the usual story

Why This Time is Different

  • “The Civil War era represents an obvious extreme point in the intensity of divisions, yet the period of partisan polarization was remarkably brief: The major American parties featured deep internal divisions on slavery up until the mid-to-late 1850s, and the new Republican majority became deeply divided over Reconstruction and key economic questions soon after the war ended.”

Questions and Notes

  • Why are business interest groups not more bipartisan? For instance, if the US Chamber of Commerce is going hard R, is it a sign that it represents businesses of a particular sector/region? Is the consolidation of the economy (GDP) in cities causing this? If so, then how does the oncoming WFH change affect these things?
  • Given that wide swings in policy regimes are expensive for business (for one, they cannot plan), what kinds of plays will big businesses eventually come up with? In some ways, for instance, Twitter banning Trump is predictable. Businesses will opt for stability where they can.
  • The more frightening turn in American politics is toward populism and identity politics—so much for the end of politics.
  • The party coalitions keep evolving. For instance, in 2020, poor White people were firmly in the column of Republicans. As late as 2004, as Bartels pointed out, they were not.

Liberalizing Daughters: Do Daughters Cause MCs to be Slightly More Liberal on Women’s Issues?

25 Dec

Two papers estimate the impact of having a daughter on Members of Congress’ (MCs’) positions on women’s issues. Washington (2008) finds that each additional daughter (conditional on the number of children) causes about a 2 point increase in liberalism on women’s issues using data from the 105th to 108th Congress. Costa et al. (2019) use data from the 110th to 114th Congress and find a noisily estimated small effect that cannot be distinguished from zero.

Same Number, Different Interpretation

Washington (2008) argues that a 2 point effect is substantive. But Costa et al. argue that a 2–3 point change is not substantively meaningful.

“In all five specifications, the score increases by about two points with each additional daughter parented. For all but the 106th Congress, the number of female children coefficient is significantly different from zero at conventional levels. While that two point increase may seem small relative to the standard deviations of these scores, note that the female legislators, on average, score a significant seven to ten points higher on these rating scores. In other words, an additional daughter has about 25% of the impact on women’s issues that one’s own gender has.”

From Washington 2008

“The lower bound of the confidence interval for the first coefficient in Model 1, the effect of having a daughter on AAUW rating, is −3.07 and the upper bound is 2.01, meaning that the increase on the 100-point AAUW scale for fathers of daughters could be as high as 2.01 at the 90% level, but that AAUW score could also decrease by as much as 3.07 points for fathers of daughters, which is in the opposite direction than previous literature and theory would have us expect. In both directions, neither the increase nor the decrease is substantively very meaningful.”

From Costa et al. 2019

Different Numbers

The two papers, Washington’s and Costa et al.’s, come to different conclusions. But why? Besides different data, there are a fair number of other differences in modeling choices, including (p.s. this is not a comprehensive list):

  1. How the number of children is controlled for. Washington uses fixed effects for the number of children. This makes sense if you conceive of the number of daughters as a random variable within people with the same number of children. Another way to think of it is as a block randomized experiment. Costa et al. write, “Following Washington (2008), we also include a control variable for the total number of children a legislator has.” But they control for it linearly.
  2. Dummy vs. Number of Daughters. Costa et al. have a ‘has daughter’ dummy that codes as 1 any MC with one or more daughters, while Washington uses the number of daughters as the ‘treatment’ variable.
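To make the two differences concrete, here is a minimal sketch of how the same legislators would be coded under each choice. It is not either paper's code: the toy data, the function names, and the column layout are all made up for illustration, and the rows omit every other covariate (and the intercept) that a real specification would carry.

```python
# Hypothetical legislators: (number of children, number of daughters).
legislators = [(1, 0), (1, 1), (2, 1), (3, 2), (3, 0)]

def washington_style_row(n_children, n_daughters, child_levels):
    """Treatment is the count of daughters; family size enters as one
    indicator per distinct number of children (fixed effects)."""
    fixed_effects = [1 if n_children == level else 0 for level in child_levels]
    return [n_daughters] + fixed_effects

def costa_style_row(n_children, n_daughters):
    """Treatment is a 'has any daughter' dummy; family size enters as a
    single linear term."""
    has_daughter = 1 if n_daughters >= 1 else 0
    return [has_daughter, n_children]

# Distinct family sizes observed in the toy data.
child_levels = sorted({n_children for n_children, _ in legislators})

washington_rows = [washington_style_row(c, d, child_levels) for c, d in legislators]
costa_rows = [costa_style_row(c, d) for c, d in legislators]

for (c, d), w, k in zip(legislators, washington_rows, costa_rows):
    print(f"children={c} daughters={d} fixed-effects row={w} linear/dummy row={k}")
```

With fixed effects, the daughters coefficient is identified only from comparisons among legislators with the same number of children. The linear control instead assumes that each additional child shifts the score by the same amount.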

Common Issues

The primary dependent variable is votes chosen by an interest group. Doing so causes multiple issues. The first is incommensurability across time. The chosen votes differ across Congresses not only because the interest group’s process for choosing votes likely differs but also because the process that determines what comes up for a vote differs. So it could be the case that the effect hasn’t changed but the measurement instrument has. The second issue is that interest groups are incredibly strategic in choosing the votes. And that means they choose votes that don’t always have a strong, direct, unique, and obvious relationship to women’s welfare. For instance, AAUW chose the vote to confirm Neil Gorsuch as one of the votes. There are likely numerous considerations that go into voting for Neil Gorsuch, including conflicting considerations about women’s welfare. For instance, a senator who supports a woman’s right to choose may vote for Neil Gorsuch, despite concern that the judge will vote against it, because they think Gorsuch would support liberalizing the economy further, which would benefit women’s economic status, and which the senator may view as more important. Third, the number of votes chosen is tiny. For the 115th Congress, there are only 7 votes for the Senate and only 6 for the House of Representatives. Fourth, it seems the papers treat the House of Representatives and Senate interchangeably when the votes are different. Fifth, one of the issues with imputing ideology from congressional votes is that the set of issues over which people get to express preferences is limited. So the implied differences are generally smaller than actual ideological differences. The point affects how we interpret the results.

It Depends! Effect of Quotas On Women’s Representation

25 Dec

“[Q]uotas are often thought of as temporary measures, used to improve the lot of particular groups of people until they can take care of themselves.”

Bhavnani 2011

So how quickly can we withdraw the quota? The answer depends, plausibly on space, office, and time.

“In West Bengal …[i]n 1998, every third G[ram] P[anchayat] starting with number 1 on each list was reserved for a woman, and in 2003 every third GP starting with number 2 on each list was reserved” (Beaman et al. 2012). Beaman et al. exploit this random variation to estimate the effect of reservation in prior election cycles on women being elected in the subsequent elections. They find that 1. just 4.8% of the elected ward councillors in non-reserved wards are women, 2. this number doesn’t change if a GP has been reserved once before, and 3. it shoots up to a still-low 10.1% if the GP has been reserved twice before (see the last column of Table 11 below).

From Beaman et al. 2012

In a 2009 article, Bhavnani, however, finds a much larger impact of reservation in Mumbai ward elections. He finds that a ward being reserved just once before causes a nearly 18 point jump (see the table below) starting from a lower base than above (3.7%).

From Bhavnani 2009

p.s. Despite the differences, Beaman et al. footnote Bhavnani’s findings as: “Bhavnani (2008) reports similar findings for urban wards of Mumbai, where previous reservation for women improved future representation of women on unreserved seats.”

Beaman et al. also find that reservations reduce men’s biases. However, a 2018 article by Amanda Clayton finds that this doesn’t hold true (though the CIs are fairly wide) in Lesotho.

From Clayton 2018

Political Macroeconomics

25 Dec

Look Ma, I Connected Some Dots!

In late 2019, in a lecture at the Watson Center at Brown University, Raghuram Rajan spoke about the challenges facing the Indian economy. While discussing the trends in growth in the Indian economy (I have linked to the relevant section in the video; see below for the relevant slide), Mr. Rajan notes:

“We were growing really fast before the great recession, and then 2009 was a year of very poor growth. We started climbing a little bit after it, but since then, since about 2012, we have had a steady upward movement in growth going back to the pre-2000, pre-financial crisis growth rates. And then since about mid-2016 (GS: a couple of years after Mr. Modi became the PM), we have seen a steady deceleration.”

Raghuram Rajan at the Watson Center at Brown in 2019 explaining the graph below

The statement is supported by the red lines that connect the deepest valleys with the highest peak, eagerly eliding over the enormous variation in between (see below).

See Something, Say Some Other Thing

Not to be left behind, Mr. Rajan’s interlocutor Mr. Subramanian shares the following slide about investment collapse. Note the title of the slide and then look at the actual slide. The title says that the investment (tallied by the black line) collapses in 2010 (before Mr. Modi became PM).


If you are looking to learn more about some of the common techniques people use to lie with charts, you can read How Charts Lie. (You can read my notes on the book here.)

Superhuman: Can ML Beat Human-Level Performance in Supervised Models?

20 Dec

A supervised model cannot do better than its labels. (I revisit this point later.) So the trick is to make labels as good as you can. The errors in labels stem from three sources: 

  1. Lack of Effort: The more effort people spend labeling something, the more accurate the label will presumably be.
  2. Unclear Directions: Unclear directions can result from a. poorly written directions, b. conceptual issues, c. poor understanding. Let’s tackle conceptual issues first. Say you are labeling the topic of news articles. Say you come across an article about how Hillary Clinton’s hairstyle has evolved over the years. Should it be labeled as politics, or should it be labeled as entertainment (or my preferred label: worthless)? It depends on taste and the use case. Whatever the decision, it needs to be codified (and clarified) in the directions given to labelers. Poor writing is generally a result of inadequate effort.
  3. Hardness: Is that a 4 or a 7? We have all suffered enough at the hands of CAPTCHA to know that some tasks are harder than others.

The fix for the first problem is obvious. To increase effort, incentivize. Incentivize by paying for correctness—measured over known-knowns—or by penalizing mistakes. And by providing feedback to people on the money they lost or how much more others with a better record made.
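One way to picture paying for correctness over known-knowns is a small sketch like the one below. Everything in it is an assumption made for illustration: the function names, the per-item rates, the 90% threshold, and the toy labels. It scores a labeler only on items with a known answer and pays a bonus when that accuracy clears the threshold.

```python
def accuracy_on_gold(labels, gold):
    """Share of gold (known-answer) items the labeler got right."""
    scored = [item for item in gold if item in labels]
    if not scored:
        return None  # the labeler saw no gold items, so accuracy is unknown
    correct = sum(labels[item] == gold[item] for item in scored)
    return correct / len(scored)

def payout(n_labeled, accuracy, base_rate=0.05, bonus_rate=0.05, threshold=0.9):
    """Base pay per item, plus a per-item bonus only when accuracy on the
    gold items clears the threshold. All rates are hypothetical."""
    pay = n_labeled * base_rate
    if accuracy is not None and accuracy >= threshold:
        pay += n_labeled * bonus_rate
    return pay

# Hypothetical gold answers and one labeler's submissions.
gold = {"a1": "politics", "a2": "sports", "a3": "entertainment", "a4": "politics"}
labeler = {"a1": "politics", "a2": "sports", "a3": "politics",
           "a4": "politics", "a5": "sports"}

acc = accuracy_on_gold(labeler, gold)       # 3 of the 4 gold items are correct
pay = payout(len(labeler), acc)
# Feedback for the labeler: what a perfect record would have earned on top.
missed = payout(len(labeler), 1.0) - pay
print(f"gold accuracy={acc:.2f}, paid={pay:.2f}, left on the table={missed:.2f}")
```

The `missed` figure is the kind of feedback described above: it tells the labeler how much money a better record would have earned.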

Solutions for unclear directions vary by the underlying problem. To address conceptual issues, incentivize people to flag (and comment on) cases where the directions are unclear and build a system to collect and review prediction errors. To figure out if the directions are unclear, quiz people on comprehension and archetypal cases. 

Can ML Performance Be Better Than Humans?

If humans label the dataset, can ML be better than humans? The first sentence of the article suggests not. Of course, we have yet to define what humans are doing. If the benchmark is labels provided by a poorly motivated and trained workforce and the model is trained on labels provided by motivated and trained people, ML can do better. The consensus label provided by a group of people will also generally be less noisy than one provided by a single person.    
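The claim about consensus labels can be checked with a back-of-the-envelope calculation. The sketch below assumes a binary label, labelers who err independently, and a common per-labeler accuracy of 70%. All three are simplifying assumptions of mine, not something the argument above depends on.

```python
from math import comb

def majority_correct_prob(p, n):
    """Probability that the majority of n independent labelers is right when
    each is right with probability p. n must be odd so there are no ties."""
    if n % 2 == 0:
        raise ValueError("use an odd n so there are no ties")
    k_min = n // 2 + 1  # smallest number of correct labelers that forms a majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

single = majority_correct_prob(0.7, 1)
three = majority_correct_prob(0.7, 3)
five = majority_correct_prob(0.7, 5)
print(f"1 labeler: {single:.3f}, 3 labelers: {three:.3f}, 5 labelers: {five:.3f}")
```

Under these assumptions, accuracy rises with the size of the panel. If labelers share the same confusion (say, because the directions are unclear), the errors are correlated and the gain shrinks.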

Andrew Ng brings up another funny way ML can beat humans—by not learning from human labels very well. 

When training examples are labeled inconsistently, an A.I. that beats HLP on the test set might not actually perform better than humans in practice. Take speech recognition. If humans transcribing an audio clip were to label the same speech disfluency “um” (a U.S. version) 70 percent of the time and “erm” (a U.K. variation) 30 percent of the time, then HLP would be low. Two randomly chosen labelers would agree only 58 percent of the time (0.7² + 0.3²). An A.I. model could gain a statistical advantage by picking “um” all of the time, which would be consistent with 70 percent of the time with the human-supplied label. Thus, the A.I. would beat HLP without being more accurate in a way that matters.
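The arithmetic in the quoted scenario is easy to verify. Two independent labelers agree when both write “um” or both write “erm,” while a model that always writes the more common label matches a randomly drawn human label as often as that label occurs. A minimal check, with variable names of my own choosing:

```python
# Label shares from the quoted scenario.
label_shares = {"um": 0.7, "erm": 0.3}

# Two independent labelers agree when both pick "um" or both pick "erm".
human_agreement = sum(share**2 for share in label_shares.values())

# A model that always outputs the most common label matches a random human
# label with probability equal to that label's share.
always_majority_agreement = max(label_shares.values())

print(f"human-human agreement: {human_agreement:.2f}")
print(f"always-'um' agreement: {always_majority_agreement:.2f}")
```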

The scenario that Andrew draws out doesn’t seem very plausible. But the broader point about thinking hard about cases which humans are not able to label consistently is an important one and worth building systems around.

No Shit! Open Defecation in India

20 Dec

On Oct. 2nd, 2019, on Mahatma Gandhi’s 150th birthday, and just five years after the launch of the Swachh Bharat Campaign, Prime Minister Narendra Modi declared India open-defecation free (ODF).

Note the legend at the bottom. The same legend applies to the graphs in the gallery below.

The 2018-2019 Annual Sanitation Survey corroborates the progress:

From the 2018-19 National Annual Rural Sanitation Survey

Reducing open defecation matters because it can reduce child mortality and stunting. For instance, reducing open defecation to the levels among Muslims can increase the share of children surviving till the age of 5 by 1.7 percentage points. Coffey and Spears make the case that open defecation is the key reason why India is home to nearly a third of the stunted children in the world. (See this paper as well.) (You can read my notes on Coffey and Spears’ book here.)

If the data are right, it is a commendable achievement, except that the data are not. As the National Statistical Office 2019 report, published just a month after the PM’s announcement, finds, only “71.3% of (rural) households [have] access to a toilet” (BBC). 

The situation in some states is considerably grimmer.

Like the infomercial where the deal only gets better, the news here only gets worse. For India to be ODF, people not only need to have access to the toilets but also need to use them. It is a key point that Coffey and Spears go to great lengths to explain. They report results from the SQUAT survey, which finds that of the households with latrines, 40% of the households have at least one person who defecates outside.

The government numbers stink. But don’t let the brazen number fudging take away from the actual accomplishment of building millions of toilets and a 20+ percentage point decline in open defecation in rural areas between 2009 and 2017 (based on WHO and Unicef data). (The WHO and Unicef data are corroborated by other sources including the 2018 r.i.c.e survey, which finds that “44% of rural people over two years old in rural Bihar, Madhya Pradesh, Rajasthan, and Uttar Pradesh defecate in the open. This is an improvement: 70% of rural people in the 2014 survey defecated in the open.”)

No Props for Prop 13

14 Dec

Proposition 13 enacted two key changes: 1. it limited property tax to 1% of the cash value, and 2. limited annual increase of assessed value to 2%. The only way the assessed value can change by more than 2% is if the property changes hands (a loophole allows you to change hands without officially changing hands). 

One impressive result of the tax is the inequality in taxes. Sample this neighborhood in San Mateo where taxes range from $67 to nearly $300k.

Take out the extremes, and the variation is still hefty. Property taxes of neighboring lots often vary by well over $20k. (My back-of-the-envelope estimate of the standard deviation based on ten properties chosen at random is $23k.)
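The mechanism behind the spread can be sketched with two identical neighboring houses, one held for 30 years and one bought today. The purchase price, the 6% market appreciation, and the holding period are hypothetical numbers picked for illustration. The sketch also assumes that the assessed value rises by the full 2% whenever the market rises faster, and it ignores everything except the 1% rate and the 2% cap described above.

```python
TAX_RATE = 0.01        # tax limited to 1% of assessed value
ASSESSMENT_CAP = 0.02  # assessed value may rise by at most 2% a year

def assessed_value(purchase_price, years_held, market_growth):
    """Assessed value after holding a property, growing each year at the
    lesser of the market rate and the cap (a simplifying assumption)."""
    yearly_growth = min(market_growth, ASSESSMENT_CAP)
    return purchase_price * (1 + yearly_growth) ** years_held

def annual_tax(assessed):
    return TAX_RATE * assessed

# Hypothetical inputs.
price_then = 200_000   # what the long-time owner paid
market_growth = 0.06   # yearly market appreciation
years = 30

# The new buyer pays today's market price, which resets the assessment.
market_value_now = price_then * (1 + market_growth) ** years

long_time_owner_tax = annual_tax(assessed_value(price_then, years, market_growth))
new_buyer_tax = annual_tax(assessed_value(market_value_now, 0, market_growth))

print(f"long-time owner pays about ${long_time_owner_tax:,.0f} a year")
print(f"new buyer next door pays about ${new_buyer_tax:,.0f} a year")
```

The two bills diverge because the assessment resets to the market price only when the property changes hands. The longer the holding period and the faster the market rises, the wider the gap.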

Sample another from Stanford where the range is from ~$2k to nearly $59k.

Prop. 13 has a variety of more material perverse consequences. Property taxes are one reason why people move from their suburban houses near the city to other more remote, cheaper places. But Prop. 13 reduces the need to move out. This likely increases property prices, which in turn likely lowers economic growth as employers choose other places. And as Chaste, a long-time contributor to the blog, points out, it also means that the currently employed often have to commute longer distances, which harms the environment in addition to harming the families of those who commute.

p.s. Looking at the property tax data, you see some very small amounts. For instance, $19 property tax. When Chaste dug in, he found that the property was last sold in 1990 for $220K but was assessed at $0 in 2009 when it passed on to the government. The property tax on government-owned properties and affordable housing in California is zero. And Chaste draws out the implication: “poor cities like Richmond, which are packed with affordable housing, not only are disproportionately burdened because these populations require more services, they also receive 0 in property taxes from which to provide those services.”

p.p.s. My hunch is that a political campaign that uses property taxes in CA as a targeting variable will be very successful.

p.p.p.s. Chaste adds: “Prop 13 also applies to commercial properties. Thus, big corps also get their property tax increases capped at 2%. As a result, the sales are often structured in ways that nominally preserve existing ownership.

There was a ballot proposition on the November 2020 ballot, which would have removed Prop 13 protections for commercial properties worth more than $3M. Residential properties over $3M would continue to enjoy the protection. Even this prop failed 52%-48%. People were perhaps scared that this would be the first step in removing Prop 13 protections for their own homes.”

Sense and Selection

11 Dec

The following essay is by Chaste. The article was written in early 2018.


I will discuss the confounding selection strategies of England, India, and South Africa in the recently finished series. I won’t talk about minutiae like whether Vince’s technique is suited to Australian conditions or whether Rohit Sharma with his current form or Rahane with his overseas quality should have started the series. This is about basic common sense and basic cricketing sense, which a sharp 10-year-old has, and which the selectors appear to lack. Part 1 talks about England’s Ashes selection; Part 2 is about India and South Africa’s selections in the recent Test series.

Part 1

In the recent Ashes, were it not for Cook’s 244 in Melbourne, England would have lived up to their billing as 5-nil candidates. The 5-nil billing was unusual since England was 3rd in the ICC rankings on 105, and Australia was 5th on 97. So how did we get to the expectation of a whitewash?

The English team selection appeared almost geared to maximize the chances of a whitewash. The basics of selection are to identify certain spots and to select enough good options for the uncertain spots. The certain spots were clear: 1 wicketkeeper in Bairstow, two batsmen in Root and Cook, and four bowler/all-rounders in Anderson, Broad, Stokes, and Ali. In addition, Stoneman and Woakes were half-certain spots—sure to play at least 2-3 matches.

The selectors’ job was clear: make enough good selections to address the remaining 2.5 batting spots and the 0.5 bowling spot. And what did they do? They selected three batsmen (Ballance, Vince, and Malan) for the 2.5 batting spots and three bowlers (Ball, Overton, and Crane) for the 0.5 bowling spot.

Brilliant! This left England’s batting no margin for error. There was no backup opener, in effect locking in Stoneman for all five matches. Vince had a county average last season of 33, not much higher than Kyle Abbott, a tail-ender and Vince’s mate at Hampshire, who averaged 30. Let us also not forget that England’s primary innovation in the last couple of years is to become a very attractive batting side that can’t play swing, spin, pace, or bounce. True, the fragility of the English batting is hardly the selectors’ fault. It’s due primarily to England’s ground rating system, where the groundsmen get perfect scores for preparing perfect roads. But it is still the selectors’ job to address this fragility in their selections. Given that Australian wickets don’t turn much and that the open positions were 2, 3, and 5, you would have expected England to take a couple of spare openers (Robson and Roy, for example) who could have batted in any of those positions. Instead, they took only Ballance.

And what were the bowling selections for which England’s batting options were sacrificed? Neither of the two pace backups provided any variety to the attack. There is simply nothing that Ball and Overton can do that is better or different than Woakes. Plunkett, suited to Australian conditions, was ignored. Wood was ignored for the bizarre reason that he might not last the entire series. But wait, there was no chance that Ball or Overton (let alone both) would have played all five matches. Crane was selected on the chance that he might play in one match. Besides, Wood would not have been a good replacement for Woakes in more than 2–3 matches, so demanding his fitness for all five matches was pointless. As if all this absurdity wasn’t enough, when Stokes was ruled out, they replaced a batting all-rounder with another quick bowler/drinks carrier (Finn).

So what made the English selectors adopt strategies that maximized the chances of a whitewash? In recent years, England has adopted a policy of giving every batsman at least a 5–7 test run before the drop: plenty of chances to shine/rope to hang yourself. While the policy makes sense for experienced players, its merits for new batsmen are dubious. I don’t know that an excruciatingly prolonged examination of Roy’s form or Keaton Jennings’ technique during last summer helped those players. To say nothing of burdening the rest of the team with passengers. It is the kind of policy that only world-beating sides can afford. But England stuck to it even though they were looking at a 5-nil drubbing. Since each batsman had at least five tests left in their allotted “chance to fail or shine quota,” England didn’t pick alternate batsmen.

Part 2

There is a basic difference between batsmen and bowlers. Batsmen must stop batting as soon as they get out. Hence, when you increase the number of batsmen in your side, you are likely to get a higher score. Bowlers, on the other hand, can bowl until they drop down dead. Thus, in theory, bowling only Marshall and Garner would help you bowl the opposition out most cheaply. You add bowlers (Holding and Croft, for example) only to provide:

  • Adequate rest so that all bowlers can function properly.
  • Necessary variations: types of pace, bounce, swing, spin, etc.

Thus, your best combination is always the minimum number of bowlers (4) and the maximum number of batsmen (6 + keeper). Even if your side is blessed with a great all-rounder like Imran Khan or Keith Miller, you still go with six specialist batsmen. If you are looking to your 5th bowler for wickets, you have selected your top 4 bowlers poorly. It’s very helpful to have a batting all-rounder who can bowl well enough to rest the four main bowlers without releasing pressure. A great example is Mitchell Marsh in the recent Ashes, even though he didn’t take a single wicket all series.

There are a few cases where a 5th bowler/bowling all-rounder can be useful:

  • There is simply no chance of your team losing on a wicket full of runs. The only possibilities are a draining draw or going for a win on the 5th day.
  • The specialist batsmen on your bench don’t bat any better than your all-rounders. Recent England sides are a good example.

Far from having one or both of the above, this series … 

  • Was the first in test history with three or more matches in which every match saw the fall of 40 wickets.
  • Saw an average innings total of 218: South Africa’s average was 230, and India’s was 206.
  • Saw fewer than 350 overs (less than four full days of play) in its longest match.

Predictably then, the 5th bowlers were largely a waste. Ashwin and Maharaj bowled 18.1 overs in match 1, and Phehlukwayo and Pandya bowled 18 overs in match 3. That’s right: they averaged less than five overs per innings over these two matches: a few balls more than the T20 quota. And it is for this reason that India dropped Rahane / Rohit Sharma, and South Africa dropped Bavuma.

Of course, we know that the 5th bowler is meant to signal aggression, positive intent, and other such buzzwords. But to an intelligent opponent, it only signals that you are clueless about test cricket. It is akin to Kohli repeatedly getting out to a 6th stump line in England, which shows a lack of understanding of the basics of test cricket. It is understandable that with an unrelenting diet of different forms of cricket, young cricketers like Kohli may not understand the basics specific to each form. But we have a right to expect better from the selectors and coaches.

About Chaste

Chaste is a consumer in the addiction economy. He spends half his time on Cricinfo and the other half hating himself for spending half his time on Cricinfo.

Subscribing To Unpopular Opinion

11 Dec

How does the move from advertising-supported content to a subscription model, e.g., NY Times, Substack luminaries, etc., change the content being produced? Ben Thompson mulls over the question in a new column. One of the changes he foresees is that the content will be increasingly geared toward subscribers—elites who are generally interested in “unique and provocative” content. The focus on unique and provocative can be problematic in at least three ways: 

  1. “Unique and provocative” doesn’t mean correct. And since people often mistake original, counterintuitive points for deep, correct, and widely true insights about the world, it is worrying. The other danger is that journalism will devolve into English literature.
  2. As soon as you are in the idea generation business, you pay less attention to “obvious” things, which are generally the things that deserve our most careful attention.
  3. There is a greater danger of people falling into silos. Ben quotes Jonah Peretti: “A subscription business model leads towards being a paper for a particular group and a particular audience and not for the broadest public.” Ben summarizes Peretti’s point as: “He’s alluding, in part, to the theory that the Times’s subscriber base wants to read a certain kind of news and opinion — middle/left of center, critical of Donald Trump, etc. — and that straying from that can cost it subscribers.”

There are other changes that a subscriber-driven model will wreak. The production of news will favor the concerns of the elites even more. The demise of the “newspaper of record” will mean that a common understanding of what is important and how we see things will continue to decline.

p.s. It is not lost on me that Ben’s newsletter is one such subscriber-driven outlet.

Too Much Churn: Estimating Customer Churn

18 Nov

A new paper uses financial transaction data to estimate customer churn in consumer-facing companies. The paper defines churn as follows:

There are three concerns with the definition:

  1. The definition doesn’t make clear what the normalizing constant for calculating the share is. Given that the value “can vary between zero and one,” presumably the normalizing constant is either a) total revenue in the same year in which the customer buys products or b) total revenue in the year in which the firm’s revenue was greater.
  2. If the denominator when calculating s_fit is the total revenue in the same year in which the customer buys products from the company, it can create a problem. Consider a case where a customer spends $10 in both year t and year t-k. And assume that the firm’s revenue in the same years is $10 and $20, respectively. In this case, the customer hasn’t changed their behavior, but their share has gone from 1 to .5.
  3. Beyond this, there is a semantic point. Churn is generally used to refer to attrition. In this case, it covers both customer acquisition and attrition. It also covers both a reduction and an increase in customer spending.
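The second concern is simple arithmetic, sketched below under option (a), where the denominator is the firm's total revenue in the same year. The function name is mine, and the snippet only computes the shares. It does not reproduce the paper's churn formula.

```python
def revenue_share(customer_spend, firm_revenue):
    """Customer's share of the firm's revenue in a single year."""
    return customer_spend / firm_revenue

# Numbers from the example above: the customer spends $10 in both years,
# while the firm's revenue is $10 in one year and $20 in the other.
share_low_revenue_year = revenue_share(10, 10)   # 1.0
share_high_revenue_year = revenue_share(10, 20)  # 0.5

spend_change = 10 - 10
share_change = share_high_revenue_year - share_low_revenue_year

print(f"change in spending: {spend_change}, change in share: {share_change}")
```

The customer's spending is flat, yet the share moves by half a point. So a share-based measure would register a change that comes entirely from the firm's other customers.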

A Fun Aside

“Netflix similarly was not in one of our focused consumer-facing industries according to our SIC classification (it is found with two-digit SIC of 78, which mostly contains movie producers)” — this tracks with my judgment of Netflix.