One common variety of empirical paper in finance uses the staggered adoption of a policy across states to identify the impact of the policy on outcomes, e.g., profits, of firms registered in the state. There are a host of problems with such empirical strategies, including, prominently, 1. policy adoption and timing are generally endogenous to firm activity within and across states and 2. it is hard to account for the impact of other policies that may have been adopted after a policy and may explain the outcomes. Recent econometric literature adds to the list of issues by pointing out a serious problem with using two-way fixed effects to analyze such data. I add to the pile. After the first state has adopted a policy, any firm registrations are endogenous to the policy environment: firms choose to locate (relocate) based on state policies. So when analyzing data from such a research design, we may want to fix the cohort of firms in each state to those registered there before the first state changed its policy and treat subsequent registration changes, etc., as non-compliance.
Say that people can be easily identified by characteristic C. Say that the average tip left by people of group C_A is smaller than that left by people outside the group (!C_A), with a wide variance in tipped amounts within each group. Let’s assume that the quality of service (two levels: high or low) is pro-rated by the expected tip amount. Let’s assume that the tip left by a customer is explained by the quality of service. And let’s also assume that the expected tip amount from C_A is low enough to motivate low-quality service. The tip is provided after the service. Assume no repeat visits. The optimal strategy for the customer is to not tip. But the service provider notices the departure from rationality from customers and serves accordingly. If the server had complete information about what each person would tip, then the service would be perfectly calibrated by the tipped amount. However, the server can only rely on crude surface cues, like C, and estimate the expected value of the tip. Given that, the optimal strategy for the server is to provide low-quality service to C_A, which would lead to a negative spiral.
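A minimal simulation sketch of the spiral, under assumptions that are mine and not part of the setup above: the server tracks a running mean of tips per group, serves well only when that mean clears a threshold, and customers in both groups respond to service in the same way. The function name, the dollar amounts, and the prior weight are all hypothetical.

```python
import random

def final_belief(prior_mean, prior_weight=10, threshold=15.0, rounds=200, seed=0):
    """Return the server's belief about a group's expected tip after `rounds` customers."""
    rng = random.Random(seed)
    belief, n = prior_mean, prior_weight
    for _ in range(rounds):
        # The server sees only the group label, so service depends on the group belief.
        high_quality = belief >= threshold
        # Assumption: tips respond to service quality, with wide individual variation.
        tip = max(0.0, rng.gauss(18.0 if high_quality else 10.0, 4.0))
        n += 1
        belief += (tip - belief) / n  # running mean of observed tips
    return belief

belief_a = final_belief(prior_mean=12.0)   # group C_A starts below the threshold
belief_b = final_belief(prior_mean=18.0)   # group !C_A starts above it
print(f"C_A belief: {belief_a:.1f}, !C_A belief: {belief_b:.1f}")
```

In this sketch, the two groups are identical in how they respond to service, yet the server's beliefs stay apart because each belief generates the service level that confirms it.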
A host of cloud services are missing the core functionality needed to build businesses on top of the services. Powerful services on mature platforms like Google Vision, etc., have a common set of deficiencies—they do not allow clients to send information about preferred latency and throughput (for a price), and they do not allow clients to programmatically define SLAs (again, for a price). (If you read the documents of Google Vision, there is no mention of how quickly the API will return the answer for a document of a particular size.) One price for ~all requesters is the norm. Not only that, in the era of ‘endless scalable compute,’ throttling is ubiquitous.
There are two separate ideas here. The first is about how to solve one-off needs around throughput and latency. For a range of services, we can easily provide a few options that price in bandwidth and server costs. For a certain volume of requests, the services may require that the customer send a request outlining the need with enough lead time to boot new servers. The second idea is about programmatically signing SLAs. Rather than asking customers to go back and forth with Sales around custom pricing for custom needs, providing a few options for a set of standard use cases may be more expedient.
Some low-level services like S3 work almost like that today. But the move to abstract this paradigm out to higher-level services has largely not begun. I believe it is time.
Traditionally software has been distributed as a binary. The customer “grants” the binary a broad set of rights on the machine and expects the application to behave, e.g., not snoop on personal data, not add the computer to a botnet, etc. Most SaaS can be delivered with minor alterations to the above—finer access control and usage logging. Such systems work on trust—the customer trusts that the vendor will do the right thing. It is a fine model but does not work for the long tail. For the long tail, you need a system that grants limited rights to the application and restricts what data can be sent back. This kind of model is increasingly common on mobile OS but absent on many other “platforms.”
The other big change over time in software has been how much data is sent back to the application maker. In a typical case, the SaaS application is delivered via a REST API, and nearly all the data is posted to the application’s servers. This brings up issues about privacy and security, especially for businesses. Let me give an example. Say there is an app that can summarize documents. And say that a business has a few million documents in a Dropbox folder on which it would like to run this application. Let’s assume that the app is delivered via a REST API, as many SaaS apps are. And let’s assume that the business doesn’t want the application maker to ‘keep’ the data. What’s the recourse? Here are a few options:
- Trust me. Large vendors like Google can credibly commit to models where they don’t store customer data. To the extent that storing customer data is valuable to the application developer, the application developer can also use price discrimination, providing separate pricing tiers for cases where the data is logged and where it isn’t. For example, see the Google speech-to-text API.
- Trust but verify. The application developer claims to follow certain policies, but the customer is able to verify, e.g., audit access policies and logs. (A weaker version of this model is relying on industry associations that ‘certify’ certain data handling standards, e.g., SOC2.)
- Trusted third-party. The customer and application developer give some rights to a third party that implements a solution that ensures privacy and protects the application developer’s IP. For instance, AWS provides a model where the customer data and algorithm are copied over to an air-gapped server and the outputs written back to the customer’s disk.
Of the three options, the last option likely reduces friction the most for long tail applications. But there are two issues. First, such models are unavailable on a wide variety of “platforms,” e.g., Dropbox, etc. (or easy integrations with the AWS offering are uncommon). The second is that air-gapped copying is but one model. A neutral third party can provide interesting architectures, including strong port observability and customer-in-the-loop “data emission” auditing, etc.
It used to be that retail prices of generic products like coffee mugs, soap, etc., moved slowly. Not anymore. On major web retailers like Amazon, for a range of generic household products, the variation in prices over short periods of time is immense. For instance, on 12-Piece Porcelain, 12 Oz. Coffee Mug Set, the price ranged between $20.50 and $35.71 over the last year or so, with a hefty day-to-day variation.
On PCPartPicker, the variation in prices for Samsung SSD is equally impressive. Prices zig-zag on multiple sites (e.g., Dell, Adorama) by $100 over a matter of days multiple times over the last six months. (The cross-site variation—price dispersion—at a particular point in time is also impressive.)
Take another example. Softsoap Liquid Hand Soap, Fresh Breeze – 7.5 Fl Oz (Pack of 6) shows a very high-frequency change between $7.44 and $11. (See also Irish Spring Men’s Deodorant Bar Soap, Original Scent – 3.7 Ounce.)
What explains the within-site over-time variation? One reason could be supply and demand. There are three reasons I am skeptical of the explanation. First, on Amazon, the third-party new item price time series and Amazon price time series do not appear to be correlated (statistics by informal inspection or, as one of my statistics professors used to call it, the ocular distortion test, so caveat emptor). On PCPartPicker, you see much the same thing: the cross-retailer price time series frequently cross over. Second, related to the first point, we should see a strong correlation in over-time price curves across substitutes. We do not. Third, the demand for generic household products should be readily forecastable, and the optimal dry good storage strategy is likely not storing just enough. Further, I am skeptical of strong non-linearities in the marginal cost of furnishing an item that is not in the inventory; much of it should be easily replenishable.
The other explanation is price exploration, with Amazon continuously exploring the profit-maximizing price. But this is also unpersuasive. The range over which the prices vary over short periods of time is too large, especially given substitutes and absent collusion. Presumably, companies have thought about the negative consequences of such wide price exploration bands. For instance, you cannot build a reputation as the ‘cheapest’ (unless there is coordination or structural reason for prices to move together.)
So I come up empty when it comes to explanations. There is the crazy algorithm theory: as inventory dwindles, Amazon really hikes the price, and when it sees no sales, it brings the price right back down. It may explain the frequent sharp movements over a fixed band that you see in some places but plausibly doesn’t explain a lot of the other patterns we see.
Forget the explanations and let’s engage with the empirical fact. First, my hunch is that customers are unaware of the striking variation in the prices of many goods. Second, if customers become aware of this, their optimal strategy would be to use sites like CamelCamelCamel or PCPartPicker to pick the optimal time for purchasing a good. If retailers are somehow varying prices to explore profit-maximizing pricing (minus price discrimination based on location, etc.), and if all customers adopt the strategy of timing the purchase, then, in equilibrium, the retailer strategy would reduce to constant pricing.
p.s. I found it funny that there are ‘used product’ listings for soap.
p.p.s. I wrote about the puzzle of price dispersion on Amazon here.
In particular, in 521 villages in Haryana, we provided information on monthly immunization camps to either randomly selected individuals (in some villages) or to individuals nominated by villagers as people who would be good at transmitting information (in other villages). We find that the number of children vaccinated every month is 22% higher in villages in which nominees received the information. (From Banerjee et al. 2019)
The buildings, which are social units, were randomized to (1) targeting 20% of the women at random, (2) targeting friends of such randomly chosen women, (3) targeting pairs of people composed of randomly chosen women and a friend, or (4) no targeting. Both targeting algorithms, friendship nomination and pair targeting, enhanced adoption of a public health intervention related to the use of iron-fortified salt for anemia.
Coupon redemption reports showed that unadjusted adoption rates were 13.6% (SE = 1.5%) in the friend-targeted clusters, 11.2% (SE = 1.4%) in pair-targeted clusters, 9.1% (SE = 1.3%) in the randomly targeted clusters, and 0% in the control clusters receiving no intervention. (From Alexander et al. 2022)
Here’s a Twitter thread on the topic by Nicholas Christakis.
Targeting “structurally influential individuals,” e.g., people with lots of friends, people who are well regarded, etc., can lead to larger returns per ‘contact.’ This can be a useful thing. And as the studies demonstrate, finding these influential people is not hard—just ask a few people. There are, however, a few concerns:
- One of the concerns with any targeting strategy is that it can change who is treated. When you use network-based targeting, it biases the treated sample toward those who are more connected. That could be a good thing, especially if returns are the highest on those with the most friends, like in the case of curbing contagious diseases, or it could be a bad thing if the returns are the greatest on the least connected people. The more general point here is that most ROI calculations for network targeting have only accounted for costs of contact and assumed the benefits to be either constant or increasing in network size. One can easily rectify this by specifying the ROI function more fully or adding “fairness” or some kind of balance as a constraint.
- There is some stochasticity that stems from which person is targeted, and their idiosyncratic impact needs to be baked into standard error calculations for the ‘treatment,’ which is the joint of whatever the experimenters are doing and what the individual chooses to do with the experimenter’s directions (compliance needs a more careful definition). Interventions with targeting are thus liable to have more variable effects than those without targeting and plausibly need to be reproduced more often before they are used as policy.
Software engineering has changed dramatically in the last few decades. The rise of AWS, high-level languages, powerful libraries, and frameworks increasingly allow engineers to focus on business logic. Today, software engineers spend much of their time writing code that reasons over data to show something or do something. But how engineering is done has not caught up in some crucial ways:
- Software Development Tools. Most data scientists today work in a notebook on a server where they heavily interact with the data as they refine the code (algorithm). Most engineers still work locally without access to production data. Part of the reason engineers don’t have access to the data is because they work locally—for security and compliance reasons, access to production data from the local machine is banned in most places. Plausibly, a bigger reason is that engineers are stuck in a paradigm where they don’t think access to production data is foundational to faster, higher-quality software development. This belief is reflected in the ad-hoc solutions to the problem that are being tried across the industry, e.g., synthetic data (which is hard to create, maintain, and scale).
- Data Modeling. The focus on data modeling has sharply decreased over time in many companies. There are at least four underlying forces behind this trend. First, the combination of the volume of the data being generated and the rise of cheap blob storage (combined with the fact that computing power is comparatively vastly more expensive today) incentivizes the storage of unstructured data. Second, agile development, which prioritizes customer-facing progress over short time units, may cause underinvestment in costly, foundational work (see here). Third, the engineering organizations are changing in that the producers of the data are no longer seen as owners of the data. The fourth and last point is perhaps the most crucial—the surfeit of data has led to some magical thinking about the ease with which data can be used to power insights. Our ability to derive business insights from unstructured and dirty data, except for a small minority of cases, e.g., search, doesn’t exist. The only thing the surfeit of data has done is that it has widened and deepened the pool of insights that can be delivered. It hasn’t made it any easier to derive those insights, which continue to rely on good old-fashioned manual work to understand the use case and curate and structure the data appropriately. (It also then becomes an opportunity for building software.)
Engineers pay the price of not investing in data modeling by making the code more complex (and hence, more unmaintainable) and by allocating time to fix “bugs.” (The reason I put the word bugs in air quotes is because obvious consequences of a bad system should not be called bugs.)
- Data Drift. Machine Learning Engineers (MLEs) obsess about it. Most other engineers haven’t ever heard of the term. Everyone should worry. Technically, the only difference between using ML and engineering for rule creation is that ML auto-creates rules while conventional engineering relies on handcrafting the rules. Both systems test the efficacy of their rules on the current data. Both systems assume that the data will not drift. Only MLEs monitor the data, thinking hard about what data the rules work for and how to monitor data drift. Other engineers need to sign up.
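As a sketch of what monitoring drift could look like for the inputs of a handcrafted rule, here is the Population Stability Index (PSI), one of several common drift statistics. The function name, the bin count, and the alert threshold in the comment are assumptions for illustration, not a recommendation from any particular tool.

```python
import math
from bisect import bisect_right

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a current sample."""
    ref = sorted(reference)
    # Bin edges at reference quantiles, so each bin holds roughly equal reference mass.
    edges = [ref[int(len(ref) * i / n_bins)] for i in range(1, n_bins)]

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor empty bins at eps so the log below is defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = shares(reference), shares(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

# Made-up data: the values a rule was tuned on vs. a later, shifted batch.
tuned_on = list(range(1000))
later = [x + 500 for x in tuned_on]

# A common rule of thumb (an assumption here) treats PSI above 0.25 as a major shift.
print(psi(tuned_on, tuned_on), psi(tuned_on, later))
```

The same check applies whether the rule downstream was learned or written by hand; only the habit of running it differs.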
The solutions are as simple as the problems are immense: invest in data quality, data monitoring, and data models. To achieve that, we need to change how organizations are structured, how they are run, and what engineers think the hard problems are.
This is a review of Noise: A Flaw in Human Judgment by Kahneman, Sibony, and Sunstein.
The phrase “noise in decision making” brings to mind “random” error. Scientists, however, shy away from random error. Science is mostly about systematic error, except, perhaps, quantum physics. So Kahneman et al. conceive of noise as seemingly random error that is a result of unmeasured biases. For instance, research suggests that heat causes bad mood. And bad mood may, in turn, cause people to judge more harshly. If this were to hold, the variability in judging stemming from the weather can end up being interpreted as noise. But, as is clear, there is no “random” error, merely bias. Kahneman et al. make a hash of this point. Early on, they give the conventional formula of total expected error as the sum of squared bias and variance (they don’t further decompose variance into irreducible error and ‘random’ error) with the aim of talking about the two separately, and naturally, never succeed in doing that.
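For reference, here is the textbook decomposition of expected squared error, written in my own notation rather than the book's. It assumes the outcome is $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and that the judgment $\hat{y}$ is independent of $\varepsilon$:

```latex
\mathbb{E}\big[(y - \hat{y})^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{y}] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{y} - \mathbb{E}[\hat{y}])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Which term a given source of error lands in depends on what is conditioned on. In the weather example, error that counts as variance when the weather is unmeasured would plausibly count as bias once the expectation is taken conditional on the weather.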
The conceptual issues ought not distract us from the important point of the book. It is useful to think about human judgment systems as mathematical functions. We should expect the same inputs to map to the same output. It turns out that it isn’t even remotely true in most human decision-making systems. Take insurance underwriting, for instance. Given the same data (realistic but made-up information about cases), the median percentage difference between quotes from any pair of underwriters is an eye-watering 55% (which means that for half of the cases, it is worse than 55%), about five times as large as expected by the executives. There are a few interesting points that flow from this data. First, if you are a customer, your optimal strategy is to get multiple quotes. Second, what explains ignorance about the disagreement? There could be a few reasons. First, when people come across a quote from another underwriter, they may ‘anchor’ their estimate on the number they see, reducing the gap between the number and the counterfactual. Second, colleagues plausibly read to agree (less effort and optimizing for collegiality), asking, “Could this make sense?”, rather than read to evaluate, asking, “Does this make sense?” (See my notes for a fuller set of potential explanations.)
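To make the underwriting statistic concrete, here is a small sketch of a median pairwise percentage difference. It assumes the difference for a pair is taken as a share of the pair's average, which is my reading of the measure and may not match the exact computation in the book; the function names and the quotes are made up.

```python
from itertools import combinations
from statistics import median

def pairwise_pct_diffs(quotes):
    """Absolute difference of each pair of quotes as a share of the pair's mean."""
    return [abs(a - b) / ((a + b) / 2) for a, b in combinations(quotes, 2)]

def median_pairwise_diff(cases):
    """Median pairwise percentage difference, pooled over all cases."""
    diffs = [d for quotes in cases for d in pairwise_pct_diffs(quotes)]
    return median(diffs)

# Made-up quotes (in dollars) from three underwriters on two identical case files.
cases = [
    [9_500, 16_700, 12_000],
    [20_000, 31_000, 26_000],
]
print(f"{median_pairwise_diff(cases):.0%}")
```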
Data from asylum reviews is yet starker. “A study of cases that were randomly allotted to different judges found that one judge admitted 5% of applicants, while another admitted 88%.” (Paper.)
Variability can stem from only two things. It could be that the data doesn’t allow for a unique judgment (irreducible error). (But even here, the final judgment should reflect the uncertainty in the data.) Or that at least one person is ‘wrong’ (has a different answer than others). Among other things, this can stem from:
- variation in skill, e.g., how to assess patent applications
- variation in effort, e.g., some people put in more effort than others
- agency and preferences, e.g., I am a conservative judge, and I can deny an asylum application because I have the power to do so
- biases like using irrelevant information, e.g., weather, hypoglycemia, etc.
(Note: a lack of variability doesn’t mean we are on to the right answer.)
The list of proposed solutions is extensive: from selecting better judges to the wisdom of the crowds to using models to training people better to more elaborate schemes like dividing the decision task and asking people to make relative rather than absolute judgments. The evidence backing the solutions is not always hefty, which meshes with the ideologue-like approach to evidence present everywhere in the book. When I did a small audit of the citations, three things stood out (the overarching theme is adherence to the “No Congenial Result Scrutinized or Left Uncited Act”):
- Extremely small n studies cited without qualification. Software engineers.
Quote from the book: “when the same software developers were asked on two separate days to estimate the completion time for the same task, the hours they projected differed by 71%, on average.”
The underlying paper: “In this paper, we report from an experiment where seven experienced software professionals estimated the same sixty software development tasks over a period of three months. Six of the sixty tasks were estimated twice.”
- Extremely small n studies cited without qualification. Israeli Judges.
Hypoglycemia and judgment: “Our data consist of 1,112 judicial rulings, collected over 50 d in a 10-mo period, by eight Jewish-Israeli judges (two females) who preside over two different parole boards that serve four major prisons in Israel.”
- Surprising but likely unreplicable results. “When calories are on the left, consumers receive that information first and evidently think “a lot of calories!” or “not so many calories!” before they see the item. Their initial positive or negative reaction greatly affects their choices. By contrast, when people see the food item first, they apparently think “delicious!” or “not so great!” before they see the calorie label. Here again, their initial reaction greatly affects their choices. This hypothesis is supported by the authors’ finding that for Hebrew speakers, who read right to left, the calorie label has a significantly larger impact.” (Paper.)
“We show that if the effect sizes in Dallas et al. (2019) are representative of the populations, a replication of the six studies (with the same sample sizes) has a probability of only 0.014 of producing uniformly significant outcomes.” (Paper.)
- Citations to HBR. Citations to think pieces in Harvard Business Review (10 citations in total based on a keyword search) and books like ‘Work Rules!’ for a fair number of claims.
Here are my notes for the book.
Very little of the code that the government pays for is open-sourced. One of the reasons is that private companies would rather the code remain under wraps so that the errors never come to light, the price for producing software is never debated, and they get to continue to charge for similar work elsewhere.
Open-sourcing code is liable to produce the following benefits:
- It will help us discover bugs.
- It will reduce the cost of building similar software. In a federal system, many local agencies produce (or buy) similar software to help administer similar services. Having the code open-sourced is likely to reduce the barrier to entry for firms bidding to build such software and will likely lead to lower costs over time.
- Freely available software under a generous license, e.g., queue management software, optimal staffing software, etc., benefits the economy as firms do not have to invest as much in building such systems.
- It will likely increase trust in the government. For instance, where software is used to estimate benefits, the auditability of the software is likely to lead to a modest increase in confidence in the correctness of how the law has been translated into code.
There are at least three ways to open-source government code. First, firms like OpenGov that produce open-source software for the government are already helping bring some of the code online. But given that the space for government software is large, it will likely take many decades for a tangible proportion of software to be open-sourced. Second, we can lobby the government to change the law so that companies (and agencies) are mandated to open source certain software they build for the government. But the prognosis is bleak, given that the government contractors are likely lobbying hard against it. The third option is to use FOIA to request code and make it available on GitHub. I sense that this is a tenable option.
Say that we want to measure how often people go to risky websites. Let’s assume that the measurement of risk is expensive. We have data on how often people visit each domain on the web from a large sample. The number of unique domains in the data is large, making measuring the population of domains impossible. Say there is a sharp skew in the visitation of domains. What is the fewest number of domains we need to measure to get s.e. of no greater than X per row?
Here are some ideas:
- The base solution is simple: sample domains in each row (with replacement) in proportion to views/time to get to the desired s.e. Then, collate the selected domains and get labels for those.
- Exploit the skew in the distribution. For instance, sample from 99% of the distribution and save yourself from the long tail. Bound each estimate by the unsampled 1% (which could be anything) and enjoy. For greater accuracy, do a smaller, cruder sample of the 1% and get to the +/- 10% with an n = 100. The full version of this point is as follows: we benefit from increasing the probability of including more frequently occurring domains. Taken to the extremum, you could deterministically include the most frequent domains, and then prorate the size of the sample for the rest by the size of the area under the curve. This kind of strategy can help answer: how to optimally sample skewed distributions to get the smallest s.e. with the fewest observations?
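The last idea can be sketched as follows: always label the most visited domains, then draw the rest with probability proportional to visits (with replacement) and weight the tail estimate by the tail's share of traffic. The function names, the cutoffs, and the `is_risky` labeler are hypothetical, and the sketch pools visits rather than working per row.

```python
import random

def sample_domains(visits, n_certain, n_tail, seed=0):
    """Pick domains to label: the top `n_certain` always, plus a PPS draw from the rest."""
    ranked = sorted(visits, key=visits.get, reverse=True)
    certain, tail = ranked[:n_certain], ranked[n_certain:]
    rng = random.Random(seed)
    # With replacement, probability proportional to visits.
    drawn = rng.choices(tail, weights=[visits[d] for d in tail], k=n_tail) if tail else []
    return certain, drawn

def estimate_risky_share(visits, certain, drawn, is_risky):
    """Estimate the visit-weighted share of risky traffic from the labeled domains."""
    total = sum(visits.values())
    head = sum(visits[d] * is_risky(d) for d in certain) / total
    tail_weight = 1 - sum(visits[d] for d in certain) / total
    # Under PPS with replacement, the plain mean of labels estimates the tail's risky share.
    tail_rate = sum(is_risky(d) for d in drawn) / len(drawn) if drawn else 0.0
    return head + tail_weight * tail_rate

# Made-up, sharply skewed visit counts; every 7th domain is "risky" for illustration.
visits = {f"site{i}.example": 100_000 // (i + 1) for i in range(2_000)}
risky = {d for i, d in enumerate(visits) if i % 7 == 0}
certain, drawn = sample_domains(visits, n_certain=50, n_tail=100)
print(estimate_risky_share(visits, certain, drawn, lambda d: d in risky))
```

Only the distinct domains in `certain` and `drawn` need the expensive label, and sampling error comes only from the tail, whose weight shrinks as more of the head is included with certainty.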
In a new paper, Chohlas-Wood et al. present three interesting points:
- Some of the major policing strategies have scant empirical support:
- The impact of “pulling over drivers for minor traffic violations” (for the alleged purpose of “[preventing] criminal activity by intercepting individuals driving to and from the scene of a crime”) in Nashville was ~ 0 on serious crimes. (See Figures 1 and 2). To get a sense of the scale of the intervention: “In 2012, the MNPD conducted traffic stops up to ten times more frequently per capita than police departments in similar U.S. cities.”
- The impact of stop and frisk in NYC on serious crime was also ~ 0. Again, to get a sense of the scale of the policy: “NYPD officers reported conducting nearly 700,000 Terry stops in 2011 alone, nearly 90% of which involved Black or Hispanic pedestrians.”
- GS: None of this is terribly surprising. All over the world, very few policies are chosen as a result of careful data analysis. Why would policing be any different? My other prior based on looking at a fair bit of US crime data is that to a first approximation, all trends are national. When policing is local and trends are national, it suggests that the way policing is done is perhaps not the most important factor in preventing crime.
- Racial bias in who is stopped:
- “[A]t any given level of risk Black and Hispanic individuals were frisked considerably more often than white individuals.” (NYC, 2011-2012)
- “[T]he rates at which frisks recover weapons are significantly lower for frisked Black individuals (3.8%) and Hispanic individuals (3.4%) compared to white individuals (5.7%).” (From the Chicago Police Department (CPD) in 2017)
- Contraband recovery rate for Blacks = 17%, Hispanics = 20%, Whites = 27% (Chicago 2014–2019, traffic stops.)
- Contraband recovery rate for Blacks = 24%, Hispanics = 23%, Whites = 34% (Philadelphia 2014–2019; traffic stops.)
- GS: I am impressed by the contraband recovery rates. Either the base rate of ‘contraband’ is super high or the police are very good. My hunch is the former but would love to see data. (See below.)
- GS: If police select who to stop based on observable characteristics (conditional on location; what else can they rely on?), criminals may be incentivized to game that, reducing the value of observables over time.
- Whack-a-mole nature of policing policies
- “The settlement agreement with the ACLU took effect on January 1, 2016. For 2016, the CPD reported a total of approximately 100,000 pedestrian stops, a sharp drop from the roughly 600,000 stops reported for 2015 (Figure 9). At the same time, the number of traffic stops made by the CPD began to rise. The CPD reported around 100,000 traffic stops in 2014 and a similar amount in 2015, but by 2019, the CPD reported nearly 600,000 traffic stops, with large increases occurring each year from 2016 to 2019. These traffic stops came to closely resemble the pedestrian stops that the CPD was contemporaneously under pressure to curtail. …”
- “Following a consent decree and settlement in 2011, pedestrian stops fell from more than 200,000 reported stops in 2014 (the earliest year for which we have data released publicly by the city) to fewer than 100,000 reported stops in each of 2018 and 2019, while traffic stops almost doubled in the same period”
p.s. Graham sends this:
“Back in the 1990s, it looked like the Supreme Court was going to run drug checkpoints, so Indianapolis started doing one. Drivers were stopped completely at random until the Supreme Court put an end to it.
“The city conducted six such roadblocks between August and November that year, stopping 1,161 vehicles and arresting 104 motorists. Fifty-five arrests were for drug-related crimes, while 49 were for offenses unrelated to drugs. The overall “hit rate” of the program was thus approximately nine percent.”
If you take this as a baseline, police are twice as good at finding contraband as random selection. If “contraband” just means drugs, then probably four times as good. So the baseline rate of contraband is high (a surprising number of people have warrants, drugs, and weapons) but police are also beating the odds.”
Chicago is not Indianapolis and 2015 is not 2000 but still valuable.
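The arithmetic behind the comparison, as a quick sketch. It uses only the numbers quoted above and assumes, loosely, that arrests at the Indianapolis roadblocks are comparable to contraband recoveries at traffic stops; the helper names are mine.

```python
def hit_rate(hits, stops):
    """Share of stops that produced a hit."""
    return hits / stops

def lift(observed_rate, baseline_rate):
    """How many times better than the random-stop baseline a rate is."""
    return observed_rate / baseline_rate

# Indianapolis roadblocks: 1,161 vehicles, 104 arrests, 55 of them drug-related.
baseline_any = hit_rate(104, 1161)    # about 9%
baseline_drugs = hit_rate(55, 1161)   # about 4.7%

# Contraband recovery rates at traffic stops, 2014-2019, as reported above.
recovery = {
    ("Chicago", "Black"): 0.17,
    ("Chicago", "Hispanic"): 0.20,
    ("Chicago", "White"): 0.27,
    ("Philadelphia", "Black"): 0.24,
    ("Philadelphia", "Hispanic"): 0.23,
    ("Philadelphia", "White"): 0.34,
}

for (city, group), rate in recovery.items():
    print(f"{city:12s} {group:8s} "
          f"{lift(rate, baseline_any):.1f}x vs. any arrest, "
          f"{lift(rate, baseline_drugs):.1f}x vs. drug arrests")
```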
p.p.s. Graham also highlights an issue with Figure 2. Chohlas-Wood et al. plot the murder rate per 1k on the same graphs as vehicle stops per 1k. This naturally squishes the variation in the murder rate. The general rule is that you should avoid plotting variables that vary by orders of magnitude on the same graph. At any rate, doing so gives the appearance that the authors are putting a thumb on the scale.
In an influential essay, The Cognitive Style of PowerPoint, Tufte argues that (PowerPoint) presentations are unsuitable for serious problems. The essay is largely polemical, with Tufte freely mixing points about affordances of the medium with criticisms of bad presentations and lazy broadsides.
Hilarious stuff first:
- “All 3 reports have standard PP format problems: elaborate bullet outlines; segregation of words and numbers (12 of 14 slides with quantitative data have no accompanying analysis); atrocious typography; data imprisoned in tables by thick nets of spreadsheet grids; only 10 to 20 short lines of text per slide.”
- “On this single Columbia slide, in a PowerPoint festival of bureaucratic hyper-rationalism, 6 different levels of hierarchy are used to classify, prioritize, and display 11 simple sentences”
- “In 28 books on PP presentations, the 217 data graphics depict an average of 12 numbers each. Compared to the worldwide publications shown in the table at right, the statistical graphics based on PP templates are the thinnest of all, except for those in Pravda back in 1982, when that newspaper operated as the major propaganda instrument of the Soviet communist party and a totalitarian government.”
In the essay, I could only rescue two points about affordances (that I buy):
- “When information is stacked in time, it is difficult to understand context and evaluate relationships.”
- Inefficiency: “A talk, which proceeds at a pace of 100 to 160 spoken words per minute, is not an especially high resolution method of data transmission. Rates of transmitting visual evidence can be far higher. … People read 300 to 1,000 printed words a minute, and find their way around a printed map or a 35mm slide displaying 5 to 40 MB in the visual field. Yet, in a strange reversal, nearly all PowerPoint slides that accompany talks have much lower rates of information transmission than the talk itself. As shown in this table, the PowerPoint slide typically shows 40 words, which is about 8 seconds worth of silent reading material. The slides in PP textbooks are particularly disturbing: in 28 textbooks, which should use only first-rate examples, the median number of words per slide is 15, worthy of billboards, about 3 or 4 seconds of silent reading material. This poverty of content has several sources. First, the PP design style, which typically uses only about 30% to 40% of the space available on a slide to show unique content, with all remaining space devoted to Phluff, bullets, frames, and branding. Second, the slide projection of text, which requires very large type so the audience can read the words.”
From Working Backwards, which cites the article as the reason Amazon pivoted from presentations to 6-pagers for its S-team meetings, there is one more reasonable point about presentations more generally:
“…the public speaking skills of the presenter, and the graphics arts expertise behind their slide deck, have an undue—and highly variable—effect on how well their ideas are understood.” (Working Backwards)
The points about graphics arts expertise, etc., apply to all documents but are likely less true for reports than presentations. (It would be great to test the effect of the prettiness of graphics on their persuasiveness.)
Reading the essay made me think harder about why we use presentations in meetings about complex topics more generally. For instance, academics frequently present to other academics. Replacing presentations with 6-pagers that people quietly read and comment on at the start of the meeting and then discuss may yield higher quality comments and better discussion and better evaluation of the scholar (and the scholarship).
p.s. If you haven’t seen Norvig’s Gettysburg Address in PowerPoint, you must.
p.p.s. Ed Haertel forwarded me this piece by Sam Wineburg on why asking students to create PowerPoints is worse than asking them to write an essay.
p.p.p.s. Here’s how Amazon runs its S-team meetings (via Working Backwards):
1. 6-pager (can have appendices) distributed at the start of the discussion.
2. People read in silence and comment for the first 20 min.
3. The remaining 40 min. are devoted to discussion, which is organized by 1. big issues/small issues, 2. people going around the room, etc.
4. One dedicated person to take notes.
A recent piece on ESPNCricinfo analyses the DRS data and argues that cricket should do away with neutral umpires. I reanalyzed the data.
If a game is officiated by a home umpire, we expect the following:
- Hosts will appeal less often as they are likely to be happier with the decision in the first place
- When visitors appeal a decision, their success rate should be higher than the hosts'. Visitors are appealing against an unfavorable call: a visiting player was unfairly given out, or they felt the host player was unfairly given not out. And we expect the visitors to get more bad calls.
When analyzing success rate, I think it is best to ignore appeals that are struck down because they defer to the umpire’s call. Umpire’s call generally applies to LBW decisions, and especially to two aspects of the LBW decision: 1. whether the ball was pitching in line, 2. whether it was hitting the wickets. To take a recent example, in the second test of the 2021 Ashes series, Lyon got a wicket when the impact was ‘umpire’s call’ and Stuart Broad was denied a wicket for the same reason.
Ollie Robinson Unsuccessfully Challenging the LBW Decision
Stuart Broad Unsuccessfully Challenging the Not-LBW Decision
With the preliminaries over, let’s get to the data covered in the article. Table 1 provides some summary statistics of the outcomes of DRS. As is clear, the visiting team appealed the umpire’s decision far more often than the home team: 303 vs. 264. Put another way, the visiting team lodged nearly one more appeal per test than the home team. So how often did the appeals succeed? In line with our hypothesis, the home team appeals were upheld less often (24%) than visiting team’s appeals (29%).
Table 1. Review Outcomes Under Home Umpires. 41 Tests. July 2020–Nov. 2021.
| REVIEWER TYPE | TOTAL PLAYER REVIEWS | STRUCK DOWN (%) | UMPIRE’S CALL (STRUCK DOWN) (%) | UPHELD (%) |
|---|---|---|---|---|
| HOME BATTING | 96 | 39 (40%) | 25 (26%) | 32 (34%) |
| HOME BOWLING | 168 | 108 (64%) | 29 (18%) | 31 (18%) |
| VISITOR BATTING | 147 | 58 (39%) | 25 (17%) | 64 (44%) |
| VISITOR BOWLING | 156 | 97 (62%) | 34 (22%) | 25 (16%) |
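The headline success rates can be recomputed from the counts in Table 1. The sketch below also shows what happens if, as suggested earlier, reviews struck down on umpire's call are left out of the denominator. The dictionary layout and the function name `success_rates` are my own choices for illustration; only the counts come from the table.

```python
# Counts copied from Table 1: (total reviews, struck down, umpire's call, upheld).
table1 = {
    ("home", "batting"): (96, 39, 25, 32),
    ("home", "bowling"): (168, 108, 29, 31),
    ("visitor", "batting"): (147, 58, 25, 64),
    ("visitor", "bowling"): (156, 97, 34, 25),
}


def success_rates(side):
    """Return (reviews, upheld share of all reviews, upheld share excluding umpire's call)."""
    rows = [counts for (team, _), counts in table1.items() if team == side]
    total = sum(r[0] for r in rows)
    umpires_call = sum(r[2] for r in rows)
    upheld = sum(r[3] for r in rows)
    return total, upheld / total, upheld / (total - umpires_call)


for side in ("home", "visitor"):
    total, raw, adjusted = success_rates(side)
    print(f"{side}: {total} reviews, upheld {raw:.1%}, "
          f"excluding umpire's call {adjusted:.1%}")
```

With all reviews in the denominator, this gives roughly 24% for the home team and 29% for the visitors. Dropping umpire's call reviews raises both rates (to roughly 30% and 36%) but leaves the gap in the same direction.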
It could be that these results reflect something about hosts vs. visitors rather than home umpires. For instance, hosts win a lot, and that generally means that they will bowl for shorter periods of time and bat for longer periods of time. We account for this by comparing outcomes under neutral umpires. The article has data on the same. There, you see that the visiting team makes fewer appeals (198) than the home team (214). And the visiting team’s success rate in appeals is slightly lower (29%) than the home team’s rate (30%).
At the bottom of the article is another table that breaks down reviews by host country:
| HOST COUNTRY | TESTS | UMPIRES | REVIEWS | HOSTS’ SUCCESS (%) | VISITORS’ SUCCESS (%) |
|---|---|---|---|---|---|
| ENGLAND | 13 | AG WHARF, MA GOUGH*, RA KETTLEBOROUGH*, RK ILLINGWORTH* | 190 | 22/85 (26%) | 32/105 (30%) |
| NEW ZEALAND | 4 | CB GAFFANEY*, CM BROWN, WR KNIGHTS | 41 | 3/17 (18%) | 5/24 (21%) |
| AUSTRALIA | 4 | BNJ OXENFORD, P WILSON, PR REIFFEL* | 55 | 5/30 (17%) | 6/25 (24%) |
| SOUTH AFRICA | 2 | AT HOLDSTOCK, M ERASMUS* | 20 | 2/10 (20%) | 3/10 (30%) |
| SRI LANKA | 6 | HDPK DHARMASENA*, RSA PALLIYAGURUGE | 85 | 9/42 (21%) | 13/43 (30%) |
| PAKISTAN | 2 | AHSAN RAZA, ALEEM DAR* | 27 | 0/11 (0%) | 6/16 (38%) |
| INDIA | 5 | AK CHAUDHARY, NITIN MENON*, VK SHARMA | 87 | 9/40 (23%) | 11/47 (23%) |
| WEST INDIES | 6 | GO BRATHWAITE, JS WILSON* | 94 | 13/50 (26%) | 13/44 (29%) |
But the data in this table don’t match Table 1. For one, the number of tests considered is 42 rather than 41. For two, and perhaps relatedly, the total number of reviews is 599 rather than 567. To be comprehensive, let’s do the same calculations as above. The visiting team appeals more (314) than the host team (285). The host team success rate is 22% (63/285), and the visiting team success rate is 28% (89/314). If you were to do a statistical test for success rates:
    prop.test(x = c(63, 89), n = c(285, 314))

        2-sample test for equality of proportions with continuity correction

    data:  c(63, 89) out of c(285, 314)
    X-squared = 2.7501, df = 1, p-value = 0.09725
    alternative hypothesis: two.sided
    95 percent confidence interval:
     -0.13505623  0.01028251
    sample estimates:
       prop 1    prop 2
    0.2210526 0.2834395
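For readers without R, the test statistic above can be checked by hand. For a 2x2 table, the two-sample proportion test with continuity correction is the chi-squared test with Yates' correction, and with one degree of freedom the p-value has a closed form via the complementary error function. The function name `two_proportion_test` is mine; this is a minimal sketch, not a replacement for `prop.test` (it does not compute the confidence interval).

```python
import math


def two_proportion_test(successes, trials):
    """Chi-squared test (1 df) for equal proportions with Yates' continuity correction."""
    a, c = successes                        # upheld reviews: hosts, visitors
    b, d = trials[0] - a, trials[1] - c     # reviews that were not upheld
    n = a + b + c + d
    # Yates' correction shrinks |ad - bc| by n/2 (never below zero).
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    statistic = n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


statistic, p_value = two_proportion_test((63, 89), (285, 314))
print(f"X-squared = {statistic:.4f}, p-value = {p_value:.5f}")
```

The statistic and p-value agree with the R output above to the digits shown there.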
The KNN classifier is one of the most intuitive ML algorithms. It predicts class by polling the k nearest neighbors. Because it seems so simple, it is easy to miss a couple of the finer points:
- Sample Splitting: Traditionally, when we split the sample, there is no peeking across samples. For instance, when we split the sample between a train and test set, we cannot look at the data in the training set when predicting the label for a point in the test set. In knn, this segregation is not observed. Say we partition the training data to learn the optimal k. When predicting a point in the validation set, we must pass the entire training set. Passing the points in the validation set as well would be bad because each validation point is its own nearest neighbor (at distance zero), so the optimal k will always be the one that picks just the point itself (k = 0, if you count the point itself as the zeroth neighbor). (If you ignore k = 0, you can pass the rest of the dataset.)
- Implementation Differences: “Regarding the Nearest Neighbors algorithms, if it is found that two neighbors, neighbor k+1 and k, have identical distances but different labels, the results will depend on the ordering of the training data.” (From the scikit-learn documentation; emphasis mine.)
This matters when the distance metric is discrete, e.g., if you use an edit-distance metric to compare strings. Worse, scikit-learn doesn’t warn users during analysis.
In R, one popular implementation of KNN is in a package called `class`. (Overloading the word `class` seems like a bad idea but that’s for a separate thread.) In `class`, how the function deals with this scenario is decided by an explicit option: “If [the option is] true, all distances equal to the kth largest are included. If [the option is] false, a random selection of distances equal to the kth is chosen to use exactly k neighbours.”
For the underlying problem, there isn’t one clear winning solution. One way to solve the problem is to move from knn to adaptive knn: include all points that are as far away as the kth point. This is what `class` in R does when the option `use.all` is set to TRUE. Another solution is to never change the order in which the data are accessed and to make the order part of how the model is exported.
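To make the tie problem concrete, here is a toy KNN written from scratch with a discrete (Hamming) distance. It is a sketch for illustration, not scikit-learn's or `class`'s implementation; the function names, the `include_ties` flag, and the four training strings are all made up. With `include_ties=False` the answer depends on the order of the training data; with `include_ties=True` (the adaptive variant) it does not.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k, include_ties=False):
    """train is a list of (string, label) pairs. Returns (label, neighbors used)."""
    # sorted() is stable, so tied points keep their order from `train`.
    ranked = sorted(train, key=lambda item: hamming(item[0], query))
    if include_ties:
        # Adaptive KNN: keep every point as close as the kth nearest one.
        cutoff = hamming(ranked[k - 1][0], query)
        neighbors = [item for item in ranked if hamming(item[0], query) <= cutoff]
    else:
        neighbors = ranked[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0], len(neighbors)


# Three training points are tied at distance 1 from the query "aaa".
train = [("aab", "x"), ("aba", "y"), ("baa", "y"), ("bbb", "x")]
reordered = list(reversed(train))

print(knn_predict(train, "aaa", k=1))       # uses whichever tied point comes first
print(knn_predict(reordered, "aaa", k=1))   # same data, different order
print(knn_predict(train, "aaa", k=1, include_ties=True))
print(knn_predict(reordered, "aaa", k=1, include_ties=True))
```

Including ties still leaves the possibility of a tied vote, so it narrows the problem rather than removing it.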
Permutation-based methods for calculating variable importance and interpretation are increasingly common. Here are a few common places where they are used:
Feature Importance (FI)
The algorithm for calculating permutation-based FI is as follows:
- Estimate a model
- Permute a feature
- Predict again
- Estimate decline in predictive accuracy and call the decline FI
Permutation-based FI bakes in a particular notion of FI. It is best explained with an example: Say you are calculating FI for X (1 to k) in a regression model. Say you want to estimate FI of X_k. Say X_k has a large beta. Permutation-based FI will take the large beta into account when calculating the FI. So, the notion of importance is one that is conditional on the model.
Often we want to get at a different counterfactual: what happens if we drop X_k? You can get to that by dropping the variable and re-estimating the model, which lets other correlated variables pick up large betas. I can see a use case in checking if we can knock out, say, an ‘expensive’ variable. There may be other uses.
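The difference between the two notions is easiest to see when two features are near duplicates. The sketch below uses simulated data (all numbers are assumptions for illustration): `x2` is a noisy copy of `x1`, and only `x1` enters the outcome. Permuting `x1` hurts the fitted model a lot because the model leans on it, while dropping `x1` and re-estimating costs almost nothing because `x2` takes over. The permutation estimate is averaged over a few shuffles to reduce noise.

```python
import random


def mse(y, pred):
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)


def fit_one(x, y):
    """Least-squares slope for y ~ b * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def fit_two(x1, x2, y):
    """Least-squares slopes for y ~ b1 * x1 + b2 * x2, via the normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(a * a for a in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * b for a, b in zip(x1, y))
    s2y = sum(a * b for a, b in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


rng = random.Random(0)
n = 500
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + 0.1 * rng.gauss(0, 1) for a in x1]      # x2 is a close proxy for x1
y = [2 * a + 0.5 * rng.gauss(0, 1) for a in x1]   # only x1 enters the outcome

b1, b2 = fit_two(x1, x2, y)
base = mse(y, [b1 * a + b2 * b for a, b in zip(x1, x2)])


def permutation_fi(n_repeats=20):
    """Average rise in MSE when x1 is shuffled and the fitted model is reused."""
    increases = []
    for _ in range(n_repeats):
        shuffled = x1[:]
        rng.shuffle(shuffled)
        pred = [b1 * a + b2 * b for a, b in zip(shuffled, x2)]
        increases.append(mse(y, pred) - base)
    return sum(increases) / n_repeats


def drop_and_refit_fi():
    """Rise in MSE when x1 is dropped and the model is re-estimated on x2 alone."""
    b = fit_one(x2, y)
    return mse(y, [b * v for v in x2]) - base


perm_fi = permutation_fi()
drop_fi = drop_and_refit_fi()
print(f"permutation FI: {perm_fi:.3f}, drop-and-refit FI: {drop_fi:.3f}")
```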
Aside: To my dismay, I kludged the two together here. In my defense, I thought it was a private email. But still, I was wrong.
Permutation-based methods are used elsewhere. For instance:
“We construct our knockoff matrix X̃ by randomly swapping the n rows of the design matrix X. This way, the correlations between the knockoffs remain the same as the original variables but the knockoffs are not linked to the response Y. Note that this construction of the knockoffs matrix also makes the procedure random.”
From https://arxiv.org/pdf/1907.03153.pdf#page=4
Local Interpretable Model-Agnostic Explanations
The recipe for training local surrogate models:
- Select your instance of interest for which you want to have an explanation of its black box prediction.
- Perturb your dataset and get the black box predictions for these new points.
- Weight the new samples according to their proximity to the instance of interest.
- Train a weighted, interpretable model on the dataset with the variations.
- Explain the prediction by interpreting the local model.

From https://christophm.github.io/interpretable-ml-book/lime.html
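The recipe can be sketched in a few lines for a one-dimensional case. This is a toy illustration, not the `lime` package: the black box (`x ** 2`), the Gaussian perturbations, the exponential proximity kernel, and every default value are assumptions I picked so that the right answer is known. The local slope at x = 2 should be close to the derivative, 4.

```python
import math
import random


def black_box(x):
    """Stand-in for an opaque model; the surrogate only sees its predictions."""
    return x ** 2


def local_surrogate(x0, n_samples=2000, noise_sd=0.5, kernel_width=0.5, seed=0):
    """Fit a proximity-weighted line around x0 and return (intercept, slope)."""
    rng = random.Random(seed)
    # Perturb around the instance of interest and query the black box.
    xs = [x0 + rng.gauss(0, noise_sd) for _ in range(n_samples)]
    ys = [black_box(x) for x in xs]
    # Weight each sample by its proximity to x0.
    ws = [math.exp(-((x - x0) ** 2) / (2 * kernel_width ** 2)) for x in xs]
    # Weighted least squares for y ~ intercept + slope * x.
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    cov = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    slope = cov / var
    return y_bar - slope * x_bar, slope


intercept, slope = local_surrogate(x0=2.0)
print(f"local explanation near x = 2: slope = {slope:.2f}")
```

The result depends on the random draws, the kernel width, and the number of samples, which is where the instability discussed next comes from.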
Common Issue With Permutation Based Methods
“Another really big problem is the instability of the explanations. In an article the authors showed that the explanations of two very close points varied greatly in a simulated setting. Also, in my experience, if you repeat the sampling process, then the explanations that come out can be different. Instability means that it is difficult to trust the explanations, and you should be very critical.”
From https://christophm.github.io/interpretable-ml-book/lime.html
One way to solve instability is to average over multiple rounds of permutations. It is expensive but the payoff is stability.
In many ML applications, especially ones where you need to train a model on customer data to get high levels of accuracy, the only models that ML SaaS companies can offer to a client out-of-the-box are bad. But many ML SaaS businesses hesitate to go to a client with a bad model. Part of the reason is that companies don’t understand that they can deliver value with a bad model. In many places, you can deliver value with a bad model by deploying a high-precision version, only offering predictions where you are highly confident. Another reason why ML SaaS companies likely hesitate is a lack of a reasonable pricing model. There, charging per correct response with some penalty for an incorrect answer may prove a good option. (If you are the sole bidder, setting the price just below the marginal cost of getting a human to label a response plus any additional business value from getting the job done more quickly may be one fine place to start.) Having such a pricing model is likely to reassure the client that they won’t be charged for the glamour of having an ML model and instead will only be charged for the results. (There is, of course, an upfront cost of switching to an ML model, which can be reasonably high and that cost needs to be assessed in terms of potential payoff over the long term.)
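A toy calculation shows how the two ideas fit together. Everything here is hypothetical: the scored predictions, the price per correct answer, and the penalty per wrong answer are made-up numbers, and `abstaining_report` is just a name for the sketch. The point is that the same weak model loses money when it answers everything and makes money when it only answers where it is confident.

```python
def abstaining_report(predictions, threshold, price_correct=1.0, penalty_wrong=3.0):
    """predictions is a list of (confidence, is_correct) pairs.

    The model only answers when confidence >= threshold. Returns
    (coverage, precision, revenue) under a pay-per-correct-answer contract.
    """
    answered = [ok for conf, ok in predictions if conf >= threshold]
    n_correct = sum(answered)
    n_wrong = len(answered) - n_correct
    coverage = len(answered) / len(predictions)
    precision = n_correct / len(answered) if answered else float("nan")
    revenue = price_correct * n_correct - penalty_wrong * n_wrong
    return coverage, precision, revenue


# Hypothetical output of a weak model: (confidence, was the answer right?).
scored = [
    (0.95, True), (0.92, True), (0.90, True), (0.85, False), (0.80, True),
    (0.70, False), (0.65, True), (0.60, False), (0.55, False), (0.51, True),
]

for threshold in (0.5, 0.9):
    coverage, precision, revenue = abstaining_report(scored, threshold)
    print(f"threshold {threshold}: coverage {coverage:.0%}, "
          f"precision {precision:.0%}, revenue {revenue:+.1f}")
```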
It is a myth that data speaks for itself. The analyst speaks for the data. The analyst chooses what questions to ask, what analyses to run, how the analyses are interpreted, and how they are summarized. I use excerpts from a paper by Gilliam et al. on media portrayal of crime as a way to highlight one set of choices by a group of analysts. (The excerpts also highlight the need for reading a paper fully rather than relying on the abstract alone.)
White Violent Criminals Are Overrepresented
White Nonviolent Criminals Are Overrepresented
Relative Underrepresentation Between Violent and Nonviolent Crime is a Problem
In 2013, Girshick et al. released a paper that described a technique to solve an impossible-sounding problem—classifying each pixel of an image (or semantic segmentation). The technique that they proposed, R-CNN, combines deep learning, selective search, and SVM. It also has all sorts of ad hoc choices, from the size of the feature vector to the number of regions, that are justified by how well they work in practice. R-CNN is not unusual. Many machine learning papers are recipes that ‘work.’ There is a reason for that. Machine learning is an engineering discipline. It isn’t a scientific one.
You may think that engineering must follow science, but often it is the other way round. For instance, we learned how to build things before we learned the science behind it—we trialed-and-errored and overengineered our way to many still standing buildings while the scientific understanding slowly accumulated. Similarly, we were able to predict the seasons and the phases of the moon before learning how our solar system worked. Our ability to solve problems with machine learning is similarly ahead of our ability to put it on a firm scientific basis.
Often, we build something based on some vague intuition, find that it ‘works,’ and only over time, deepen our intuition about why (and when) it works. Take, for instance, Dropout. The original paper (released in 2012, published in 2014) had the following as motivation:
“A motivation for Dropout comes from a theory of the role of sex in evolution (Livnat et al., 2010). Sexual reproduction involves taking half the genes of one parent and half of the other, adding a very small amount of random mutation, and combining them to produce an offspring. The asexual alternative is to create an offspring with a slightly mutated copy of the parent’s genes. It seems plausible that asexual reproduction should be a better way to optimize individual fitness because a good set of genes that have come to work well together can be passed on directly to the offspring. On the other hand, sexual reproduction is likely to break up these co-adapted sets of genes, especially if these sets are large and, intuitively, this should decrease the fitness of organisms that have already evolved complicated coadaptations. However, sexual reproduction is the way most advanced organisms evolved. …”
Srivastava et al. 2014, JMLR
Moreover, the paper provided no proof and only some empirical results. It took until Gal and Ghahramani’s 2016 paper (released in 2015) to put the method on a firmer scientific footing.
Then there are cases where we have made ad hoc choices that ‘work’ and where no one will ever come up with a convincing theory. Instead, progress will mean replacing bad advice with good. Take, for instance, the recommended step of ‘normalizing’ variables before doing k-means clustering or before doing regularized regression. The idea of normalization is simple enough: put each variable on the same scale. But it is also completely weird. Why should we put each variable on the same scale? Some variables are plausibly more substantively important than others and we ideally want to prorate by that.
What Can We Learn?
The first point is about teaching machine learning. Bricklaying is thought to be best taught via apprenticeship. And core scientific principles are thought to be best taught via books and lecturing. Machine learning is closer to the bricklaying end of the spectrum. First, there is a lot in machine learning that is ad hoc and beyond scientific or even good intuitive explanation and hence taught as something you do. Second, there is plausibly much to be learned in seeing how others trial-and-error and come up with kludges to fix the issues for which there is no guidance.
The second point is about the maturity of machine learning. Over the last few decades, we have been able to accomplish really cool things with machine learning. And these accomplishments distract us from how early we are. The fact is that we have been able to achieve cool things with very crude tools. For instance, out-of-sample (OOS) validation is a crude but very commonly used tool for preventing overfitting: we stop optimization when the OOS error starts increasing. As our scientific understanding deepens, we will likely invent better tools. The best of machine learning is a long way off. And that is exciting.
The optimism in Internet browsing is palpable. Browse long enough, and you will have a ‘hit.’ Like gambling, which it mimics, lay browsing is a losing proposition. A better way to spend your time is to focus on known knowns (excellent teachers, communicators, etc.) and on the core ideas, insights, and big hits of a discipline (along with learning how disciplines solve problems). The rationale for the first is obvious. The rationale for the second point is threefold:
- Because we often scavenge information, many of us are not well versed in the core principles of the discipline (and adjacent disciplines) we purport to specialize in or want to learn about.
- The core ideas, the big hits, etc., by their very nature, are important and illuminating.
- Many of these big ideas are accessible, partly because people have spent time thinking about ways to communicate the points. So you will find excellent distillations of the points, and you will find that many of these ideas are on your knowledge frontier (things you can learn immediately).
Or, if you are disciplined enough, focus relentlessly on finding new things in a narrow niche. Going from gambling to anything else is not easy. The highs won’t be as high. But the average high and ROI will be a lot greater.
One conventional definition of group fairness is that the ML algorithms produce predictions where the FPR (or FNR or both) is the same across groups. Fixating on equating FPR etc. can harm the very groups we are trying to help. So it may be useful to rethink how to solve the problem of reducing unfairness.
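For concreteness, the FPR for a group is the share of that group's true negatives that the model flags as positive, FP / (FP + TN). A minimal sketch with made-up records (the groups, labels, and predictions are all hypothetical):

```python
def false_positive_rate(records, group):
    """records holds (group, y_true, y_pred) with 0/1 labels. FPR = FP / (FP + TN)."""
    preds_for_negatives = [
        pred for g, truth, pred in records if g == group and truth == 0
    ]
    return sum(preds_for_negatives) / len(preds_for_negatives)


# Hypothetical predictions for two groups.
records = [
    ("A", 0, 1), ("A", 0, 0), ("A", 0, 0), ("A", 0, 0), ("A", 1, 1),
    ("B", 0, 1), ("B", 0, 1), ("B", 0, 0), ("B", 0, 0), ("B", 1, 1),
]

for group in ("A", "B"):
    print(group, false_positive_rate(records, group))
```

Under the definition above, a model with these predictions would be called unfair because group B's FPR (0.5) is twice group A's (0.25).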
One big reason why the FPR may vary across groups is that, given the data, some groups’ outcomes are less predictable than others. This may be because of the limitations of the data itself or because of the limitations of algorithms. For instance, Kearns and Roth in their book bring up the example of college admissions. The training data for college admissions is the decisions made by college counselors. College counselors may well be worse at predicting the success of minority students because they are less familiar with their schools, groups, etc., and this, in turn, may lead to algorithms performing worse on minority students. (Assume the algorithm to be human decision-makers and the point becomes immediately clear.)
One way to address worse performance may be to estimate the uncertainty of the prediction. This allows us to deal with people with wider confidence bounds separately from people with narrower confidence bounds. The optimal strategy for people with wider confidence bounds may be to collect additional data to become more confident in those predictions. For instance, Komiyama and Noda propose something similar (pdf) to help overcome a lack of information during hiring. Or we may need to figure out a way to compensate people based on their uncertainty interval.
The average width of the uncertainty interval across groups may also serve as a reasonable way to diagnose this particular problem.