A Benchmark For Benchmarks

30 Dec

Benchmark datasets like MNIST, ImageNet, etc., abound in machine learning. Such datasets stimulate work on a problem by providing an agreed-upon mark to beat. Many of the benchmark datasets, however, are constructed in an ad hoc manner. As a result, it is hard to understand why the best-performing models vary across different benchmark datasets (see here), to compare models, and to confidently prognosticate about performance on a new dataset. To address such issues, in the following paragraphs, we provide a framework for building a good benchmark dataset.

I am looking for feedback. Please let me know how I can improve this.

Inter-Group Prejudice

16 Dec

Prejudice is a bane of humanity. Unjustified aversive beliefs and affect are the primary proximal causes of aversive behavior toward groups. Such beliefs and sentiments cause aversive speech and physical violence. They also serve as justification for denying people rights and opportunities. Prejudice also creates a deadweight loss. For instance, many people refuse to trade with groups they dislike. Prejudice is the reason why so many people lead diminished lives.

So why do so many people have incorrect aversive beliefs about other groups (and commensurately, unjustified positive beliefs about their group)?

If you have suggestions about how to improve the essay, please email me.

Compensation With Currency With No Agreed Upon Value

14 Dec

Equity is an integral part of start-up compensation. However, employees and employers may disagree about the value of equity. Employers, for instance, may value equity higher than potential employees because they have access to better data or simply because they are more optimistic. One consequence of the disagreement between potential employees’ and employers’ valuations of equity is that some salary negotiations may fail. In the particular scenario that I highlight above, one way out of the quandary may be to endow an employee with options commensurate with their lower valuation and have a buy-back clause if the employer’s prediction pans out (when the company is valued in the next round or during exit). Another way to interpret this particular trade is as trading risk for a cap on the upside. Thus, this kind of strategy may also be useful where employees are more risk-averse than employers.

Optimally Suboptimal: Behavioral-Economic Product Features

14 Dec

Booking travel online feels like shopping in an Indian bazaar: a deluge of options, no credible information, aggressive hawkers (“recommendations” and “targeted ads”), and hours of frantic search that end with purchasing something more out of exhaustion than conviction. Online travel booking is not unique in offering this miserable experience. Buying on Amazon feels like a similar sand trap. But why is that? Poor product management? A more provocative but perhaps more accurate answer is that the product experience, largely unchanged or becoming worse in the case of Amazon, is “optimal.” Many people enjoy the “hunt.” They love spending hours on end looking for a deal, comparing features, and collecting and interpreting wisps of information. To satiate this need, the “optimal” UI for a market may well be what you see on Amazon or travel booking sites. The lack of trustworthy information is a feature, not a bug.

The point applies more broadly. A range of products have features that have no other purpose than gaming behavioral concerns. Remember the spinning wheel on your tax preparation software as the software looks for all the opportunities to save you money? That travesty is in the service of convincing users that the software is ‘working hard.’ Take another example. Many cake mixes sold today require you to add an egg. That ruse was invented to give housewives (primarily the ones who were cooking say 50 years ago) the feeling that they were cooking. One more. The permanent “sales” at Macy’s and at your local grocery store mean that everyone walks out feeling like a winner. And that means a greater likelihood of you coming back again.

p.s. When the users don’t trust the website, the utility of recommendations in improving consumer surplus is ~ 0 among sophisticated users.

Related: https://gojiberries.io/2023/09/09/not-recommended-why-current-content-recommendation-systems-fail-us/

Time Will Tell

23 Nov

Part of empirical social science is about finding fundamental truths about people. It is a difficult enterprise partly because scientists only observe data in a particular context. Neither cross-sectional variation nor data that goes back at best by tens of years is often enough to come up with generalizable truths. Longer observation windows help clarify what is an essential truth and what is, at best, a contextual truth. 

Support For Racially Justified and Targeted Affirmative Action

Sniderman and Carmines (1999) find that a large majority of Democrats and Republicans oppose racially justified and targeted affirmative action policies. They find that opposition to racially targeted affirmative action is not rooted in prejudice. Instead, they conjecture that it is rooted in adherence to the principle of equality. The authors don’t say it outright but the reader can surmise that in their view, opposition to racially justified and targeted affirmative action is likely to be continued and broad-based. It is a fair hypothesis. Except 20 years later, a majority of Democrats support racially targeted and racially justified affirmative action in education and hiring (see here).

What’s the Matter with “What’s the Matter with What’s the Matter with Kansas”?

It isn’t clear Bartels was right about Kansas even in 2004 (see here) (and that isn’t to say Thomas Frank was right) but the thesis around education has taken a nosedive. See below.

Split Ticket Voting For Moderation

On the back of record split ticket voting, Fiorina (and others) theorized “divided government is the result of a conscious attempt by the voters to achieve moderate policy.” Except very quickly split ticket voting declined (with of course no commensurate radicalization of the population) (see here).

Effect of Daughters on Legislator Ideology

Having daughters was thought to lead politicians to vote more liberally (see here) but more data suggested that this stopped in the polarized era (see here). Yet more data suggested that there was no trend for legislators with daughters to vote liberally before the era covered by the first study (see here).

Why Social Scientists Fail to Predict Dramatic Social Changes

19 Nov

Soviet specialists are often derided for their inability to see the coming collapse of the Soviet Union. But they were not unique. If you look around, social scientists have very little handle on many of the big social changes that have happened over the past 70 or so years.

  1. Dramatic decline in smoking. “The percentage of adults who smoke tobacco has declined from 42% in 1965 (the first year the CDC measured this), to 12.5% in 2020.” (see here.)
  2. Large infrastructure successes in a corrupt, divided developing nation. Over the last 20 or so years, India has pulled off Aadhaar, UPI, FASTag, etc., dramatically increased the number of electrified villages, the number of people with access to toilets, the length of highways, etc.
  3. Dramatic reductions in prejudice against Italians, the Irish, Asians, Women, African Americans, LGBT, etc. (see here, here, etc.)
  4. Dramatic decline in religion, e.g., church-going, etc., in the West.
  5. Dramatic decline in marriage. “According to the study, the marriage rate in 1970 was at 76.5%, and today, it stands at just over 31%.” (see here.)
  6. Obama or Trump. Not many would have given the odds of America electing a black president in 2006. Or electing Trump in 2016.

The list probably spans all the big social changes. How many would have bet on the success of China? Or for that matter Bangladesh, whose HDI is at par with or ahead of its more established South Asian neighbors? Or the dramatic liberalization that is underway in Saudi Arabia? After all, the conventional argument before MBS was that the Saudi monarchy had made a deal with the mullahs and that any change would be met with a strong backlash.

All of that raises the question: why? One reason social scientists fail to predict dramatic social change may be that they think the present reflects the equilibrium. For instance, take racial attitudes. The theories about racial prejudice have mostly been defined by the idea that prejudice is irreducible. The second reason may be that most data that social scientists have is cross-sectional or collected over short periods, and there isn’t much you can see (especially about change) from small portholes. The primary evidence they have is about lack of change, when the world, viewed over longer time spans, is defined by astounding change on many dimensions. The third reason may be that social scientists suffer from negativity bias. They are focused on explaining what’s wrong with the world and interpreting data in ways that highlight conventional anxieties. This means that they end up interrogating progress (which is a fine endeavor) but spend too little time acknowledging and explaining real progress. Ideology also likely plays a role. For instance, few notice the long-standing progressive racial bias in television; see here for a fun example of the interpretation gymnastics.

p.s. Often, social scientists not only fail to predict the dramatic changes but also struggle, years later, to explain what underlies them. Worse, social scientists do not seem to change their mental models based on the changes.

p.p.s. So what changes do I predict? I predict a dramatic decline in caste prejudice in India because of the following reasons: 1. dramatic generational turnover, 2. urbanization, 3. uninformative last names (outside of local context and excluding a maximum of 20% of the last names), e.g., the last name ‘kumar’, which means ‘boy’, is exceedingly common, 4. high intra-group variance in physical features, 5. the preferred strategy for a prominent political party is to minimize intra-Hindu religious differences, 6. the current media + religious elites are mostly against caste prejudice. I also expect fairly rapid declines in prejudice against women (though far less steep than for caste) for some of the same reasons.

Against Complacency

19 Nov

Even the best placed among us are to be pitied. Human lives today are blighted by five things:

  1. Limited time. While we have made impressive gains in longevity over the last 100 years, our lives are still too short. 
  2. Less than excellent health. Limited lifespan is further blighted by ill-health. 
  3. Underinvestment. Think about Carl Sagan as your physics teacher, a full-time personal trainer to help you excel physically, a chef, abundant access to nutritious food, a mental health coach, and more. Or an even more effective digital or robotic analog.
  4. Limited opportunity to work on impactful things. Most economic production happens in areas where we are not (directly) working to dramatically enhance human welfare. Opportunities to work on meaningful things are further limited by economic constraints.
  5. Crude tools. The tools we work with are much too crude which means that many of us are stuck executing on a low plane.

Deductions

  1. Given where we are in terms of human development, innovations in health and education are likely the most impactful though innovations in foundational technologies like AI and computation that increase our ability to innovate are probably still more important.
  2. Given that at least a third of the economy is government money in many countries, government can dramatically affect what is produced, e.g., the pace at which we increase longevity, prevent really bad outcomes like an uninhabitable planet, etc.

Traveling Salesman

18 Nov

White-collar elites venerate travel, especially to exotic and far-away places. There is some justification for the fervor—traveling is pleasant. But veneration creates an umbra that hides some truths:

  1. Local travel is underappreciated. We likely underappreciate the novelty and beauty available locally.
  2. Virtual travel is underappreciated. We know all the ways virtual travel doesn’t measure up to the real experience. But we do not ponder enough about how the gap between virtual and physical travel has closed, e.g., high-resolution video, and how some aspects of virtual travel are better:
    1. Cost and convenience. The comfort of the sofa beats the heat and the cold, the crowds, and the fatigue.
    2. Knowledgeable guides. Access to knowledgeable guides online is much greater than offline. 
    3. New vistas. Drones give pleasing viewing angles unavailable to lay tourists.
    4. Access to less visited places. Intrepid YouTubers stream from places far off the tourist map, e.g., here.
  3. The tragedy of the commons. The more people travel, the less appealing it is for everyone because a) travelers change the character of a place and b) the crowds come in the way of enjoyment.
  4. The well-traveled are mistaken as being intellectually sophisticated. “Immersion therapy” can expand horizons by challenging perspectives. But often travel needs to be paired with books, needs to be longer, the traveler needs to make an effort to learn the language, etc., for it to be ‘improving.’
  5. Traveling by air is extremely polluting. A round-trip between LA and NYC emits 0.62 tons of CO2, which is the same as the CO2 generated from driving 1200 miles.

Limits of Harms From Affirmative Action

17 Nov

Stories abound about unqualified people getting admitted to highly selective places because of quotas. But the chances are that these are merely stories with no basis in fact. If an institution is highly selective and if the number of applicants is sufficiently large, quotas are unlikely to lead to people with dramatically lower abilities being admitted even when there are dramatic differences across groups. Relatedly, it is unlikely to have much of an impact on the average ability of the admitted cohort. If the point wasn’t obvious enough, it would be after the following simulation. Say the mean IQ of the groups differs by 1 s.d. (which is the difference between Black and White IQ in the US). Say that the admitting institution only takes 1000 people. In the no-quota regime, the top 1000 people get admitted. In the quota regime, 20% of the seats are reserved for the second group. With this framework, we can compare the IQ of the last admittee, and the mean ability of the admitted cohort, across the conditions.

# Set seed for reproducibility
set.seed(123)

# Simulate two normal distributions whose means differ by 1 s.d.
group1 <- rnorm(1000000, mean = 0, sd = 1)  # Group 1
group2 <- rnorm(1000000, mean = -1, sd = 1)  # Group 2, mean 1 sd lower than Group 1

# Combine into a dataframe with a column identifying the groups
data <- data.frame(
  value = c(group1, group2),
  group = rep(c("Group 1", "Group 2"), each = 1000000)
)

# Pick top 800 values from Group 1 and top 200 values from Group 2
top_800_group1 <- head(sort(data$value[data$group == "Group 1"], decreasing = TRUE), 800)
top_200_group2 <- head(sort(data$value[data$group == "Group 2"], decreasing = TRUE), 200)

# Combine the selected values and estimate the mean
combined_top_1000 <- c(top_800_group1, top_200_group2)

# IQ of the last six admittees (tail() returns six values by default)
round(tail(head(sort(data$value, decreasing = TRUE), 1000)), 2)
[1] 3.11 3.11 3.10 3.10 3.10 3.10

round(tail(combined_top_1000), 2)
[1] 2.57 2.57 2.57 2.57 2.56 2.56

# Means
round(mean(head(sort(data$value, decreasing = TRUE), 1000)), 2)
[1] 3.37

round(mean(combined_top_1000), 2)
[1] 3.31

# How many people in top 1000 from Group 2 in no-quota?
sorted_data <- data[order(data$value, decreasing = TRUE), ]
top_1000 <- head(sorted_data, 1000)
sum(top_1000$group == "Group 2")
[1] 22

Under no-quota, the person with the least ability who is admitted is 3.1 s.d. above the Group 1 mean, while under quota, the person with the least ability who is admitted is 2.56 s.d. above the Group 1 mean. The mean ability of the admitted cohort is virtually indistinguishable: 3.37 and 3.31 for the no-quota and quota conditions, respectively. Not to put too fine a point on it, the claim that quotas lead to gross misallocation of limited resources is likely grossly wrong. This isn’t to say there isn’t a rub. With a 1 s.d. difference, the representation in the tails is grossly skewed. Without quota, there would be just 22 people from Group 2 in the top 1000. So 178 people from Group 1 get bumped. This point about fairness is perhaps best thought of in the context of how much harm comes to those denied admission. Assuming enough supply across the range of selectivity (this is approximately true for higher education in the U.S., which has colleges at various levels of selectivity), it is likely the case that those denied admission at more exclusive institutions get admitted at slightly lower ranked institutions and do nearly as well as they would have had they been admitted to more exclusive institutions. (See Dale and Krueger, etc.)

p.s. In countries like India, 25 years ago, there was fairly limited supply at the top and large discontinuous jumps. Post liberalization of the education sector, this is likely no longer true.

p.p.s. What explains the large racial gap in SAT scores of the admittees to Harvard? It is likely that it is founded in Harvard weighing factors such as athletic performance in admission decisions.

Missing Market for Academics

16 Nov

There are a few different options for buying time with industry experts, e.g., https://officehours.com/, https://intro.co/, etc. However, there is no marketplace for buying academics’ time. Some surplus is likely lost as a result. For one, some academics want advice on what they write. To get advice, they have three choices: academic friends, reviewers, or interested academics at conferences or talks. All three have their problems. Or they have to resort to informal markets, as Kahneman did.

“He called a young psychologist he knew well and asked him to find four experts in the field of judgment and decision-making, and offer them $2,000 each to read his book and tell him if he should quit writing it. “I wanted to know, basically, whether it would destroy my reputation,” he says. He wanted his reviewers to remain anonymous, so they might trash his book without fear of retribution.”

https://www.vanityfair.com/news/2011/12/michael-lewis-201112

For what it’s worth, Kahneman’s book still had major errors. And that may be the point. Had he had access to a better market, with ratings on the ability to review quantitative material, he might have avoided the errors. A fully fleshed-out market could let workers price discriminate based on whether the author is a graduate student or a tenured professor at a top-ranked private university. Such a market may also prove a useful revenue stream for academics with time and talent who want additional money.

Reviewing is but one example. Advice on navigating the academic job market, research design, etc., can all be sold.

Striking Changes Among Democrats on Race and Gender

10 Nov

The election of Donald Trump led many to think that Republicans have changed, especially on race-related issues. But the data suggest that the big changes in public opinion on racial issues over the last decade or so have been among Democrats. Since 2012, Democrats have become strikingly more liberal on race and on issues related to women and LGBT people.

Conditions Make It Hard for Blacks to Succeed

The percentage of Democrats strongly agreeing with the statement more than doubled between 2012 (~ 20%) and 2020 (~ 45%).

Source: ANES

Affirmative Action in Hiring/Promotion

The percentage of Democrats for affirmative action for Blacks in hiring/promotion nearly doubled between 2012 (~ 26%) and 2020 (~ 51%).

Source: ANES

Fun fact: Support for caste based and gender based reservations in India is ~4x+ higher than support for race based Affirmative Action in the US. See here.

Blacks Should Not Get Special Favors to Get Ahead

The percentage of Democrats strongly disagreeing with the statement nearly tripled between 2012 (~ 13%) and 2020 (~ 41%).

Source: ANES

See also Sniderman and Carmines who show that support for the statement is not rooted in racial prejudice.

Feelings Towards Racial Groups

Democrats in 2020 felt more warmly toward Blacks, Hispanics, and Asians than Whites.

Source: ANES

White Democrats’ Feelings Towards Various Racial Groups

White Democrats in 2020 felt more warmly toward Asians, Blacks, and Hispanics than Whites.

Democrats’ Feelings Towards Gender Groups

Democrats felt 15 points more warmly toward feminists and LGBT in 2020 than in 2012.

Source: ANES

American PII: Lapses in Securing Confidential Data

23 Sep

At least 83% of Americans have had confidential data that they shared with a company breached (see here and here). The list of most frequently implicated companies in the loss of confidential data makes for sobering reading. Reputable companies like LinkedIn (Microsoft), Adobe, Dropbox, etc., are among the top 20 worst offenders.

Source: Pwned: The Risk of Exposure From Data Breaches

There are two other seemingly contradictory facts. First, many of the companies that haven’t been able to safeguard confidential data have some kind of highly regarded security certification like SOC-2 (see, e.g., here). The second is that many data breaches are caused by elementary errors, e.g., “the password cryptography was poorly done and many were quickly resolved back to plain text” (here).

The explanation for why companies with highly regarded security certifications fail to protect the data is probably mundane. Supporters of these certifications may rightly claim that these certifications dramatically reduce the chances of a breach without eliminating it. And a 1% error rate can easily lead to the observation we started with.
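To see how a small per-company failure rate compounds, here is a minimal sketch in Python. It assumes (these are my assumptions, not claims from the sources above) that breaches are independent across companies and that every company a person shares data with has the same chance of being breached. The 1% rate comes from the paragraph above; the company counts are purely illustrative.

```python
def prob_at_least_one_breach(per_company_rate: float, n_companies: int) -> float:
    """P(at least one breach) = 1 - (1 - p)^n, assuming independent breaches."""
    if not 0.0 <= per_company_rate <= 1.0:
        raise ValueError("per_company_rate must be between 0 and 1")
    if n_companies < 0:
        raise ValueError("n_companies must be non-negative")
    return 1.0 - (1.0 - per_company_rate) ** n_companies


if __name__ == "__main__":
    # Illustrative counts of companies holding one person's data.
    for n in (10, 50, 100, 180):
        p = prob_at_least_one_breach(0.01, n)
        print(f"{n:>4} companies -> P(at least one breach) = {p:.2f}")
```

Under these assumptions, a 1% chance per company is enough to push the chance of at least one breach past 83% once a person's data sits with roughly 180 companies.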

So, how do we secure data? Before discussing solutions, let me describe the current state. In many companies, PII data is spread across multiple databases. Data protection is based on processes set up for controlling access to data. The data may also be encrypted, but it generally isn’t. Many of these processes to secure the data are also auditable and certifications are granted based on audits.

Rather than relying on adherence to processes, a better bet might be to not let PII data percolate across the system. The primary options for prevention are customer-side PII removal and ingestion-time PII removal. (Methods like differential privacy can be used at either end and in how automated data collection services are set up.) Beyond these systems, you need a system for cases where PII data is shown in the product. One way to handle such cases is to build a system where the PII is hashed during ingest and looked up right before serving from a system that is yet more tightly access controlled. All of these things are well known. Their lack of adoption is partly due to the fact that these services have yet to be abstracted out enough that adding them is as easy as editing a YAML file. And there lies an opportunity.
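The hash-at-ingest, look-up-before-serving idea can be sketched in a few lines of Python. Everything here is illustrative: `PiiVault` and `ingest` are hypothetical names, the in-memory dictionary stands in for the more tightly access-controlled system, and the use of a keyed hash (HMAC-SHA256) rather than a plain hash is my assumption, chosen because unkeyed hashes of low-entropy values like emails can be reversed by guessing.

```python
import hashlib
import hmac
import secrets


class PiiVault:
    """Hypothetical in-memory stand-in for a tightly access-controlled PII store."""

    def __init__(self) -> None:
        self._key = secrets.token_bytes(32)  # secret key that never leaves the vault
        self._store = {}                     # token -> original PII value

    def tokenize(self, value: str) -> str:
        """Called at ingest: return a keyed hash and remember the original value."""
        token = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256).hexdigest()
        self._store[token] = value
        return token

    def detokenize(self, token: str) -> str:
        """Called right before serving; a real system would check authorization here."""
        return self._store[token]


def ingest(record: dict, pii_fields: set, vault: PiiVault) -> dict:
    """Replace PII fields with tokens so downstream systems never see raw values."""
    return {
        key: vault.tokenize(value) if key in pii_fields else value
        for key, value in record.items()
    }


if __name__ == "__main__":
    vault = PiiVault()
    raw = {"user_id": 42, "email": "jane@example.com", "plan": "pro"}
    safe = ingest(raw, {"email"}, vault)
    print(safe)                             # email is now a 64-character token
    print(vault.detokenize(safe["email"]))  # original value, fetched only at serve time
```

Because the same input always maps to the same token, downstream tables can still be joined on the tokenized field without ever holding the raw value.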

Not Recommended: Why Current Content Recommendation Systems Fail Us

9 Sep

Recommendation systems paint a wonderful picture: The system automatically gets to know you and caters to your preferences. And that is indeed what happens except that the picture is warped. Warping happens for three reasons. The first is that humans want more than immediate gratification. However, the systems are designed to learn from signals that track behaviors in an environment with strong temptation and mostly learn “System 1 preferences.” The second reason is use of the wrong proxy metric. One common objective function (on content aggregation platforms like YouTube, etc.) is to maximize customer retention (a surrogate for revenue and profits). (It is likely that the objective function doesn’t vary between subscribers and ad-based tier.) And the conventional proxy for retention is time spent on a product. It doesn’t matter much how you achieve that; the easiest way is to sell Fentanyl. The third problem is the lack of good data. Conventionally, the choices of people whose judgment I trust (and the set of people whose judgments these people trust) are a great signal. But they do not make it directly into recommendations on platforms like YouTube, Netflix, etc. Worse, recommendations based on similarity in consumption don’t work as well because of the first point. And recommendations based on the likelihood of watching often reduce to recommending the most addictive content. 

Solutions

  1. More Control. To resist temptation, humans plan ahead, e.g., don’t stock sugary snacks at home. By changing the environment, humans can more safely navigate the space during times when impulse control is weaker.
    • Rules. Let people write rules for the kinds of video they don’t want to be offered.
    • Source filtering. On X (formerly Twitter), for instance, you can curate your feed by choosing who to follow. (X has ‘For You’ and ‘Following’ tabs.) The user only sees tweets that the users they follow tweet or retweet. (On YouTube, you can subscribe to channels but the user sees more than the content produced by the channels they subscribe to.)
    • Time limits. Let people set time limits (for certain kinds of content).
    • Profiles. Offer a way to switch between profiles.
  2. Better Data
    • Get System 2 Data. Get feedback on what people have viewed at a later time. For instance, in the history view, allow people to score their viewing history.
    • Network data. Only get content from people whose judgment you trust. This is different from source filtering under #1, which proposes allowing filtering on content producers.
  3. Information. Provide daily/weekly/monthly report cards on how much time was spent watching what kind of content, and at what times of the day/week the person respected their self-recorded (longer-term) preferences.
  4. Storefronts. Let there be a marketplace of curation services (curators). And let people visit the ‘store’ (a particular version of curation) rather than the warehouse.

Acknowledgment. The article benefitted from discussion with Chris Alexiuk and Brian Whetter.

Not Normal, Optimal

27 Aug

Reports of blood work generally include guides for normal ranges. For instance, for LDL-C, in the US, a score of < 100 (mg/dL) is considered normal. But neither the reports nor doctors have much to say about what LDL-C level to aspire to. The same holds true for things like the A1c. Based on statin therapy studies, it appears there are benefits to reducing LDL-C to 70 (and likely further). Informing people what they can do to maximize their lifespan based on available data is likely useful.

Source: Chickpea And Bean

Lest this cause confusion, the point is orthogonal to personalized ranges of ‘normal.’ Most specialty associations provide different ‘target’ ranges for people with different co-morbidities. For instance, older people with diabetes (a diagnosis of diabetes is based on a somewhat arbitrary cut-off) are recommended to aim for LDL-C levels below 70. My point is simply that the lifespan maximizing number may be 20. None of this is to say that it is achievable or that the patient would choose the trade-offs, e.g., eating boiled vegetables, taking statins (which have their own side-effects), etc. It isn’t even to say that the trade-offs would have a positive expected value. (I am assuming that the decision to medicate or not is based on an expected value calculation with the relevant variables being the price of a disability-adjusted life-year (~ $70k in the US), and the cost of the medicine (including side-effects).) But it does open up the opportunity to ask the patient to pay for their medicine. (The DALY is but the mean. The willingness to pay for DALY may vary substantially and we can fund everything above the mean by asking the payer.)

Smallest Loss That Compute Can Buy

15 Aug

With Chris Alexiuk and Atul Dhingra

The most expensive portion of model training today is GPU time. Given that, it is useful to ask what is the best way to spend the compute budget. More formally, the optimization problem is: minimize test loss given a FLOPs budget. To achieve the smallest loss, there are many different levers that we can pull, including, 

  1. Amount of data. 
  2. Number of parameters. There is an implicit trade-off between this and the previous point given a particular amount of compute. 
  3. Optimization hyperparameters, e.g., learning rate, learning rate schedule, batch size, optimizer, etc.
  4. Model architecture
    1. Width-to-depth ratio.
    2. Deeper aspects of model architecture, e.g., RETRO, MoE models like switch transformers, MoE with expert choice, etc.
  5. Precision in which the parameters and hyperparameters are stored.
  6. Data quality. As some of the recent work shows, data quality matters a lot. 

We could reformulate the optimization problem to make it more general. For instance, rather than use FLOPs or GPU time, we may want to use dollars. This opens up opportunities to think about how to purchase GPU time most cheaply, e.g., using spot GPUs. We can abstract out the optimization problem further. If we knew the ROI of the prediction task, we could ask what is the profit-maximizing loss given a constraint on latency. Inference ROI is a function of ~ accuracy (or another performance metric of choice) and the compute cost of inference.

What Do We Know?

Kaplan et al. (2020) and Hoffmann et al. (2022) study a limited version of the problem for autoregressive modeling of language using dense (as opposed to Mixture-of-Experts) transformer models. The papers primarily look at #1 and #2, though Hoffmann et al. (2022) also study the impact of learning rate schedule and Kaplan et al. (2020) provide limited analysis of width-to-depth ratio and batch size (see separate paper featuring Kaplan).

Kaplan et al. uncover a host of compelling empirical patterns, including:

  1. Power Laws. “The loss scales as a power-law with model size, dataset size, and the amount of compute used for training.”
  2. Future test loss is predictable. “By extrapolating the early part of a training curve, we can roughly predict the loss that would be achieved if we trained for much longer.” 
  3. Models generalize. “When we evaluate models on text with a different distribution than they were trained on, the results are strongly correlated to those on the training validation set with a roughly constant offset in the loss.”
  4. Don’t train till convergence. “[W]e attain optimal performance by training very large models and stopping significantly short of convergence.” This is a great left-field find. You get the same test loss with a larger model that is not trained to convergence as with a smaller model trained till convergence, except it turns out that the former is compute optimal.

Hoffmann et al. assume #1, replicate #2 and #4, and have nothing to say about #3. One place where the papers differ is around the specifics of the claim about large models’ sample efficiency with implications for #4. Both agree that models shouldn’t be trained till convergence but whereas Kaplan et al. conclude that “[g]iven a 10x increase computational budget, … the size of the model should increase 5.5x while the number of training tokens should only increase 1.8x” (Hoffmann et al.), Hoffmann et al. find that “model size and the number of training tokens should be scaled in equal proportions.” Because of this mismatch, Hoffmann et al. find that most commercial models (which are trained in line with Kaplan et al.’s guidance) are undertrained. They drive home the point by showing that a 4x smaller model (Chinchilla) with 4x the data outperforms (this bit is somewhat inconsistent with their prediction) the larger model (Gopher) (both use the same compute). They argue that Chinchilla is optimal given that inference (and fine-tuning) costs for smaller models are lower.
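The two prescriptions quoted above can be compared numerically. The sketch below is in Python and rests on two assumptions of mine: that the quoted 10x rule extrapolates as a power law to other budget increases, and that training compute is roughly proportional to parameters times tokens, so that "equal proportions" means each grows with the square root of compute. The function and constant names are illustrative.

```python
import math

# Exponents implied by the quoted 10x rule: 5.5x parameters, 1.8x tokens.
# (Assumption: the rule extrapolates as a power law in compute.)
KAPLAN_RULE = (math.log10(5.5), math.log10(1.8))

# "Equal proportions": with compute ~ parameters * tokens (assumption),
# parameters and tokens each scale with the square root of compute.
EQUAL_RULE = (0.5, 0.5)


def scale_up(compute_multiplier: float, exponents: tuple) -> tuple:
    """Return (parameter multiplier, token multiplier) for a compute increase."""
    if compute_multiplier <= 0:
        raise ValueError("compute_multiplier must be positive")
    param_exp, token_exp = exponents
    return compute_multiplier ** param_exp, compute_multiplier ** token_exp


if __name__ == "__main__":
    for k in (10, 100):
        kn, kd = scale_up(k, KAPLAN_RULE)
        en, ed = scale_up(k, EQUAL_RULE)
        print(f"{k:>3}x compute | first rule: {kn:.1f}x params, {kd:.1f}x tokens"
              f" | equal proportions: {en:.1f}x params, {ed:.1f}x tokens")
```

For a 100x budget, the first rule implies roughly 30x the parameters and 3.2x the tokens, while equal proportions implies 10x of each, which is one way to see why models sized under the first rule end up with far fewer training tokens per parameter.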

All of this means that there is still much to be discovered. But the discovery of patterns like the power law leaves us optimistic about the discovery of other interesting patterns.

Why Are the Prices the Same?

14 Aug

From https://www.walmart.com/tp/kellogg-cereals

From https://www.walmart.com/browse/food/ice-cream/hagen-dazs/976759_976791_9551235_5459614/

Many times within a narrow product category like breakfast cereals, ice cream tubs, etc., the prices of different varieties within a brand are the same. The same pattern continues in many ice cream stores where you are charged for the quantity instead of the flavor or the vessel in which ice cream is served. It is unlikely that input costs are the same across varieties. So what explains it? It could be that the prices are the same because the differences in production costs are negligible. Or it could be that retailers opt for uniform pricing because of managerial overhead (see also this paper). Or there could be behavioral reasons. Consumers may shop in a price-conscious manner if the prices are different and may buy less. 

Breakfast cereals have another nuance. As you can see in the graphic above, the weight of the ‘family size’ box (which has the same size and shape) varies. It may be because there are strong incentives to keep the box size the same. This in turn may be because of stocking convenience or behavioral reasons, e.g., consumers may think they are judging between commensurate goods if the boxes are the same size. (It could also be that consumers pay for volume not weight.)

Cracking the Code: Addressing Some of the Challenges in Research Software

2 Jul

Macro Concerns

  1. Lack of Incentives for Producing High-Quality Software. Software’s role in enabling and accelerating research cannot be overstated. But the incentives for producing software in academia are still very thin. One reason is that people do not cite the software they use; the academic currency is still citations.
  2. Lack of Ways to Track the Consequences of Software Bugs (Errors). (Quantitative) Research outputs are a function of the code researchers write themselves and the third-party software they use. Let’s assume that the peer review process vets the code written by the researcher. This leaves code written by third-party developers. What precludes errors in third-party code? Not much. The code is generally not peer-reviewed though there are efforts underway. Conditional on errors being present, there is no easy way to track bugs and their impact on research outputs.
  3. Developers Lack Data on How the Software is Being (Mis)Used. The open-source research software community hasn’t caught up with the modern software revolution. Most open-source research software is still distributed as a binary and emits no logs that can be analyzed by the developer. The only way a developer becomes aware of an issue is when a user reports it. As a result, developers miss errors that don’t cause alerts or failures, e.g., when a user passes data that is inconsistent with the assumptions made when designing the software, and they miss other usage-based insights about how to improve the software.

Conventional Reference Lists Are the Wrong Long-Term Solution for #1 and #2

Unlike ideas, which need to be explicitly cited, software dependencies are naturally made explicit in the code. Thus, there is no need for conventional reference lists (~ a bad database). If all the research code is committed to a system like GitHub (Dataverse lacks the tools for #2) with enough meta information about (the precise version of the) third-party software being used, e.g., import statements in R, etc., we can create a system like the GitHub dependency graph to calculate the number of times software has been used (and these metrics can be shown on Google Scholar, etc.) and also create systems that trigger warnings to authors when consequential updates to underlying software are made. (See also https://gojiberries.io/2019/03/22/countpy-incentivizing-more-and-better-software/).
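As a rough illustration of how usage counts could be derived from committed code, here is a minimal sketch that reads the import statements in Python scripts and counts how many scripts use each package. The file names and contents are hypothetical, the function names are mine, and the sketch ignores versions entirely; tracking the precise version would need additional metadata such as lock files.

```python
import ast
from collections import Counter


def imported_packages(source: str) -> set[str]:
    """Return the top-level package names imported by one Python script."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports, which point at local code.
            packages.add(node.module.split(".")[0])
    return packages


def usage_counts(scripts: dict[str, str]) -> Counter:
    """Count, for each package, the number of scripts that import it."""
    counts = Counter()
    for source in scripts.values():
        counts.update(imported_packages(source))
    return counts


# Hypothetical replication files, keyed by file name.
example_scripts = {
    "paper_a/analysis.py": "import numpy as np\nfrom pandas import DataFrame\n",
    "paper_b/model.py": "import numpy\nimport scipy.stats\n",
}

counts = usage_counts(example_scripts)
print(counts.most_common())
```

Counting scripts rather than import statements keeps one file with many imports of the same package from inflating the tally.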

Conventional reference lists may, however, be the right short-term solution. But the goalpost moves to how to drive citations. One reason researchers do not cite software is that they don’t see others doing it. One way to cue that software should be cited is to show a message when the software is loaded: please cite the software. Such a message can also serve as a reminder for people who merely forget to cite the software. For instance, my hunch is that one of the reasons stargazer has been cited more than 1,000 times (June 2023) is that the package produces a message via .onAttach to remind the user to cite the package. (See more here.)

Solution for #3

Spin up a server that open source developers can use to collect logs. Provide tools to collect remote logs. (Sample code.)
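A minimal sketch of what the client side of such a tool might look like, using Python's standard logging module. The handler class, the logger name, and the idea of injecting a `send` callable are my assumptions for illustration; in practice `send` would post the payload to the log server (ideally only after the user opts in), and here a plain list stands in for the server.

```python
import json
import logging


class RemoteLogHandler(logging.Handler):
    """Forward each log record as a JSON string to a caller-supplied sender."""

    def __init__(self, send):
        super().__init__()
        self.send = send  # callable that ships one JSON string to the log server

    def emit(self, record):
        payload = {
            "logger": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        try:
            self.send(json.dumps(payload))
        except Exception:
            # A failure to ship logs should never break the user's analysis.
            self.handleError(record)


# Stand-in for the remote server: collect payloads in a list.
received = []

logger = logging.getLogger("mypackage.usage")  # hypothetical package name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(RemoteLogHandler(received.append))

# Example: flag an input that violates an assumption without raising an error.
logger.warning("input has %d missing values", 3)
print(received)
```

Keeping the transport as an injected callable means the same handler can be tested offline and pointed at whatever server the developer ends up using.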

p.s. Here’s code for deriving software citations statistics from replication files.

When Is Discrimination Profit-Maximizing?

16 May

Consider the following scenario: There are multiple firms looking to fill identical jobs. And there are multiple eligible workers for each job opening. Both the companies and the workers have perfect information, which they are able to acquire without cost. Assume also that employees can switch jobs without cost. Under these conditions, it is expensive for employers to discriminate. If company A prejudicially excludes workers from Group X, company B can hire the same workers at a lower rate (given that the demand for them is lower) and outcompete company A. It thus stands to reason that discrimination is expensive. Some people argue that, for these reasons, we do not need anti-discrimination policies.

There is a crucial, well-known, but increasingly under-discussed nuance to the above scenario. When consumers or co-workers also discriminate, it may be profit-maximizing for a firm to discriminate. And the point fits the reality of 60 years ago when many hiring ads specifically banned African Americans from applying (‘Whites only’, ‘Jews/Blacks need not apply’, etc.), many jobs had dual wage scales, and explicitly segregated job categories existed. A similar point applies to apartment rentals. If renters discriminate by the race of the resident, the optimal strategy for an apartment block owner is to discriminate by race. Indian restaurants provide another example. If people prefer Brahmin cooks (for instance, see here, here, and here), the profit-maximizing strategy for restaurants is to look for Brahmin cooks (for instance, see here). All of this is to say that under these conditions, you can’t leave it to the markets to stop discrimination.

Generative AI and the Market for Creators

26 Apr

Many widely used machine-learning models rely on copyrighted data. For instance, Google finds the most relevant web pages for a search term by relying on a machine learning model trained on copyrighted web data. But the use of copyrighted data by machine learning models that generate content (or give answers to search queries rather than link to sites with the answers) poses new (reasonable) questions about fair use. By not sharing the proceeds, such systems also kill the incentives to produce the original content on which they rely. For instance, if we don’t incentivize content producers, e.g., people who respond to Stack Overflow questions, the ability of these models to answer questions in new areas is likely to be lower. The concern about fair use can be addressed by training on data from content producers that have opted to share their data. The second problem is more challenging. How do you build a system that shares proceeds with content producers?

One solution is licensing. Either each content creator licenses data independently or becomes part of a consortium that licenses data in bulk and shares the proceeds. (Indeed Reddit, SO, etc. are exploring this model though they have yet to figure out how to reward creators.) Individual licensing is unlikely to work at scale so let’s interrogate the latter. One way the consortium could work is by sharing the license fee equally among the creators, perhaps pro-rated by the number of items. But such a system can easily be gamed. Creators merely need to add a lot of low-quality content to bump up their payout. And I expect new ‘creators’ to flood the system. In equilibrium, it will lead to two bad outcomes: 1. An overwhelming majority of the content is junk. 2. Nobody is getting paid much.

The consortium could solve the problem by limiting what gets uploaded but it is expensive to do. Another way to solve the problem is by incentivizing at a person-item level. There are two parts to this: establishing what was used and how much, and pro-rating the payouts by value. To establish what item was used in what quantity, we may want a system that estimates how similar the generated content is to the underlying items. (This is an unsolved problem.) The payout would be prorated by similarity. But that may not incentivize creators who value their content a lot, e.g., Drake, to be part of the pool. One answer to that is to craft specialized licensing agreements as is commonly done by streaming platforms. Another option would be to price the contribution. One way to price the contribution would be to generate counterfactuals (remove an artist) and price them in a marketplace. But it is possible to imagine that there is natural diversity in what is created and you can model the marginal contribution of an artist. The marketplace analogy is flawed because there is no one marketplace. So the likely way out is for all major marketplaces to subscribe to some credit allocation system.
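The pro-rating step is simple once similarity scores exist. Here is a minimal sketch, assuming we somehow already have a non-negative similarity score per creator for a piece of generated content (estimating those scores is the unsolved part). The function name, creator names, and numbers are all hypothetical.

```python
def prorate_payouts(license_fee, similarity_by_creator):
    """Split a license fee across creators in proportion to similarity scores."""
    total = sum(similarity_by_creator.values())
    if total <= 0:
        # No measurable contribution from anyone, so nobody is paid.
        return {creator: 0.0 for creator in similarity_by_creator}
    return {
        creator: license_fee * score / total
        for creator, score in similarity_by_creator.items()
    }


# Hypothetical similarity scores for one generated item.
payouts = prorate_payouts(
    1000.0,
    {"creator_a": 0.6, "creator_b": 0.3, "creator_c": 0.1},
)
print(payouts)
```

Because payouts depend on similarity to what is actually generated rather than on the number of items uploaded, flooding the pool with low-quality content would not by itself raise a creator's share.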

Money is but one reason why people produce. Another reason people produce content is so that they can get rewarded for their reputations, e.g., SO. Generative systems built on these data, however, have not been implemented in a way that keeps these markets intact. The current systems reduce traffic and do not give credit to the people whose answers they learn from. The result is that developers have less of an incentive to post to SO. And SO licensing its content doesn’t solve this problem. Directly tying generative models to user reputations is hard, partly because generative models are probabilistically mixing things and may not produce the right answer. But if the signal is directionally correct, it could be fed back to the reputation scores of creators.

How Numerous Are the Numerate?

14 Feb

I recently conducted a survey on Lucid and posed a short quiz to test basic numeracy:

  • A man writes a check for $100 when he has only $70.50 in the bank. By how much is he overdrawn? — $29.50, $170.50, $100, $30.50
  • Imagine that we roll a fair, six-sided die 1000 times. Out of 1000 rolls, how many times do you think the die would come up as an even number? — 500, 600, 167, 750
  • If the chance of getting a disease is 10 percent, how many people out of 1,000 would be expected to get the disease? — 100, 10, 1000, 500
  • In a sale, a shop is selling all items at half price. Before the sale, the sofa costs $300. How much will it cost on sale? — $150, $100, $200, $250
  • A second-hand car dealer is selling a car for $6,000. This is two-thirds of what it cost new. How much did the car cost new? — $9,000, $4,000, $12,000, $8,000
  • In the BIG BUCKS LOTTERY, the chances of winning a $10 prize are 1%. What is your best guess about how many people would win a $10 prize if 1000 people each buy a single ticket from BIG BUCKS? — 10, 1, 100, 50

I surveyed 800 adult Americans. Of the 800, only 674 respondents (about 84%) cleared the attention check—a question designed to test if the respondents were paying attention or not. I limit the analysis to these 674 respondents.

A caveat before the results. I do not adjust the scores for guessing.
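For reference, here is a minimal sketch of what a standard correction for guessing would look like for a quiz like this one. It is not applied anywhere in this post. It assumes every wrong answer comes from a random guess among the four options, which is a strong assumption, and the function name is hypothetical.

```python
def guess_adjusted_score(n_correct, n_wrong, n_options=4):
    """Classic correction for guessing: right minus wrong / (options - 1)."""
    return n_correct - n_wrong / (n_options - 1)


# A respondent with 3 right and 3 wrong out of 6 items.
half_right = guess_adjusted_score(3, 3)

# Someone guessing at random on all 6 items expects 1.5 right and 4.5 wrong.
pure_guesser = guess_adjusted_score(1.5, 4.5)

print(half_right, pure_guesser)
```

Under this correction, a pure guesser's expected adjusted score is zero, so the unadjusted proportions reported here somewhat overstate numeracy.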

Of these respondents, just about a third got all the answers correct. Another quarter got 5 out of 6 correct. Another 19% got 4 out of 6 right. The remaining 20% got 3 or fewer questions right. The table below enumerates the item-wise results.

Item          Proportion Correct
Overdraft     .83
Dice          .68
Disease       .88
Sofa Sale     .97
Car           .66
Lottery       .63

The same numbers are plotted below.

p.s. You may be interested in reading this previous blog based on MTurk data.