Congenial Invention and the Economy of Everyday (Political) Conversation

31 Dec

Communication comes from the Latin word communicare, which means `to make common.’ We communicate not only to transfer information, but also to establish and reaffirm identities, mores, and meanings. (From my earlier note on a somewhat different aspect of the economy of everyday conversation.) Hence, there is often a large incentive for loyalty. More generally, there are three salient aspects to most private interpersonal communication about politics — shared ideological (or partisan) loyalties; little knowledge of, and little prior thinking about, political issues; and a premium on cynicism. The second of these points — ignorance — cuts both ways. It allows for the possibility of getting away with saying something that doesn’t make sense (or isn’t true). And it also means that people need to invent stuff if they want to sound smart, etc. (Looking stuff up is often still too hard. I am often puzzled by that.)

But don’t people know that they are making stuff up? And doesn’t that stop them? A defining feature of humans is overconfidence. And people oftentimes aren’t aware of the depth of the wells of their own ignorance. And if it sounds right, well, it is right, they reason. The act of speaking is many a time an act of invention (or discovery). And we aren’t sure of, and don’t actively control, how we create. (The underlying mechanisms behind how we create — for instance, reliance on ‘gut’ — are well known.) Unless we are very deliberate in speech. Most people aren’t. (There generally aren’t incentives to be.) And most find it hard to vet the veracity of the invention (or discovery) in the short time that passes between invention and vocalization.

I Recommend It: A Recommender System For Scholarship Discovery

1 Oct

The rate of production of scholarship has never been higher. And while our ability to discover relevant scholarship per unit of time has not kept pace with the production of knowledge, it too has risen sharply—most recently, thanks to Google Scholar.

The efficiency of discovery of relevant scholarship, however, has plateaued over the last few years, even as the rate of production of scholarship has kept its steady upward climb. Part of the reason is that current ways of doing things cannot be made considerably more efficient very quickly. New growth in the rate of discovery will require knowledge discovery systems that get more data on users’ needs, and that have access to structured databases of academic production.

The easiest next step would perhaps be to build a collaborative recommendation system. In particular, think of a system that takes your .bib file, trawls a large database of citation lists, for example, JSTOR or PLoS, and produces recommendations for scholarship you may have missed. The logic of a collaborative recommender system is pretty simple: learn from articles that cite scholarship similar to yours. If we have metadata on scholarship (for instance, sub-field, the full text of an article, or even just the abstract), we could also recommend based on the extent to which two articles cite the same kind of scholarly article. Direct elicitations (search terms are but one form) from the user can also be added to guide the recommendations. And meta-characteristics, for instance, the page rank of a piece of scholarship, could be used to order recommendations.
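Here is a minimal sketch of the collaborative step, under some simplifying assumptions that are mine, not part of the proposal above: the citation database has already been reduced to a mapping from article id to the set of works it cites, and the user’s .bib file has been reduced to a set of cited-work ids.

```python
# A toy collaborative recommender: score works cited by articles whose reference
# lists overlap with the user's own references.
from collections import Counter

def recommend(my_refs, citation_db, k=10):
    """my_refs: set of work ids; citation_db: {article_id: set of cited work ids}."""
    scores = Counter()
    for article, refs in citation_db.items():
        overlap = my_refs & refs
        if not overlap:
            continue
        weight = len(overlap) / len(my_refs | refs)   # Jaccard similarity
        for work in refs - my_refs:                   # only suggest what we haven't cited
            scores[work] += weight
    return [work for work, _ in scores.most_common(k)]

citation_db = {
    "article_1": {"smith2001", "jones2005", "lee2010"},
    "article_2": {"smith2001", "kim2012"},
}
print(recommend({"smith2001", "jones2005"}, citation_db))  # ['lee2010', 'kim2012']
```

Metadata (sub-field, abstract text) and meta-characteristics (page rank) would enter as additional terms in the scoring function.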

Companies With Benefits or Potential Welfare Losses of Benefits

27 Sep

Many companies offer employees ‘benefits.’ These include paying for healthcare, investment plans, a company gym, luncheons, etc. (Just ask a Silicon Valley tech employee for the full list.)

But why ‘benefits’? And why not cash?

A company offering a young man a zero-down healthcare plan seems a bit like within-company insurance. Post-Obamacare, it also seems a bit unnecessary. (My reading of Obamacare is that it mandates that companies pay for healthcare but doesn’t mandate how they pay for it. So cash payments ought to be OK?)

For investment, the reasoning strikes me as thinner still. Let people decide what they want to do with their money.

In many ways, benefits look a bit like ‘gifts’—welfare reducing but widespread.

My recommendation: just give people cash. Or give them an option to have cash.

The Technology Giveth and The Technology Taketh

27 Sep

Riker and Ordeshook formalized the voting calculus as:

pb + d > c

where,
p = probability of vote ‘mattering’
b = size of the benefit
d = sense of duty
c = cost of voting

They argued that if pb + d exceeds c, people will vote. Otherwise not.

One can generalize this simple formalization for all political action.

A fair bit of technology has been invented to reduce c — it is easier than ever to follow the news, to contact your representative, etc. However, for a particular set of issues, if you reduce c for everyone, you also reduce p. For as more people get involved, the voice of any single person matters less. (There are still some conditions that I am eliding over. For instance, a reduction in c may matter more for people who are poorer and thus may have an asymmetric impact.)
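To make the offsetting effect concrete, here is a toy calculation. All the numbers, and the assumption that p shrinks in proportion to the number of participants, are mine, purely for illustration.

```python
# Riker-Ordeshook style calculus: the net benefit of acting is p*b + d - c.
def net_benefit(p, b, d, c):
    return p * b + d - c

# Before: 1,000 participants, so p is roughly 1/1,000.
print(net_benefit(p=1 / 1_000, b=100, d=0.05, c=0.10))   # 0.05

# Technology halves c, but participation doubles, halving p as well.
print(net_benefit(p=1 / 2_000, b=100, d=0.05, c=0.05))   # 0.05
```

The net benefit is unchanged: what the lower cost giveth, the lower pivotality taketh away.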

Technologies invented to exploit synergy, however, do not suffer the same issues. Think Wikipedia, etc.

Beyond Anomaly Detection: Supervised Learning from `Bad’ Transactions

20 Sep

Nearly every time you connect to the Internet, multiple servers log a bunch of details about the request. For instance, details about the connecting IPs, the protocol being used, etc. (One popular technology for collecting such data is Cisco’s NetFlow.) Lots of companies analyze these data in an attempt to flag `anomalous’ transactions. Given the data are low quality—IPs do not uniquely map to people, and the information per transaction is vanishingly small—the chances of building a useful anomaly detection algorithm using conventional unsupervised methods are extremely low.

One way to solve the problem is to re-express it as a supervised problem. Google, various security firms, security researchers, etc. flag a bunch of IPs every day for various nefarious activities, including hosting malware (identified via passive DNS) and scanning. Check to see if these IPs are in the database, and learn from the transactions that include the blacklisted IPs. Using the model, flag transactions that look most similar to the transactions with blacklisted IPs. And validate the flagged transactions with the highest predicted probability of involving a malicious IP by checking to see if the IPs are blacklisted at a future date or by consulting a subject matter expert.
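A minimal sketch of the supervised re-framing, with hypothetical file names and feature columns (any real NetFlow export will look different):

```python
# Label flow records by whether their destination IP appears on a blacklist,
# fit a classifier, and surface the unlabeled records that look most "blacklist-like".
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

flows = pd.read_csv("netflow_sample.csv")                  # hypothetical flow export
blacklist = set(open("blacklisted_ips.txt").read().split())

flows["bad_ip"] = flows["dst_ip"].isin(blacklist).astype(int)
features = ["bytes", "packets", "duration", "dst_port"]    # hypothetical features

model = GradientBoostingClassifier()
model.fit(flows[features], flows["bad_ip"])

# Score the transactions that do *not* involve a blacklisted IP and flag the top
# ones for validation (future blacklisting or expert review).
candidates = flows[flows["bad_ip"] == 0].copy()
candidates["score"] = model.predict_proba(candidates[features])[:, 1]
print(candidates.nlargest(20, "score")[["dst_ip", "score"]])
```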

Bad Remedies for Bad Science

3 Sep

Lack of reproducibility is a symptom of science in crisis. An eye-catching symptom to be sure, but hardly the only one vying for attention. Recent analyses suggest that nearly two-thirds of the (relevant set of) articles published in prominent political science journals condition on post-treatment variables (see here). Another analysis suggests that half of the relevant set of articles published in prominent neuroscience journals interpret the difference between a significant and a non-significant result as evidence that the difference between the two is significant (see here). What is behind this? My guess: poor understanding of statistics, poor editorial processes, and poor strategic incentives.

  1. Poor understanding of statistics among authors.
  2. Poor understanding of statistics among editors, reviewers, etc. This creates two problems:
    • Cannot catch inevitable mistakes: Whatever the failings of authors, they aren’t being caught during the review process. (It would be good to know how often reviewers are the source of bad recommendations.)
    • Creates Bad Incentives: If editors are misinformed, say, predisposed to look for significant results, authors will be motivated to deliver just that.
      • If you know what the right thing to do is but also know that there is a premium for doing the wrong thing (see the previous point), you may use a lack of transparency as a way to cater to those bad incentives.
  3. Psychological biases:
    • Motivated Biases: Scientists are likely biased toward their own theories. They wish them to be true. This may lead to motivated skepticism and scrutiny. The same principle likely applies to reviewers, who catch on to the storytelling and give a wider pass to stories that jibe with them.
  4. Production Pressures: Given production pressures, there is likely sloppiness in what is produced. For instance, it is troubling how often retracted articles are cited after the publication of the retraction notice.
  5. Weak Penalties for Being Sloppy: Without easy ways for others to find mistakes, it is easier to be sloppy.

Given these problems, the big solution I can think of is improving training. Another would be programs that highlight some of the psychological biases and drive clarity on the purpose of science. The troubling part is that the most commonly proposed solution is transparency. As Gelman points out, transparency is neither necessary nor sufficient to prevent the “statistical and scientific problems” that underlie “the scientific crisis” because:

  1. Emphasis on transparency would merely mean transparent production of noise (last column on page 38).
  2. Transparency makes it a tad easier to spot errors but doesn’t provide incentives to learn from errors. And a rule of thumb is to fix upstream issues rather than downstream issues.

Gelman also points out the negative externalities of transparency as a be-all fix. When you focus on transparency, secrecy is conflated with dishonesty.

Some Aspects of Online Learning Re-imagined

3 Sep

Trevor Hastie is a superb teacher. He excels at making complex things simple. Till recently, his lectures were only available to those fortunate enough to be at Stanford. But now, he teaches a free introductory course on statistical learning via the Stanford online learning platform. (I browsed through the online lectures — they are superb.)

Online learning, pregnant with rich possibilities, has nevertheless met its cynics and its first failures. The proportion of students who drop off after signing up for these courses is astonishing (but so are the sign-up rates). Some preliminary data suggest that, compared to some face-to-face teaching, students enrolled in online learning don’t perform as well. But these are early days, and much can be done to refine the systems and promote the interests of thousands upon thousands of teachers who have much to contribute to the world. In that spirit, I offer some suggestions.

Currently, Coursera, edX, and similar ventures mostly attempt to replicate an offline course online. The model is reasonable but doesn’t do justice to the unique opportunities of teaching online.

The Online Learning Platform: For Coursera etc. to be true learning platforms, they need to provide rich APIs that allow the development of both learning and teaching tools (and easy ways for developers to monetize these applications). For instance, professors could get access to visualization applications, and students could get access to applications that put them in touch with peers interested in peer-to-peer teaching. The possibilities are endless. Integration with other social networks and other communication tools, at the discretion of students, is an obvious future step.

Learning from Peers: The single teacher model is quaint. It locks out contributions from other teachers. And it locks out teachers from potential revenues. We could turn it into a win-win. Scout ideas and materials from teachers and pay them for their contributions, ideally as a share of the revenue — so that they have a steady income.

The Online Experimentation Platform: So many researchers in education and psychology are curious and willing to contribute to this broad agenda of online learning. But no protocols and platforms have been established to exploit this willingness. The online learning platforms could release more data at one end. And at the other end, they could create a platform where researchers propose ideas that the community and the company can vet. Winning ideas can form the basis of research that is implemented on the platform.

Beyond these ideas, there is room for courses that enhance students’ ability to succeed at learning online. Providing students ‘life skills’, either face-to-face or online, and teaching them ways to become more independent learners, etc., can help them better adapt to the opportunities and challenges of learning online.

Stemming the Propagation of Error

2 Sep

About half of the (relevant set of) articles published in neuroscience mistake the difference between a significant result and an insignificant result as evidence that the two are significantly different (see here). It would be good to see if the articles that make this mistake, for instance, received fewer citations after the publication of the article revealing the problem. If not, we probably have more work to do. We probably need to improve the ways by which scholars are alerted to problems in the articles they are reading (and are interested in citing). And that may include building different interfaces for the various ‘portals’ (Google Scholar, JSTOR, etc., and journal publishers) that scholars heavily use. For instance, creating UIs that thread reproduction attempts, retractions, articles finding serious errors within the original article, etc.
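For readers who want the mistake spelled out, here is a small worked example with made-up numbers: one estimate is ‘significant,’ the other is not, yet the difference between the two is nowhere near significant.

```python
# One study reports 25 (SE 10): z = 2.5, "significant". Another reports 10 (SE 10):
# z = 1.0, "not significant". The correct comparison tests the *difference*.
import math

def z_for_difference(est1, se1, est2, se2):
    return (est1 - est2) / math.sqrt(se1**2 + se2**2)

print(25 / 10, 10 / 10)                       # 2.5 and 1.0
print(z_for_difference(25, 10, 10, 10))       # ~1.06 -- the difference is not significant
```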

Unlisted False Negatives: Are 11% Americans Unlisted?

21 Aug

A recent study by Simon Jackman and Bradley Spahn claims that 11% of Americans are ‘unlisted.’ (The paper has since been picked up by liberal media outlets like Think Progress.)

When I first came across the paper, I thought that the number was much too high for it to have any reasonable chance of being right. My suspicions were roused further by the fact that the paper provided no bounds on the number — no note about measurement error in matching people across imperfect lists. A galling omission when the finding hinges on the name matching procedure, details of which are left to another paper. What makes it to the paper is this incredibly vague line: “ANES collects …. bolstering our confidence in the matches of respondents to the lists.” I take that to mean that the matching procedure was done with the idea of reducing false positives. If so, the estimate is merely an upper bound on the percentage of Americans who could be unlisted. That isn’t a very useful number.

But reality is a bit worse. To my questions about false positive and negative rates, Bradley Spahn responded on Twitter, “I think all of the contentious cases were decided by me. What are my decision-theoretic properties? Hard to say.” That line covers one of the most essential details of the matching procedure, a detail they say the readers can find “in a companion paper.” The primary issue is subjectivity. But the failure to take adequate account of how these ‘decision-theoretic’ properties bear on the results in the paper grates.

That’s Smart! What We Mean by Smartness and What We Should

16 Aug

Many people conceive of intelligence as a CPU. To them, being more intelligent means having a faster CPU. And this is cause for despair, as the clock speed is largely fixed. (Prenatal and childhood nutrition can make a sizable difference, however. For instance, iodine deficiency in children causes intellectual disability.)

But people misconceive intelligence. Intelligence is not just the clock speed of the CPU. It is also the short-term cache and the OS.

Clock speed doesn’t matter much if there isn’t a good-sized cache. The size of short-term memory matters enormously. And the good news is that we can expand it with effort.

A super fast CPU with a large cache is still only as good as the operating system. If people know little or don’t know how to reason well, they generally won’t be smart. Think of the billions of people who came before we knew how to know (science). Some of those people had really fast CPUs. But many of them weren’t able to make much progress on anything.

The chances that someone who doesn’t know that correlation isn’t causation will come across as stupid are also high. In fact, we often mistake being knowledgeable, and possessing rules for how to reason well, for being intelligent. Though it goes without saying, it is good to say it: how much we know and how well we know how to reason are in our control.

Lastly, people despair because they mistake skew for variance. People believe there is a lot of variance in processing capacity. My sense is that the variance in processing capacity is low and the skew is high. In layman’s terms, most people are about as smart as one another, with very few very bright people. None of this is to say that the little variance that exists is not consequential.

Toward a Better OS

Ignorance of relevant facts limits how well we can reason. Thus, increasing the repository of relevant facts at hand will help you reason better. If you don’t know about something, that is OK. Today, the world’s information is at your fingertips. It will take you some time to go through things, but you can become informed about more things than you think possible.

Besides knowledge, there are some ‘frameworks’ of how to approach a problem. For instance, management consultants have something called MECE. This ‘framework’ can help you reason better about a whole slew of problems. There are likely others.

Besides reasoning frameworks, there are simple rules that can help you reason better. You can look up books devoted to common errors in thinking, and use those to derive the rules. The rules can look as follows:

  1. Correlation is not the same as causation
  2. Don’t select on the dependent variable. What I call the ‘7 Habits of Successful People’ rule.
  3. Replace categorical thinking with continuous where possible, and be precise. For example, rather than claim that ‘there is a risk’, quantify the risk. Or replace the word possibility with probability where applicable.
  4. Have a better grasp of your own ignorance using some of the tricks described here.
  5. The tree of inference starts with the question. Think hard about what data you would need to answer the question well. And then what data you have. And then calibrate your assessments about the answer based on the difference between the data you would have liked to have and the data you have.

Getting a Measure of Measures of Selective Exposure

24 Jul

Ideally, we would like to be able to place the ideology of each bit of information consumed in relation to the ideological location of the person. And we would like a time-stamped distribution of the bits consumed. We could then summarize various moments of that distribution (or of the distribution of ideological distances). And that would be that. (If we were worried about dimensionality, we would do it by topic.)

But lack of data means we must change the estimand. We must code each bit of information as merely congenial or uncongenial. This means taking directionality out of the equation. For a Republican at a 6 on a 1 to 7 liberal-to-conservative scale, consuming a bit of information at 5 is the same as consuming a bit at 7.

The conventional estimand then is a set of two ratios: (Bits of politically congenial information consumed)/(All political information consumed) and (Bits of uncongenial information consumed)/(All political information consumed). Other reasonable formalizations exist, including the difference between congenial and uncongenial. (Note that the denominator is absent there, and reasonably so.)

To estimate these quantities, we must often make further assumptions. First, we must decide on the domain of political information. That domain is likely vast and increasing by the minute. We are all producers of political information now. (We always were but today we can easily access political opinions of thousands of lay people.) But see here for some thoughts on how to come up with the relevant domain of political information from passive browsing data.

Next, people often code ideology at the level of ‘source.’ The New York Times is ‘independent’ or ‘liberal’ and ‘Fox’ simply ‘conservative’ or perhaps more accurately ‘Republican-leaning.’ (Continuous measures of ideology — as estimated by Groseclose and Milyo or Gentzkow and Shapiro — are also assigned at the source level.) This is fine except that it means coding all bits of information consumed from a source as the same. And there are some attendant risks. We know that not all NYT articles are ‘liberal.’ In fact, we know much of it is not even political news. A toy example of how such measures can mislead:

Page views: 10 from Fox, 10 from CNN. Source-level estimate of the congenial share (for a Republican): 10/20.
But say the Fox pages split 7 Republican-leaning, 3 Democrat-leaning, and the CNN pages split 5 and 5.
Page-level estimate: (7 + 5)/20 = 12/20.
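The same toy example in code, with made-up page-level labels standing in for what one would get from classifying individual URLs:

```python
# Congenial share for a Republican, computed two ways: coding every page by its
# source's lean versus coding every page by its own lean.
def share_source_level(page_views, source_lean, own_lean="R"):
    congenial = sum(n for src, n in page_views.items() if source_lean[src] == own_lean)
    return congenial / sum(page_views.values())

def share_page_level(page_labels, own_lean="R"):
    flat = [lean for labels in page_labels.values() for lean in labels]
    return sum(lean == own_lean for lean in flat) / len(flat)

page_views = {"fox": 10, "cnn": 10}
source_lean = {"fox": "R", "cnn": "D"}
page_labels = {"fox": ["R"] * 7 + ["D"] * 3, "cnn": ["R"] * 5 + ["D"] * 5}

print(share_source_level(page_views, source_lean))  # 0.5
print(share_page_level(page_labels))                # 0.6
```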

If the measure of ideology is continuous, there are still some risks. If we code all page views as the mean ideology of the source, we assume that the person views a random sample of pages on the source. (Or some version of that.) But that is an implausible assumption. It is much more likely that a liberal reading the NYT stays away from David Brooks’ columns. If you account for such within-source self-selection, selective exposure measures based on source-level coding are going to be downwardly biased — that is, they will find people to be less selective than they are.

The discussion until now has focused on passive browsing data, eliding survey measures. There are two additional problems with survey measures. One is about the denominator. Measures based on limited-choice experiments like the ones used by Iyengar and Hahn 2009 are bad measures of real-life behavior. In real life, we just have far more choices. And inferences from such experiments can at best recover ordinal rankings. The second big problem with survey measures is ‘expressive responding’: Republicans indicating they watch Fox News not because they do but because they want to convey that they do.

Reviewing the Peer Review

24 Jul

Update: Current version is posted here.

Science is a process. And for a good deal of time, peer review has been an essential part of that process. Looked at independently, by people with no experience of it, it makes a fair bit of sense. For there is only one well-known way of increasing the quality of an academic paper — additional independent thinking. And who better to provide it than engaged, trained colleagues?

But this seemingly sound part of the process is creaking. Today, you can’t bring two academics together without them venting their frustration about the broken review system. The plaint is that the current system is a lose-lose-lose. All the parties — the authors, the editors, and the reviewers — lose lots and lots of time. And the change in quality as a result of suggested changes is variable, generally small, and sometimes negative. Given how critical peer review is to scientific production, it deserves closer attention, preferably with good data.

But data on peer review aren’t available to be analyzed. Thus, some anecdotal data. Of the 80 or so reviews that I have filed and for which editors have been kind enough to share comments by other reviewers, two things have jumped out at me: a) hefty variation in the quality of reviews, and b) equally hefty variation in recommendations for the final disposition. It would be good to quantify the two. The latter is easy enough to quantify.

The reliability of the review process has implications for how many reviewers we need for the same article to be reliably accepted or rejected. Counter-intuitively, increasing the number of reviewers per manuscript may not increase the overall burden of reviewing. Partly because everyone knows that the review process is so noisy, there is an incentive to submit articles that people know aren’t good enough. Some submitters likely reason that there is a reasonable chance of a `low quality’ article being accepted at top places. Thus, low-reliability peer review systems may actually increase the number of submissions. A greater number of submissions, in turn, increases editors’ and reviewers’ load and hence reduces the quality of reviews, and lowers the reliability of recommendations still further. It is a vicious cycle. And the answer may be as simple as making the peer review process more reliable. At any rate, these data ought to be publicly released. Side by side, editors should consider experimenting with the number of reviewers to collect more data on the point.
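Quantifying the reliability of recommendations is straightforward once the data exist. A minimal sketch, assuming each manuscript gets the same number of reviewers and recommendations come from a fixed menu (the data below are made up):

```python
# Fleiss' kappa for agreement among reviewers' recommendations.
from collections import Counter

def fleiss_kappa(ratings, categories):
    """ratings: one list of recommendations per manuscript, all of equal length."""
    n = len(ratings[0])                                   # reviewers per manuscript
    counts = [Counter(r) for r in ratings]
    total = sum(sum(c.values()) for c in counts)
    p_j = {cat: sum(c[cat] for c in counts) / total for cat in categories}
    p_e = sum(p ** 2 for p in p_j.values())               # chance agreement
    p_i = [(sum(v ** 2 for v in c.values()) - n) / (n * (n - 1)) for c in counts]
    p_bar = sum(p_i) / len(p_i)                           # observed agreement
    return (p_bar - p_e) / (1 - p_e)

recs = [
    ["reject", "reject", "major"],
    ["accept", "major", "reject"],
    ["minor", "minor", "major"],
    ["reject", "accept", "major"],
]
print(fleiss_kappa(recs, ["reject", "major", "minor", "accept"]))  # ~ -0.15, no better than chance
```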

Quantifying the quality of reviews is a much harder problem. What do we mean by a good review? A review that points to important problems in the manuscript and, where possible, suggests solutions? Likely so. But this is much trickier to code. But perhaps there isn’t as much of a point to quantifying this. What is needed perhaps is guidance. Much like child-rearing, there is no manual for reviewing. There really should be. What should reviewers attend to? What are they missing? And most critically, how do we incentivize this process?

When thinking about incentives, there are three parties whose incentives we need to restructure — the author, the editor, and the reviewer. Authors’ incentives can be restructured by making the process less noisy, as we discuss above. And by making submissions costly. All editors know this: electronic submissions have greatly increased the number of submissions. (It would be useful to study what the consequence of the move to electronic submissions has been for the quality of articles.) As for the editors — if the editors are not blinded to the author (and the author knows this), they are likely to factor in the author’s status in choosing the reviewers, in whether or not to defer to the reviewers’ recommendations, and in making the final call. Thus we need triple-blinded pipelines.

Whether or not the reviewer’s identity is known to the editor when s/he is reading the reviewer’s comments also likely affects the reviewer’s contributions — in both good and bad ways. For instance, there is every chance that junior scholars, in trying to impress editors, file more negative reviews than they would if they knew that the editor had no way of tying the identity of the reviewer to the review. Beyond altering anonymity, one way to incentivize reviewers would be to publish the reviews publicly, perhaps as part of the paper. Just like online appendices, we can have a set of reviews published online with each article.

With that, some concrete suggestions beyond the ones already discussed. Expectedly — given they come from a quantitative social scientist — they fall into two broad brackets: releasing and learning from the data already available, and collecting more data.

Existing Data

A fair bit of data can be potentially released without violating anonymity. For instance,

  • Whether manuscript was desk rejected or not
  • How many reviewers were invited
  • Time taken by each reviewer to accept (NA for those from whom you never heard)
  • Total time in review for each article (till R and R or reject) (And a separate set of columns for each revision)
  • Time taken by each reviewer
  • Recommendation by each reviewer
  • Length of each review
  • How many reviewers did the author(s) suggest?
  • How often were suggested reviewers followed up on?

In fact, much of the data submitted in multiple-choice question format can probably be released easily. If editors are hesitant, a group of scholars can come together and we can crowdsource collection of review data. People can deposit their reviews and the associated manuscript in a specific format to a server. And to maintain confidentiality, we can sandbox these data allowing scholars to run a variety of pre-screened scripts on it. Or else journals can institute similar mechanisms.

Collecting More Data

  • In economics, people have tried instituting shorter deadlines for reviewers, to the effect of reducing review times. We can try that out.
  • In terms of incentives, it may be a good idea to try out cash, but also perhaps to experiment with a system where reviewers are told that their comments will be made public. I, for one, think it would lead to more responsible reviewing. It would also be good to experiment with triple-blind reviewing.

If you have additional thoughts on the issue, please propose them at: https://gist.github.com/soodoku/b20e6d31d21e83ed5e39

Here’s to making advances in the production of science and our pursuit of truth.

Where’s the Porn? Classifying Porn Domains Using a Calibrated Keyword Classifier

23 Jul

Aim: Given a very large list of unique domains, find domains carrying adult content.

In the 2004 comScore browsing data, for instance, there are about a million unique domains. Comparing a million unique domain names against a large database is doable. But access to such databases doesn’t often come cheap. So a hack.

Start with an exhaustive keyword search containing porn-related keywords. Here’s mine:

breast, boy, hardcore, 18, queen, blowjob, movie, video, love, play, fun, hot, gal, pee, 69, naked, teen, girl, cam, sex, pussy, dildo, adult, porn, mature, sex, xxx, bbw, slut, whore, tit, pussy, sperm, gay, men, cheat, ass, booty, ebony, asian, brazilian, fuck, cock, cunt, lesbian, male, boob, cum, naughty

For the 2004 comScore data, this gives about 140k potential porn domains. Compare this list to the approximately 850k porn domains in the Shallalist. This leaves us with a list of 68k domains with uncertain status. Use one of the many URL classification APIs. Using Trusted Source API, I get about 20k porn and 48k non-porn.

This gives us the lower bound of adult domains. But perhaps much too low.

To estimate the error rates, take a large random sample (say 10,000 unique domains) and run all of them through the API. Comparing the API’s classifications to the keyword-based pipeline gives you estimates of the false positive and false negative rates. You can then learn from the list of false negatives to improve your keyword search. And redo everything. A couple of iterations can produce a sufficiently low false negative rate (the false positive rate is always ~0). (For the 2004 comScore data, a false negative rate of 5% is easily achieved.)
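A minimal sketch of the pipeline, with an abbreviated keyword list and a hypothetical classify_with_api helper standing in for a service like Trusted Source:

```python
# Keyword pre-filter -> known-list lookup -> API for the rest; plus a sampling
# step to estimate how many porn domains the keywords miss (false negatives).
import random
import re

KEYWORDS = ["porn", "xxx", "sex", "adult", "hardcore"]     # abbreviated version of the list above
KEYWORD_RE = re.compile("|".join(KEYWORDS))

def classify_with_api(domain):
    """Hypothetical wrapper around a URL-classification API; returns True if porn."""
    raise NotImplementedError

def find_porn_domains(domains, known_porn_domains):
    candidates = {d for d in domains if KEYWORD_RE.search(d)}
    confirmed = candidates & known_porn_domains             # e.g., matches in the Shallalist
    uncertain = candidates - known_porn_domains
    confirmed |= {d for d in uncertain if classify_with_api(d)}
    return confirmed

def estimate_false_negative_rate(domains, flagged, sample_size=10_000):
    sample = random.sample(sorted(domains), min(sample_size, len(domains)))
    truly_porn = {d for d in sample if classify_with_api(d)}
    missed = truly_porn - flagged                            # porn domains the keywords missed
    return len(missed) / max(len(truly_porn), 1)
```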

See PyDomains.

Where’s the news?: Classifying News Domains

23 Jul

We select an initial universe of news outlets (i.e., web domains) via the Open Directory Project (ODP, dmoz.org), a collective of tens of thousands of editors who hand-label websites into a classification hierarchy. This gives 7,923 distinct domains labeled as: news, politics/news, politics/media, and regional/news. Since the vast majority of these news sites receive relatively little traffic, to simplify our analysis we restrict to the one hundred domains that attracted the largest number of unique visitors from our sample of toolbar users. This list of popular news sites includes every major national news source, well-known blogs and many regional dailies, and collectively accounts for over 98% of page views of news sites in the full ODP list (as estimated via our toolbar sample). The complete list of 100 domains is given in the Appendix.

From Filter Bubbles, Echo Chambers, and Online News Consumption by Flaxman, Goel and Rao.

When using rich browsing data, scholars often rely on ad hoc lists of domains to estimate consumption of certain kinds of media. Using these lists to estimate consumption raises three obvious concerns: 1) even sites classified as ‘news sites,’ such as the NYT, carry a fair bit of non-news; 2) (speaking categorically) there is the danger of ‘false positives’; and 3) (speaking categorically again) there is the danger of ‘false negatives.’

FGR address the first concern by exploiting the URL structure. They exploit the fact that the URL of a NY Times story contains information about the section. (The classifier is assumed to be perfect. But it likely isn’t. False positive and negative rates for this kind of classification can be estimated using raw article data.) This leaves us with concerns about false positives and negatives at the domain level. Lists like those published by DMOZ appear to be curated well enough not to contain too many false positives. The real question is how to calibrate the false negatives. Here’s one procedure. Take a large random sample of the browsing data (at least 10,000 unique domain names). Compare it to a large, comprehensive database like the Shallalist. For the domains that aren’t in the database, query a URL classification service such as Trusted Source. (The initial step of comparing against the Shallalist is to reduce the amount of querying.) Using the results, estimate the proportion of missing domain names (the net number of missing domain names is likely much, much larger). Also estimate the missed visitation time, page views, etc.
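A minimal sketch of that calibration step. The lookup_shallalist and classify_with_api helpers are hypothetical stand-ins for the Shallalist lookup and a classification service such as Trusted Source:

```python
# Estimate the share of sampled unlisted domains that are actually news, and the
# page views they account for.
import random

def lookup_shallalist(domain):
    """Hypothetical: return the Shallalist category for a domain, or None."""
    raise NotImplementedError

def classify_with_api(domain):
    """Hypothetical: return the category assigned by a URL-classification service."""
    raise NotImplementedError

def calibrate_false_negatives(browsing_log, news_list, sample_size=10_000):
    """browsing_log: list of (domain, page_views); news_list: the ad hoc list of news domains."""
    unlisted = sorted({d for d, _ in browsing_log if d not in news_list})
    sample = random.sample(unlisted, min(sample_size, len(unlisted)))
    categories = {d: lookup_shallalist(d) or classify_with_api(d) for d in sample}
    missed = {d for d, cat in categories.items() if cat == "news"}
    missed_share = len(missed) / len(sample)
    missed_views = sum(v for d, v in browsing_log if d in missed)
    return missed_share, missed_views
```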

Toward a Two-Sided Market on Uber

17 Jun

Uber prices rides based on the availability of Uber drivers in the area, the demand, and the destination. This price is the same for whoever is on the other end of the transaction. For instance, Uber doesn’t take into account that someone in a hurry may be willing to pay more for a quicker service. By the same token, Uber doesn’t allow drivers to distinguish themselves on price (Airbnb, for instance, allows this). It leads to a simpler interface but it produces some inefficiency—some needs go unmet, some drivers go underutilized, etc. It may make sense to try out a system that allows for bidding on both sides.

Rape in India

16 Jun

According to crime reports, rape in India is about 15 times less common than in the US. A variety of concerns apply. For one, the definition of rape varies considerably. But the differences aren’t always in the expected direction. For instance, the Indian Penal Code considers sex under the following circumstance as rape: “With her consent, when the man knows that he is not her husband, and that her consent is given because she believes that he is another man to whom she is or believes herself to be lawfully married.” In 2013, however, the definition of rape under the Indian Penal Code was updated to one generally on par with international definitions, with two major exceptions: the above clause, and, more materially, the continued exclusion of marital rape. For two, there are genuine fears about rape being yet more severely underreported in India.

Evidence from Surveys
Given that rape is underreported, anonymous surveys are better indicators of the prevalence of rape. In the US, comparing CDC National Intimate Partner and Sexual Violence Survey data to FBI crime reports suggests that only about 6.6% of rapes are reported. (Though see also the comparison to the National Crime Victimization Survey (NCVS), which suggests that one of every three rapes is reported.) In India, according to the Lancet (citing numbers from the American Journal of Epidemiology (Web Table 4)), “1% of victims of sexual violence report the crime to the police.” Another article, based on data from the 2005 National Family Health Survey (NFHS), finds that the corresponding figure for India is .7%. (One odd thing about the article that caught my eye: the proportion of marital rape reported ought to be zero given the Indian Penal Code doesn’t recognize marital rape. But perhaps reports can still be made.) This pegs the rate of rape in India at either about 60% of the rate in the US (if CDC numbers are more commensurable to NFHS numbers) or at about three times as much (if NCVS numbers are more commensurable to NFHS numbers).
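The back-of-the-envelope adjustment behind those last two figures, using the reporting rates cited above:

```python
# Adjust the reported-rate ratio by the two countries' reporting rates.
reported_rate_ratio = 1 / 15        # India's reported rape rate relative to the US
india_reporting = 0.007             # ~0.7% of rapes reported (NFHS-based estimate)

for us_reporting in (0.066, 1 / 3): # CDC-based vs. NCVS-based US reporting rates
    true_rate_ratio = reported_rate_ratio * (us_reporting / india_reporting)
    print(round(true_rate_ratio, 2))
# ~0.63 (about 60% of the US rate) and ~3.17 (about three times the US rate)
```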

The article based on NFHS data also estimates that nearly 98% of rapes are committed by husbands. This compares to 26% in the US according to the NCVS. So one startling finding is that the risk of rape for unmarried women in India is remarkably low. Or the chances of being raped as a married woman, astonishingly high. The data also show that rapes by husbands in India are especially unlikely to be reported. The low risk of rape for unmarried women may be a consequence of something equally abhorrent — out of fear of sexual harassment, or because of patriarchal attitudes at home, Indian women may be much less likely to be outdoors than men. (It would be good to quantify this.)

p.s. From a comparative perspective, it is important to keep in mind that in the West, especially the US, most commuting to work is by car, and one would want to adjust for time walking on the street to derive commensurate numbers for some quantities.

Partisan Retrospection?: Partisan Gaps in Retrospection are Highly Variable

11 Jun

The difference between partisans’ responses on retrospection items is highly variable, ranging from over 40% to nearly 0. For instance, in 1988 nearly 30% fewer Democrats than Republicans reported that the inflation rate between 1980 and 1988 had declined. (It had.) However, similar proportions of Republicans and Democrats got questions about changes in the size of the budget deficit and defense spending between 1980 and 1988 right. The mean partisan gap across 20 items asked in the NES over 5 years (1988, 1992, 2000, 2004, and 2008) was about 15 points (the median was about 12 points), and the standard deviation was about 13 points. (See the tables.) This much variation suggests that the observed bias in partisans’ perceptions depends on a variety of conditioning variables. For one, there is some evidence to suggest that during severe recessions, partisans do not differ much in their assessment of economic conditions (see here). Even when there are partisan gaps, however, they may not be real (see paper (pdf)).

Enabling FOIA: Getting Better Access to Government Data at a Lower Cost

28 May

The Freedom of Information Act is vital for holding the government to account. But little effort has gone into building tools that enable the fulfillment of FOIA requests. As a consequence of this underinvestment, fulfilling FOIA requests often means onerous, costly work for government agencies and long delays for the requesters. Here, I discuss a couple of alternatives.

One tool that can prove particularly effective in lowering the cost of fulfilling FOIA requests is an anonymizer—a tool that detects proper nouns, addresses, telephone numbers, etc. and blurs them. This is easily achieved using modern machine learning methods. To ensure 100% accuracy, humans can quickly vet the algorithm’s suggestions along with ‘suspect words.’ Or a CAPTCHA-like system that asks people in developing countries to label suspect words as nouns, etc. can be created to further reduce costs. This covers one conventional solution.

Another way to solve the problem would be to create a sandbox architecture. Rather than give requesters stacks of redacted documents, often unnecessary, one can allow people the right to run certain queries on the data. The results of these queries can be vetted internally. A classic example would be to allow people to query government emails for the total number of times porn domains are accessed via servers at government offices.
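A minimal sketch of the anonymizer idea, using spaCy’s named-entity recognizer plus a simple regex for phone numbers. The human vetting of the output and of ‘suspect words’ is omitted here, and the entity labels chosen are my own guess at what would need blurring:

```python
# Detect names, organizations, places, and phone numbers, and blur them.
import re
import spacy

nlp = spacy.load("en_core_web_sm")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
REDACT_LABELS = {"PERSON", "ORG", "GPE", "LOC", "FAC"}

def redact(text):
    doc = nlp(text)
    spans = [(e.start_char, e.end_char) for e in doc.ents if e.label_ in REDACT_LABELS]
    spans += [m.span() for m in PHONE_RE.finditer(text)]
    for start, end in sorted(spans, reverse=True):   # replace from the end to keep offsets valid
        text = text[:start] + "[REDACTED]" + text[end:]
    return text

print(redact("Please contact Jane Doe at 202-555-0168 about the Denver office."))
```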

Optimal Cost Function When the Cost of Misclassification is Higher for the Customer than for the Business

15 Apr

Consider a bank making decisions about loans. For the bank, making lending decisions optimally means minimizing the sum of prediction errors*(cost of errors) and the cost of making predictions (keeping things simple here). The cost of any one particular error — especially the denial of a loan to an eligible applicant — is typically small for the bank, but very consequential for the applicant. So the applicant may be willing to pay the bank money to increase the accuracy of its decisions. Say, willing to compensate the bank for the cost of getting a person to take a closer look at the file. If customers are willing to pay the cost, accuracy rates can increase without reducing profits. (Under some circumstances, a bank may well be able to increase profits.) Customers’ willingness to pay for increased accuracy is typically not exploited by lending institutions. It may well be worth exploring.
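A toy version of the argument, with made-up numbers: a closer human review is not worth it to the bank on its own, but becomes worth it once the applicant covers the review cost (which, for the applicant, is cheap relative to the cost of a wrongful denial):

```python
# Bank's expected cost per applicant: error cost plus review cost, minus any customer payment.
def bank_net_cost(error_rate, cost_per_error, review_cost, customer_pays=0.0):
    return error_rate * cost_per_error + review_cost - customer_pays

base = bank_net_cost(error_rate=0.05, cost_per_error=200, review_cost=0)
with_review = bank_net_cost(error_rate=0.02, cost_per_error=200, review_cost=15)
subsidized = bank_net_cost(error_rate=0.02, cost_per_error=200, review_cost=15, customer_pays=15)

print(base, with_review, subsidized)   # 10.0, 19.0, 4.0
```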

The Human and the Machine: Semi-automated approaches to ML

12 Apr

For a class of problems, a combination of algorithms and human input makes for the best solution. For instance, three years ago, the software to recreate shredded documents that won the DARPA award used “human[s] [to] verify what the computer was recommending.” The same insight is used in character recognition tasks. I have used it to create software for matching dirty data — the software was used to merge shape files with electoral returns at the precinct level.

The class of problems for which human input proves useful has one essential attribute — humans produce unbiased, if error-prone, estimates for these problems. So, for instance, it would be unwise to use humans to make the ‘last mile’ of lending decisions (see also this NYT article). (And that is something you may want to verify with training data.)
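A minimal sketch of the pattern, assuming an sklearn-style classifier and a hypothetical ask_human routine: accept the model’s high-confidence predictions and route the uncertain cases to a person:

```python
# Semi-automated labeling: the machine handles easy cases, a human handles uncertain ones.
def ask_human(record):
    """Hypothetical: show the record to a person and return their label."""
    raise NotImplementedError

def semi_automated_labels(model, records, threshold=0.9):
    labels = []
    for record in records:
        probs = model.predict_proba([record["features"]])[0]
        if probs.max() >= threshold:
            labels.append(model.predict([record["features"]])[0])
        else:
            labels.append(ask_human(record))   # the human verifies what the computer recommends
    return labels
```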