Big Data Algorithms: Too Complicated to Communicate?

11 Apr

“A decision is made about you, and you have no idea why it was done,” said Rajeev Date, an investor in data-science lenders and a former deputy director of the Consumer Financial Protection Bureau.

From NYT: If Algorithms Know All, How Much Should Humans Help?

The assertion that there is no intuition behind decisions made by algorithms strikes me as silly. So does the related assertion that such intuition cannot be communicated effectively. We can back out the logic for most algorithms. Heuristic accounts of the logic (e.g., which variables were important) can be given yet more easily. For instance, for seemingly hard-to-interpret methods such as ensemble methods, intuition for which variables are important can be had in the same way as for methods like bagging. And even when specific points are hard to convey, the meta-logic of the system can be explained to the end user.

What is true, however, is that it isn’t being done. For instance, WSJ covering Orion routing system at UPS reports:

“For example, some drivers don’t understand why it makes sense to deliver a package in one neighborhood in the morning, and come back to the same area later in the day for another delivery. …One driver, who declined to speak for attribution, said he has been on Orion since mid-2014 and dislikes it, because it strikes him as illogical.”

WSJ: At UPS, the Algorithm Is the Driver

Communication architecture is an essential part of all human-focused systems. What to communicate, and when, are important questions that deserve careful thought. The default cannot be no communication.

The lack of systems that communicate intuition behind algorithms strikes me as a great opportunity. HCI people — make some money.

Estimating Hillary’s Missing Emails

11 Apr

Note:

55000/(365*4) ~ 37.7. That seems a touch low for Sec. of state.

Caveats:
1. Clinton may have used more than one private server
2. Clinton may have sent emails from other servers to unofficial accounts of other state department employees

Lower bound for missing emails from Clinton:

  1. Take a small weighted random sample (weighting seniority more) of top state department employees.
  2. Go through their email accounts on the state dep. server and count # of emails from Clinton to their state dep. addresses.
  3. Compare it to # of emails to these employees from the Clinton cache.
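The comparison in step 3 can be sketched as follows. This is a minimal sketch with made-up counts; the function name and the employee labels are hypothetical. It only bounds the shortfall among the sampled employees and does not attempt to scale up to the full department.

```python
def missing_lower_bound(server_counts, cache_counts):
    """Emails from Clinton found on the State server but absent from the cache,
    summed over sampled employees. A negative gap for an employee is floored at 0."""
    shortfall = 0
    for employee, on_server in server_counts.items():
        in_cache = cache_counts.get(employee, 0)
        shortfall += max(0, on_server - in_cache)
    return shortfall


# Hypothetical counts for three sampled employees (illustrative numbers only).
server_counts = {"employee_a": 120, "employee_b": 45, "employee_c": 10}
cache_counts = {"employee_a": 100, "employee_b": 45, "employee_c": 12}

print(missing_lower_bound(server_counts, cache_counts))
```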

To propose amendments, go to the Github gist

Some Hard Feelings: Feelings Towards Some Racial and Ethnic Groups in 4 Countries

8 Aug

According to YouGov surveys in Switzerland, Netherlands and Canada, and the 2008 ANES in the US, Whites, on average, in each of the four countries feel fairly coldly — giving an average thermometer rating of less than 50 on a 0 to 100 scale — toward Muslims, and people from Muslim-majority regions (Feelings towards different ethnic, racial, and religious groups). However, in Europe, Whites’ feelings toward Romanians, Poles, and Serbs and Kosovars are scarcely any warmer, and sometimes cooler. Meanwhile, Whites feel relatively warmly towards East Asians.

Liberal Politicians are Referred to More Often in News

8 Jul

The median Democrat referred to in television news is to the left of the House Democratic Median, and the median Republican politician referred to is to the left of the House Republican Median.

Click here for the aggregate distribution.

And here’s a plot of the top 50 politicians cited in news. The plot shows a strongly right-skewed distribution with a bias towards executives.

Data:
News data: UCLA Television News Archive, which includes closed-caption transcripts of all national, cable and local (Los Angeles) news from 2006 to early 2013. In all, there are 155,814 transcripts of news shows.

Politician data: Database on Ideology, Money in Politics, and Elections (see Bonica 2012).

Note:
Taking out data from local news channels or removing Obama does little to change the pattern in the aggregate distribution.

(No) Value Added Models

6 Jul

This note is in response to some of the points raised in the Angoff Lecture by Ed Haertel.

The lecture makes two big points:
1) Teacher effectiveness ratings based on current Value Added Models are ‘unreliable.’ They are actually much worse than just unreliable; see below.
2) Simulated counterfactuals of gains that can be got from ‘firing bad teachers’ are upwardly biased.

Three simple tricks (one discussed; two not) that may solve some of the issues:
1) Estimating teaching effectiveness: Where possible, random assignment of children to classes. I would only do within school comparisons. Inference will still not be clean (SUTVA violations, though they can be dealt with). Simply cleaner.

2) Experiment with teachers. Teach some teachers some skills. Estimate the impact. Rather than teacher level VAM, do a skill level VAM. Teachers = sum of skills + idiosyncratic variation.

3) For current VAMs: To create better student level counterfactuals, use modern ML techniques (SVMs, neural networks, etc.) and lots of data (past student outcomes, past classmate outcomes, etc.), and cross-validate to tune. Have a good idea about how good the prediction is. The strategy may be applicable to other venues.

Other points:
1) Haertel says, “Obviously, teachers matter enormously. A classroom full of students with no teacher would probably not learn much — at least not much of the prescribed curriculum.” A better comparison perhaps would be to self-guided technology. My sense is that as technology evolves, teachers will come up short in a comparison between teachers and advanced learning tools. In most of the third world, I think it is already true.

2) It appears no model for calculating teacher effectiveness scores yields identified estimates. And it appears we have no clear understanding of the nature of bias. Pooling biased estimates over multiple years doesn’t recommend itself to me as a natural fix to this situation. And I don’t think calling this situation ‘unreliability’ of scores is right. These scores aren’t valid. The fact that pooling across years ‘works’ may suggest issues are smaller. But then again, bad things may be happening to some kinds of teachers, especially if people are doing cross-school comparisons.

3) Fade-out concern is important given the earlier 5*5 = 25 analysis. My suspicion would be that attenuation of effects varies depending on the timing of the shock. My hunch would be that shocks at an earlier age matter more – they decay more slowly.

(No) Missing daughters of Indian Politicians

29 Jun

Indian politicians get a bad rap. They are thought to be corrupt, inept, and sexist. Here we check whether there is prima facie evidence for sex-selective abortion.

According to data on the Indian Government ‘Archive’, 15th Lok Sabha members (csv) had, in all, 696 sons and 666 daughters for a sex ratio of 957 females to 1000 males. Progeny of members from states with the most skewed sex ratios (Punjab, Haryana, and Jammu and Kashmir) had a surprisingly healthy sex ratio of 1245 females to 1000 males. Sex ratios of children of BJP and INC members were 930/1000 and 965/1000 respectively. Rajya Sabha members (csv) had 271 sons and 272 daughters for a sex ratio of 1004 females to 1000 males. Not only was there little evidence of sex-selective abortion, data also suggest that fertility rates were modest. Lok Sabha members had on average 2.5 kids while members of Rajya Sabha had on average 2.2 kids.
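For anyone who wants to recompute the ratios, here is a minimal sketch using the counts of sons and daughters reported above. The function name is mine, not from the repository.

```python
def females_per_1000_males(sons, daughters):
    """Sex ratio expressed as females per 1000 males."""
    return 1000 * daughters / sons


# Counts as reported in the post.
lok_sabha = females_per_1000_males(sons=696, daughters=666)
rajya_sabha = females_per_1000_males(sons=271, daughters=272)

print(f"Lok Sabha: {lok_sabha:.1f}, Rajya Sabha: {rajya_sabha:.1f}")
```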

Github repository.

p.s. In 2023, I redid the analysis with new official data from 12–17th LS. Results here.

Impact of selection bias in experiments where people treat each other

20 Jun

Selection biases in the participant pool generally have limited impact on inference. One way to estimate population treatment effect from effects estimated using biased samples is to check if treatment effect varies by ‘kinds of people’, and then weight the treatment effect to population marginals. So far so good.
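A minimal sketch of the reweighting, assuming we already have treatment effects estimated within each ‘kind of people’. The groups, effects, and shares below are hypothetical.

```python
def weighted_effect(effects_by_group, shares):
    """Average of group-level treatment effects, weighted by group shares."""
    total_share = sum(shares.values())
    return sum(effects_by_group[g] * s for g, s in shares.items()) / total_share


# Hypothetical group-level effects and group shares.
effects_by_group = {"college": 0.10, "no_college": 0.30}
sample_shares = {"college": 0.70, "no_college": 0.30}      # biased participant pool
population_shares = {"college": 0.35, "no_college": 0.65}  # population marginals

sample_ate = weighted_effect(effects_by_group, sample_shares)
population_ate = weighted_effect(effects_by_group, population_shares)
print(round(sample_ate, 3), round(population_ate, 3))
```

The two numbers differ only because the effect varies across groups; if it did not, the biased pool would not matter for this estimate.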

When people treat each other, selection biases in the participant pool change the nature of the treatment. For instance, in a Deliberative Poll, a portion of the treatment is other people. Naturally then, the exact treatment depends on the pool of people. Biases in the initial pool of participants mean treatment is different. For inference, one may exploit across-group variation in composition.

A Quick Scan: From Paper to Digital

28 May

Additional Note: See this presentation on paper to digital (pdf).

There are two ways of converting paper to digital data: ‘human OCR and input,’ and machine-assisted. Here are some useful pointers about the latter.

Scanning

Since the success of so much of what comes after depends on the quality of the scanned document, invest a fair bit of effort in obtaining high-quality scans. If the paper input is in the form of a book, and if the book is bound, and especially if it is thick, it is hard to copy text close to the spine without destroying the spine. If the book can be bought at a reasonable price, do exactly that – destroy the spine – cut the book and then scan. Many automated scanning services will cut the spine by default. Other things to keep in mind: scan at a reasonably high resolution (since storage is cheap, go for at least 600 dpi), and if choosing PDF as an output option, see if the scan can be saved as a “searchable PDF.” (So scanning + OCR in one go.)

An average person working without too much distraction can scan 60-100 images per hour. If two pages of the book can be scanned at once, this means 120-200 pages can be scanned in an hour. Assuming you can hire someone to scan for $10/hr, it comes to 12-20 pages per dollar, which translates to 5 to 8 cents per page. However, scanning manually is boring, and people who are punctilious about quality page after page are few and far between. Relying on automated scanning companies may be better. And cheaper. 1dollarscan.com charges 2 cents per page with OCR. But there is no free lunch. Most automated scanning services cut the spines of the book, and many places don’t send back the paper copy for reasons to do with copyright. So you may have to factor in the cost of the book. And typically, the scanning services do all or nothing. You can’t give directions to scan the first 10 pages, followed by middle 120, etc. Thus per relevant page costs may exceed those of manual scanning.
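The arithmetic above can be sketched as follows. The wage, scanning rates, and 2 cents per page come from the text; the 400-page book and its $15 price are hypothetical numbers chosen only to illustrate the ‘per relevant page’ point.

```python
def manual_cost_per_page(hourly_wage, images_per_hour, pages_per_image=2):
    """Cost per page when a person scans, with two pages captured per image."""
    return hourly_wage / (images_per_hour * pages_per_image)


def service_cost_per_relevant_page(total_pages, relevant_pages,
                                   price_per_page, book_price=0.0):
    """Cost per page you actually need when the service scans the whole book
    and the paper copy is not returned."""
    return (total_pages * price_per_page + book_price) / relevant_pages


print(round(manual_cost_per_page(10, 60), 3))    # slow scanner
print(round(manual_cost_per_page(10, 100), 3))   # fast scanner

# Hypothetical book: 400 pages, of which 130 are needed, bought for $15.
print(round(service_cost_per_relevant_page(400, 130, 0.02, 15.0), 3))
```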

Scan to Text

If the text is clear and laid out simply, most commonly available OCR software will do just fine. Acrobat Professional’s own facility for recognizing text in images, found under ‘tools,’ is reasonable. Acrobat also notes words that it is unsure about – it calls them ‘suspects’; you can click through the line up of ‘suspects,’ correcting them as needed. The interface for making corrections is suboptimal, but it is likely to prove satisfactory for small documents with few errors.

Those without ready access to Acrobat Professional can extract text from ‘searchable PDF’ using xpdf or any of the popular programming languages. In Python, pyPdf (see script) or pdfminer (or other libraries) are popular. If the document is a set of images, one can use libraries based on Tesseract (see script). PDF documents need to be converted to images using Ghostscript or similar such rasterization software before being fed to Tesseract.

But what if the quality of scans is poor or the page layout complicated? First, try enhancing images – fixing orientation, using filters to enhance readability, etc. This process can be automated if the images are distorted in similar ways. Second, try extracting bounding boxes for columns/paragraphs and words/characters (position/size) and font styles (name/size) by choosing XML/HTML as the output format. This information can be later exploited for aligning etc. However, how much you gain from extracting style and bounding box information depends heavily on the quality of the original pdf. For low quality pdfs, mislabeling of font size and style can be common, which means the information cannot be used reliably. Third, explore training the OCR. Tesseract can be trained to improve OCR though training it isn’t straightforward. Fourth, explore good professional OCR engines such as Abbyy FineReader (See R Package connecting to Abbyy FineReader API). OCR in Abbyy FineReader can be easily improved by adding training data, and tuning various options for identifying the proper ‘area order’ (which text area follows which, which portion of the page isn’t part of text area etc.).

Post-processing: Correcting Errors

The OCR process makes certain kinds of errors. For instance, ‘i’ may be confused for an ‘l’ or for a ‘pipe.’ And these errors are often repeated. One consequence of these errors is that some words are misspelled systematically. One way to deal with such errors is to simply search and replace particular strings (see script). When using search and replace, it pays to be mindful of problems that may ensue from searching and replacing short strings. For instance, replacing ‘lt’ with ‘it’ may mean converting ‘salt’ to ‘sait.’

Note: For a more general account on matching dirty data, read this article.

It is typically useful to write a search and replace script that works off a database of search terms and the terms you propose to replace them with. For one, it allows you to apply the same set of corrections to many documents. For two, it can be easily amended and rerun. While writing these scripts (see script), it is important to keep issues to do with text encoding in mind; OCR yields some ligatures (e.g., fi) and some other Unicode characters.
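Here is a minimal sketch of such a script, not the linked one. The corrections table is hypothetical. It matches whole words only, which avoids turning ‘salt’ into ‘sait’, and it folds ligatures into plain letters before matching.

```python
import re
import unicodedata

# Hypothetical corrections table: OCR error -> intended word.
CORRECTIONS = {"lt": "it", "tbe": "the", "prograrnming": "programming"}


def normalize(text):
    # NFKC folds compatibility characters, e.g. the "fi" ligature (U+FB01),
    # into plain letters. It also rewrites other compatibility characters.
    return unicodedata.normalize("NFKC", text)


def build_pattern(corrections):
    # Longest keys first; \b restricts matches to whole words.
    keys = sorted(corrections, key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")


def correct(text, corrections=CORRECTIONS):
    pattern = build_pattern(corrections)
    return pattern.sub(lambda m: corrections[m.group(1)], normalize(text))


print(correct("lt is tbe \ufb01rst salt prograrnming class"))
```

Matching is case-sensitive here; whether to also correct capitalized variants is a choice to make per project.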

Searching and replacing particular strings can prove time-consuming as the same word can often be misspelled in tens of different ways. For instance, in a recent project, the word “programming” was misspelled in no fewer than 28 ways. The easiest extension to this algorithm is to search for patterns. For instance, some number of sequential errors at any point in the string. One can extend it to include various ‘edit distances’, e.g., Levenshtein distance, which is the minimum number of single-character insertions, deletions, and substitutions needed to convert one word to another, allowing the user to handle non-sequential errors (see script). Again the trick is to keep the length of the string in mind as false positives may abound. One can do that by choosing a metric that factors in the size of the string and the number of errors. For instance, a metric like (Levenshtein distance)/(size of the original string). Lastly, rather than use edit distance, one can apply ‘better’ pattern matching algorithms, such as Ratcliff/Obershelp.
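A minimal sketch of the length-adjusted metric. The 0.2 threshold is an arbitrary assumption to be tuned per project. For the last option, Python’s difflib.SequenceMatcher is documented as being similar to Ratcliff/Obershelp, so it is used here as a stand-in.

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def normalized_distance(candidate, target):
    """Edit distance divided by the length of the intended word."""
    return levenshtein(candidate, target) / len(target)


def is_probable_match(candidate, target, max_ratio=0.2):
    return normalized_distance(candidate, target) <= max_ratio


print(is_probable_match("prograrnming", "programming"))  # long word, two edits
print(is_probable_match("sait", "salt"))                 # short word, one edit
print(SequenceMatcher(None, "prograrnming", "programming").ratio())
```

With this threshold, one edit in a four-letter word is rejected while two edits in an eleven-letter word are accepted, which is the point of dividing by length.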

Spell checks, a combination of pattern matching libraries and databases (dictionaries), are another common way of searching and replacing strings. There are various freely available spell-check databases (to be used with your own pattern matching algorithm) and libraries, including pyEnchant for Python. One can even use VB to call MS-Word’s fairly reasonable spell checking functions. However, if the document contains lots of unique proper nouns, spell check is liable to create more problems than it solves. One way to reduce these errors is to (manually) amend the database. Another is to limit corrections to words of a certain size, or to cases where the suggested words and the words in text differ only by certain kinds of letters (or non-letters, ‘pipe’ for the letter l). One can also leverage ‘Google suggest’ to fix spelling errors. Lastly, spell checks built for particular OCR errors, such as ocrspell, can also be used. If these methods still yield too many false corrections, one can go for a semi-automated approach: use spell-checks to harvest problematic words and recommended replacements and then let people pick the right version. A few tips for creating human-assisted spell check versions: eliminate duplicates, and provide the user with neighboring words (2 words before and after can work for some projects). Lastly, one can use M-Turk to iteratively proof-read the document (see TurkIt).
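The harvesting step of the semi-automated approach can be sketched as follows. The tiny word set stands in for a real spell-check database, and the function name is hypothetical. Duplicates are dropped and each suspect word is shown with two neighboring words on either side.

```python
import re


def harvest_suspects(text, dictionary, window=2):
    """Return (words before, suspect word, words after) for each distinct
    word not found in the dictionary, in order of first appearance."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    suspects = {}
    for i, word in enumerate(words):
        key = word.lower()
        if key in dictionary or key in suspects:
            continue  # known word, or a duplicate of an earlier suspect
        before = " ".join(words[max(0, i - window):i])
        after = " ".join(words[i + 1:i + 1 + window])
        suspects[key] = (before, word, after)
    return list(suspects.values())


# Stand-in dictionary and OCR output (both made up for illustration).
dictionary = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog", "and"}
ocr_text = "The qulck brown fox jumps over the lazv dog and the qulck fox"

for before, word, after in harvest_suspects(ocr_text, dictionary):
    print(f"{before} [{word}] {after}")
```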

Error Free Multi-dimensional Thinking

1 May

Some recent research suggests that Americans’ policy preferences are highly constrained, with a single dimension able to correctly predict over 80% of the responses (see Jessee 2009, Tausanovitch and Warshaw 2013). Not only that, adding a new (orthogonal) dimension doesn’t improve prediction success by more than a couple of percentage points.

All this flies in the face of conventional wisdom in American Politics, which is roughly antipodal to the new view: most people’s policy preferences are unstructured. In fact, many people don’t have any real preferences on many of the issues (‘non-preferences’). Evidence that is most often cited in support of this view comes from Converse – weak correlation between preferences across measurement waves spanning two years (r ~ .4 to .5), and even lower within wave cross-issue correlations (r ~ .2).

What explains this double disagreement — over the authenticity of preferences, and over the structuration of preferences?

First, the authenticity of preferences. When reports of preferences change across waves, is it a consequence of attitude change or non-preferences or measurement error? In response to concerns about long periods between test-retest – which allowed for opinions to genuinely change – researchers tried shorter time periods. Correlations were notably stronger (r ~ .6 to .9; see Brown 1970). But the sheen of these healthy correlations was worn off by concerns that stability was merely an artifact of people remembering and reproducing what they put down last time.

Redemption of correlations over longer time periods came from Achen (1975). While few of the assumptions behind the redemption are correct – notably uncorrelated errors (across individuals, waves, etc.) – for inferences to be seriously wrong, much has to go wrong. More recently, and, subject to validation, perhaps more convincingly, work by Dean Lacy suggests that once you take out the small number of implausible transitions between waves – those from one end of the scale to another – cross-wave correlations are fairly healthy. (This is exactly opposite to the conclusion Converse came to based on a Markov model; he argued that aside from a few consistent responses, the rest of the responses were mostly noise.) Much simpler but informative tests are still missing. For instance, it seems implausible that lots of people who hold well-defined preferences on an issue would struggle to pick even the right side of the scale when surveyed. Tallying stability of dichotomized preferences would be useful.
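One way to tally stability of dichotomized preferences, as a minimal sketch. It assumes a 7-point scale with 4 as the midpoint and drops respondents who sit on the midpoint in either wave; both are my assumptions, and the responses are made up.

```python
def side(response, midpoint=4):
    """-1 for left of the midpoint, 1 for right, 0 for the midpoint itself."""
    if response < midpoint:
        return -1
    if response > midpoint:
        return 1
    return 0


def same_side_share(wave1, wave2, midpoint=4):
    """Share of respondents on the same side of the scale in both waves,
    among those who picked a side in both waves."""
    pairs = [(side(a, midpoint), side(b, midpoint)) for a, b in zip(wave1, wave2)]
    informative = [(a, b) for a, b in pairs if a != 0 and b != 0]
    return sum(a == b for a, b in informative) / len(informative)


# Hypothetical responses from seven people in two waves.
wave1 = [1, 2, 6, 7, 3, 5, 4]
wave2 = [2, 1, 7, 5, 6, 6, 3]
print(round(same_side_share(wave1, wave2), 3))
```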

Some other purported evidence for the authenticity of preferences has come from measurement error models that rest upon more sizable assumptions. These models assume an underlying trait (or traits) and pool preferences over disparate policy positions (see, for instance, Ansolabehere, Rodden, and Snyder 2006 but also Tausanovitch and Warshaw 2013). How do we know there is an underlying trait? That isn’t clear. Generally, it is perfectly okay to ask whether preferences are correlated, less so to simply assume that preferences are structured by an unobserved underlying mental construct.

With the caveat that dimensions may not reflect mental constructs, we next move to assessing claims about the dimensionality of preferences. Differences between recent results and conventional wisdom about “constraint” may be simply due to an increase in structuration of preferences over time. However, research suggests that constraint hasn’t increased over time (Baldassarri and Gelman 2008). Perhaps more plausibly, dichotomization, which presumably reduces measurement error, is behind some of the differences. There are of course less ham-handed ways of reducing measurement error. For instance, using multiple items to measure preferences on a single policy, as psychologists often do. Since it cannot be emphasized enough, the lesson of the past two paragraphs is: keep adjustments for measurement error, and measurement of constraint separate.

Analysis suggesting higher constraint may also be an artifact of analysts’ choices. Dimension reduction techniques are naturally sensitive to the pool of items. If a large majority of the items solicit preferences on economic issues (as in Tausanovitch and Warshaw 2013), the first principal component will naturally pick preferences on that dimension. Since the majority of the gains would come from correctly predicting a large majority of the items, gains in percentage correctly predicted would be poor at judging whether there is another dimension, say preferences on cultural issues. Cross-validation across selected large item groups (large enough to overcome idiosyncratic error) would be a useful strategy. And then again, gains in percentage correctly predicted over the entire population may miss subgroups with very different preference structures. For instance, Blacks and Catholics, who tend to be more socially conservative but economically liberal. Lastly, it is possible that preferences on some current issues (such as those used by Jessee 2009) may be more structured (by political conflict) than some old standing issues.

Gandhi And His Critics

7 Mar

Gandhi could never come to terms with the fact that he took leave from his dying father to have sex with his wife; his father died while he copulated. This episode produced a lifelong obsession with sanitation and with overcoming sexual desire (or so Freudians will claim). Unrelatedly, Gandhi had unconventional (even bad, for their time) ideas about some other important matters – he wasn’t a fan of industrialization. All this is well known.

Was Gandhi a hidden, if not manifest, Hindu nationalist with an upper caste agenda? None too careful ideological hobbyist historians like Arundhati Roy will have you believe that. Do they have a point? No.

There are a great many similarities between how Jinnah and Ambedkar argued their cases with Gandhi. Not much distinguishes how Gandhi responded to each, often refusing to agree to the ‘facts’ that motivated their arguments, and always disagreeing with the claim that there was just one solution (the solution they proposed) to the problem they had identified. Gandhi saw both these leaders as too infatuated with their solutions (Gandhi was a touch too infatuated with his own solutions). He thought their solutions were irresponsible, if not illogical. Gandhi saw eye to eye with both Jinnah and Ambedkar on the problems (we have good evidence on that), but never on the solutions. Does that make him opposed to their aims? No. His aims were the same as theirs, if not more ambitious.

(Upper caste) Hindus are never going to change. Replace ‘upper caste Hindus’ with any other group and you have a fair gist of the dominant understanding of people of ‘other groups.’ No easier caricature of humanity than this. If you believe that, the solution is obvious. Kill or split. Order restored. Except often enough order isn’t. The legacy of hatred lives on. The oppressed mutate into oppressors of their ‘own’ kind. (Who is your own is something we don’t think about enough, relying often on simple heuristics. Is Lalu Prasad Yadav a well-wisher of all Yadavs? I think not. The same goes for enemies.)

You need more courage to see the greater truth – that people so thoughtlessly cruel can just as easily become defenders of enlightened ‘common sense’, that certain truths can be understood by people and that many will (and do) happily sacrifice their material advantage once they understand those facts. You also need courage to work from this greater truth. Creating change in people isn’t easy. Quite the opposite. But over the long run, it is perhaps the only solution.

But then, a lot of change (both positive and negative) has come incidentally, not as a result of conscious programs. Demographics along with particular democratic institutions in India have increased the political power of the lower castes (though like everybody, they haven’t always used it wisely). And economic liberalization, brought about for different reasons, may have done more to erase caste boundaries than many other conscious attempts.

Capuchin Monkeys and Fairness: I Want At Least As Much As The Other

1 Dec

In a much-heralded experiment, we see that a Capuchin monkey rejects a reward (food) for doing a task after seeing another monkey being rewarded with something more appetizing for doing the same task. It has been interpreted as evidence for our ‘instinct for fairness’. But there is more to the evidence. The fact that the monkey that gets the heftier reward doesn’t protest the more meager reward for the other monkey is not commented upon though highly informative. Ideally, any weakly reasoned deviation from equality should provoke a negative reaction. Monkeys who get the longer end of the stick, even when aware that others are getting the shorter end of the stick, don’t complain. Primates are peeved only when they are made aware that they are getting the short end of the stick. Not so much if someone else gets it. My sense is that it is true for most humans as well – people care far more about holding the short end of the stick themselves than about others holding it. It is thus incorrect to attribute such behavior to an ‘instinct for fairness’. A better attribution may be to the following rule: I want at least as much as the others are getting.

Sampling on M-Turk

13 Oct

In many of the studies that use M-Turk, there appears to be little strategy to sampling. A study is posted (and reposted) on M-Turk till a particular number of respondents take the study. If the pool of respondents reflects true population proportions, if people arrive in no particular order, and all kinds of people find the monetary incentive equally attractive, the method should work well. There is reasonable evidence to suggest that at least points 1 and 3 are violated. One costly but easy fix for the third point is to increase payment rates. We can likely do better.

If we are agnostic about the variable on which we want precision, here’s one way to sample: Start with a list of strata, and their proportions in the population of interest. If the population of interest is US adults, the proportions are easily known. Set up screening questions, and recruit. Raise price to get people in cells that are running short. Take simple precautions. For one, to prevent gaming, do not change the recruitment prompt to let people know that you want X kinds of people.
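A minimal sketch of the bookkeeping. The strata, shares, and the payment rule are all hypothetical; the rule simply scales pay up linearly as a cell falls further behind its target, and is one of many ways to ‘raise price.’

```python
def cell_targets(population_shares, total_n):
    """Number of respondents wanted in each stratum. Rounded targets may not
    sum exactly to total_n."""
    return {cell: round(share * total_n) for cell, share in population_shares.items()}


def shortfalls(targets, recruited):
    """How many more respondents each stratum needs."""
    return {cell: max(0, target - recruited.get(cell, 0))
            for cell, target in targets.items()}


def next_payment(base_pay, target, recruited, max_multiplier=2.0):
    """Hypothetical rule: pay rises linearly from base_pay (cell full)
    to base_pay * max_multiplier (cell empty)."""
    fill_rate = min(1.0, recruited / target) if target else 1.0
    return round(base_pay * (1 + (max_multiplier - 1) * (1 - fill_rate)), 2)


# Hypothetical age strata and recruitment so far.
shares = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}
recruited = {"18-34": 60, "35-54": 50, "55+": 14}

targets = cell_targets(shares, total_n=200)
print(shortfalls(targets, recruited))
print({cell: next_payment(1.00, targets[cell], recruited[cell]) for cell in targets})
```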

Bad Weather: Getting Data on Weather in a ZIP Code on a Particular Date

27 Jun

High-quality weather data are public. But they aren’t easy to make use of.

Some thoughts and some software for finding out the weather in a particular ZIP Code on a particular day (or a set of dates).

Some brief ground clearing before we begin. Weather data come from weather stations, which can belong to any of the five or more “networks,” each of which collects somewhat different data, sometimes labels the same data differently, and has different reporting protocols. The only geographic information that typically comes with weather stations is their latitude and longitude. By “weather,” we may mean temperature, rain, wind, snow, etc. and we may want data on these for every second, minute, hour, day, month, etc. It is good to keep in mind that not all weather stations report data for all units of time, and there can be a fair bit of missing data. Getting data at coarse time units like day, month, etc. typically involves making some decisions about what statistic is the most useful. For instance, you may want minimum and maximum for daily temperature and totals for rainfall and snow. With that primer, let’s begin.
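To make the aggregation decision concrete, here is a minimal sketch that rolls hourly records up to the day: minimum and maximum for temperature, totals for rain. The record layout and field names are assumptions, not any network’s actual format. Missing values are skipped, which means a total computed from a day with gaps understates rainfall.

```python
from collections import defaultdict


def daily_summary(hourly_records):
    """Aggregate hourly records to daily min/max temperature and total rain."""
    by_day = defaultdict(list)
    for record in hourly_records:
        by_day[record["date"]].append(record)

    summary = {}
    for day, records in by_day.items():
        temps = [r["temp_c"] for r in records if r["temp_c"] is not None]
        rain = [r["rain_mm"] for r in records if r["rain_mm"] is not None]
        summary[day] = {
            "temp_min": min(temps) if temps else None,
            "temp_max": max(temps) if temps else None,
            "rain_total": sum(rain) if rain else None,
            "n_obs": len(records),  # lets you flag days with sparse reporting
        }
    return summary


# Made-up hourly records with one missing temperature reading.
records = [
    {"date": "2012-06-27", "temp_c": 14.0, "rain_mm": 0.0},
    {"date": "2012-06-27", "temp_c": None, "rain_mm": 1.5},
    {"date": "2012-06-27", "temp_c": 22.5, "rain_mm": 0.5},
    {"date": "2012-06-28", "temp_c": 16.0, "rain_mm": None},
]
print(daily_summary(records))
```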

We begin with what not to do. Do not use the NOAA web service. The API provides a straightforward way to get “weather” data for a particular ZIP Code for a particular month. Except, the requests often return nothing. It isn’t clear why. The documentation doesn’t say whether the search for the closest weather station is limited to some radius; absent such a limit, one should get data for all ZIP Codes and all dates. Nor does the API return the distance between the ZIP Code and the weather station from which it got the data, though one can get that post hoc using Google Geocoding API. However, given the possibility that the backend for the API would improve over time, here’s a script for getting the daily weather data, and hourly precipitation data.

On to what can be done. The “web service” that you can use is Farmer’s Almanac’s. Sleuthing using scripts that we discuss later reveals that the Almanac reports data from the NWS-USAF-NAVY stations (ftp link to the data file). And it appears to have data for most times, though no information is provided on the weather station from which it got the data or on its distance to the ZIP Code.

If you intend to look for data from GHCND, COOP, or ASOS, there are two kinds of crosswalks that you can create: 1) from ZIP Codes to weather stations, and 2) from weather stations to ZIP Codes. I assume that we don’t have access to shapefiles (for census ZIP Codes) and that postal ZIP Codes encompass a geographic region. To create a weather station to ZIP Code crosswalk, a web service such as Geonames or Google Geocoding API can be used. If the station lat./long. is in the ZIP Code, the distance comes up as zero. Otherwise the distance is calculated as distance from the centroid of the ZIP Code (see geonames script that finds 5 nearest ZIPs for each weather station). For creating a ZIP Code to weather station crosswalk, we get centroids of each ZIP using a web service such as Google (or use already provided centroids from free ZIP databases), and then find the “nearest” weather stations by calculating distances to each of the weather stations. For a given set of ZIP Codes, you can get a list of closest weather stations (you can choose to get n closest stations, or say all weather stations within x kilometers radius, and/or choose to get stations from particular network(s)) using the following script. The output lists, for each ZIP Code, weather stations arranged by proximity. The task of getting weather data from the closest station is simple thereon: get data (on a particular set of columns of your choice) from the closest weather station from which the data are available. You can do that for a particular ZIP Code and date (and date range) combination using the following script.
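The ZIP Code to weather station step can be sketched as follows. This is not the linked script: the station records, IDs, and coordinates are made up, and distance is great-circle (haversine) distance from the ZIP centroid, assuming a spherical Earth.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean radius; the Earth is treated as a sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/long points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_stations(zip_centroid, stations, n=None, max_km=None, networks=None):
    """Stations sorted by distance from the ZIP centroid, optionally limited to
    the n closest, to a radius, and/or to particular networks."""
    lat, lon = zip_centroid
    ranked = []
    for station in stations:
        if networks is not None and station["network"] not in networks:
            continue
        distance = haversine_km(lat, lon, station["lat"], station["lon"])
        if max_km is not None and distance > max_km:
            continue
        ranked.append((distance, station["id"]))
    ranked.sort()
    return ranked[:n]  # n=None keeps every station that passed the filters


# Hypothetical stations and ZIP centroid.
stations = [
    {"id": "A", "network": "GHCND", "lat": 37.5, "lon": -122.1},
    {"id": "B", "network": "ASOS", "lat": 37.4, "lon": -122.0},
    {"id": "C", "network": "COOP", "lat": 38.4, "lon": -122.1},
]
centroid = (37.4, -122.1)

for distance, station_id in nearest_stations(centroid, stations, max_km=50):
    print(station_id, round(distance, 1))
```

Looping over every station for every ZIP Code is fine for a few thousand ZIP Codes; for much larger jobs a spatial index would be the natural next step.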