What’s the Next Best Thing to Learn?

10 Oct

With Gaurav Gandhi

Recommendation engines are everywhere. These systems recommend what shows to watch on Netflix and what products to buy on Amazon. Since at least the Netflix Prize, the conventional wisdom has been that recommendation engines have become very good. Except that they are not. Some of the deficiencies are deliberate. Netflix has made a huge bet on its shows, and it makes every effort to highlight its Originals over everything else. Other deficiencies are a result of a lack of content. The point is easy to verify: how often have you done a futile extended search for something “good” to watch?

Take the above two explanations out, and still, the quality of recommendations is poor. For instance, YouTube struggles to recommend high-quality, relevant videos on machine learning. It fails on relevance because it recommends videos that are either too difficult or too easy. And it fails on quality—the opaqueness of the explanations makes most points hard to understand. When I look back, most of the high-quality content on machine learning that I have come across is a result of subscribing to the right channels—human curation. Take another painful aspect of most recommendations: a narrow understanding of our interests. You watch a few food travel shows, and YouTube will recommend twenty others.

Problems

What is the next best thing to learn? It is an important question to ask. To answer it, we need to know the objective function. But the objective function is too hard to formalize and yet harder to estimate. Is it money we want, or is it knowledge, or is it happiness? Say we decide it’s money. For most people, after accounting for risk, the answer may be: learn to program. But what would the equilibrium effects be if everyone did that? Not great. So we ask a simpler question: what is the next reasonable unit of information to learn?

Meno’s paradox states that we cannot be curious about something that we know, and, using a Rumsfeld-ism, we cannot be curious about things we don’t know we don’t know. The domain of things we can be curious about is hence the set of things we know that we don’t know. For instance, I know that I don’t know enough about dark matter. But the complete set of things we could learn also includes things we don’t know we don’t know.

The set of things that we don’t know is very large. But it is not the set of things we are ready to learn. The set of relevant information is bounded by the frontier of our knowledge. The next unit of information we are ready to consume is not a random unit from the set of things we don’t know but one from the set of things we can learn given what we know. As I note above, many of the ML lectures that YouTube recommends are beyond me.

There is a further constraint on ‘relevance.’ Of the relevant set, we are only curious about things we are interested in. But how we entice people to learn about things that they will find interesting is an open question. It is the same challenge Netflix faces when trying to educate people about movies they haven’t heard of or seen.

Conditional on knowing the next best substantive unit, we care about quality.  People want content that will help them learn what they are interested in most efficiently. So we need to solve for the best source to learn the information.

Solutions

Known-Known

For things we know, the big question is how to optimally retain them, whether through flashcards or some other system.

Exploring the Unknown

  1. Learn What a Person Knows (and Doesn’t Know): The first step in learning the set of information that a person doesn’t know is to learn what the person knows. The best way to learn what a person knows is to build a system that monitors all the information we consume on the Internet.
  2. Classify. Next, use ML to split the information into broad areas.
  3. Estimate the Frontier of Knowledge. To establish the frontier of knowledge, we need to put what people know on a scale. We can derive that scale by exploiting syllabi and class structure (101, 102, etc.) and associated content and then scaling all the content (YouTube videos, books, etc.) by estimating similarity with the relevant level of content (see the sketch after this list). (We can plausibly also establish a scale by following the paths people follow–videos that they start but don’t finish are good indications of content being too easy or too hard, for instance.)

    We can also use tools like quizzes to establish the frontier, but the quizzes will need to be built from a system that understands how information is stacked.
  4. Estimate Quality of Content. Rank content within each topic and each level by quality. Infer quality through both explicit and implicit measures. Use this to build the relevant set.
  5. Recommend from the relevant set. Recommend a wide variety of content from the relevant set.
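
To make steps 3 and 5 concrete, here is a minimal sketch, assuming we have reference content tagged with course levels (e.g., scraped from syllabi) and text for the candidate content. The corpora, the levels, and the user’s frontier below are hypothetical placeholders.

```python
# A sketch of leveling content by similarity to syllabus-tagged references,
# then recommending only content near the user's estimated frontier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Reference content with known levels (101, 102, ...), e.g., scraped from syllabi.
reference_texts = [
    "introduction to probability, counting, and basic distributions",
    "linear regression, least squares, and model fit",
    "measure-theoretic probability, martingales, and stopping times",
]
reference_levels = [101, 102, 301]

# Unlabeled content (e.g., video transcripts) we want to place on the same scale.
candidate_texts = [
    "a gentle walkthrough of linear regression and least squares",
    "martingales, stopping times, and optional stopping",
]

vectorizer = TfidfVectorizer()
ref_matrix = vectorizer.fit_transform(reference_texts)
cand_matrix = vectorizer.transform(candidate_texts)

# Assign each candidate the level of its most similar reference document.
similarities = cosine_similarity(cand_matrix, ref_matrix)
candidate_levels = [reference_levels[row.argmax()] for row in similarities]

# Recommend only content at, or just above, the user's estimated frontier.
user_frontier = 102  # hypothetical: estimated from consumption history or quizzes
relevant = [text for text, level in zip(candidate_texts, candidate_levels)
            if user_frontier <= level <= user_frontier + 100]
print(relevant)
```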

Everywhere: Meeting Consumers Where They Are

1 Sep

Content delivery is not optimized for the technical stack used by an overwhelming majority of people. The technical stack of people who aren’t particularly tech-savvy, especially older people (over ~60 years), is often a messaging application like FB Messenger or WhatsApp. They currently do not have a way to ‘subscribe’ to Substack newsletters or podcasts or YouTube videos in the messaging application that they use (see below for an illustration of how this may look in the iPhone messaging app). They miss content. And content producers have an audience hole.

Credit: Gaurav Gandhi

A lot of content is distributed only via email or within a specific application. There are good strategic reasons for that—you get to monitor consumption, recommend accordingly, control monetization, etc. But the reason why platforms like Substack, which enable independent content producers, limit distribution to email is not as immediately clear. It is unlikely to be a deliberate decision. More likely, it stems from a lack of infrastructure that connects publishing to the various messaging platforms. The future of messaging platforms is Slack—a platform that integrates as many applications as possible. As WhatsApp rolls out its Business API, there is potential to build an integration that allows producers to deliver premium content, leverage other kinds of monetization, like ads, and even build a recommendation stack. Eventually, it would be great to build that kind of integration for each of the messaging platforms, including iMessage, FB Messenger, etc.
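
As a rough illustration of what such an integration could look like, here is a sketch of a publish hook that pushes a new post to subscribers over a messaging platform. The `MessagingClient` class and its `send_message` method are hypothetical stand-ins for whichever platform API (e.g., the WhatsApp Business API) the integration would actually use.

```python
# A sketch of a publish hook that delivers a new newsletter post to subscribers
# over a messaging platform. MessagingClient is a hypothetical stand-in for a
# real platform API client.
from dataclasses import dataclass

@dataclass
class Post:
    title: str
    url: str
    teaser: str

class MessagingClient:
    """Hypothetical wrapper around a messaging platform's send-message endpoint."""
    def send_message(self, phone_number: str, text: str) -> None:
        # In a real integration, this would call the platform's HTTP API.
        print(f"-> {phone_number}: {text}")

def publish(post: Post, subscribers: list[str], client: MessagingClient) -> None:
    """Deliver a teaser plus link for each new post to every subscriber."""
    body = f"{post.title}\n{post.teaser}\nRead more: {post.url}"
    for number in subscribers:
        client.send_message(number, body)

publish(Post("New essay", "https://example.com/essay", "Why distribution matters."),
        ["+15550100", "+15550101"], MessagingClient())
```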

Let me end by noting that there is something special about WhatsApp. No one has replicated the mobile-phone-number-based messaging platform at the same scale. And the idea of enabling a larger application stack based on phone numbers remains largely unexplored. Duo and FaceTime are great examples, but there is potential for so much more. For instance, a calendar app that runs on the mobile phone ID architecture.

Growth Funnels

24 Sep

You spend a ton of time building a product to solve a particular problem. You launch. Then, either the kinds of people whose problem you are solving never arrive or they arrive and then leave. You want to know why because if you know the reason, you can work to address the underlying causes.

Often, however, businesses only have access to observational data. And since we can’t answer the why with observational data, people have swapped the why with the where. Answering the where can give us a window into the why (though only a window). And that can be useful.

The traditional way of posing ‘where’ is called a funnel. Conventionally, funnels start when the customer arrives on the website. We will forgo convention and start at the top.

There are only three things you should work to do optimally, conditional on the product you have built:

  1. Help people discover your product
  2. Effectively convey the relevant value of the product to those who have discovered your product
  3. Help people effectively use the product

p.s. When the product changes, you have to do all three all over again.

One way funnels can potentially help is triage. How big is the ‘leak’ at each ‘stage’? The funnel across the first two steps is: of the people who discovered the product, how many did we successfully communicate the relevant value of the product to? Posing the problem in such a way makes it seem more powerful than it is. To come up with a number, you generally only have noisy behavioral signals, for instance, the proportion of people who visit the site who sign up. But a low proportion could be the result of a large denominator—lots of people are visiting the site but the product is not relevant for most of them. (If bringing people to the site costs nothing, there is nothing to do.) Or it could be because you have a super kludgy sign-up process. You could drill down to try to get at such clues, but the number of potential locations for drilling remains large.

That brings us to the next point. Macro-triaging is useful but only for resource-allocation kinds of decisions based on some assumptions. It doesn’t provide a way to get concrete answers to real issues. For concrete answers, you need funnels around concrete workflows. For instance, for a referral ‘product’ for Airbnb, one way to define the steps (based on Gustaf Alstromer) is as follows (see the sketch after the list):

  1. how many people saw the link asking people to refer
  2. of the people who saw the link, how many clicked on it
  3. of the people who clicked on it, how many invited people
  4. of the people who were invited, how many signed up (as a user, guest, host)
  5. of the people who signed up, how many made the first booking
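
With made-up counts, the step-to-step conversion for such a funnel can be computed directly; the leakiest transitions are the ones worth drilling into first.

```python
# A minimal sketch of computing step-to-step conversion for the referral funnel
# above. The counts are made-up illustrations.
funnel = [
    ("saw referral link", 100_000),
    ("clicked link", 18_000),
    ("sent invites", 6_000),
    ("invitee signed up", 2_400),
    ("invitee made first booking", 600),
]

for (prev_step, prev_n), (step, n) in zip(funnel, funnel[1:]):
    print(f"{prev_step} -> {step}: {n / prev_n:.1%}")
print(f"overall: {funnel[-1][1] / funnel[0][1]:.2%}")
```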

Such a funnel allows you to optimize the workflow by experimenting with the user interface at each stage. It allows you to analyze what the users were trying to do but failed to do, or took a long time doing. It also allows you to analyze how optimally (from a company’s perspective) users are taking an action. For instance, people may want to invite all their friends, but the UI doesn’t have a convenient way to import contacts from email.

Sometimes just writing out the steps in a workflow can be useful. It allows people to think about whether some steps are actually needed. For instance, is signing in needed before we allow a user to understand the value of the product?

Quality Data: Plumbing ML Data Pipelines

6 Aug

What’s the difference between a scientist and a data scientist? Scientists often collect their own data, and data scientists often use data collected by other people. That is part jest but speaks to an important point. Good scientists know their data. Good data scientists must know their data too. To help data scientists learn about the data they use, we need to build systems that give them good data about the data. But what is good data about the data? And how do we build systems that deliver that? Here’s some advice (tailored toward rectangular data for convenience):

  • From Where, How Much, and Such
    • Provenance: how were each of the columns in the data created (obtained)? If the data are derivative, find out the provenance of the original data. Be as concrete as possible, linking to scripts, related teams, and such.
    • How frequently are the data updated?
    • Cost per unit of data, e.g., a cell in rectangular data.
    Both the frequency with which data are updated and the cost per unit of data may change over time. Provenance may change as well: a new team (person) may start managing the data. So the person who ‘owns’ the data must come back to these questions every so often. Come up with a plan.
  • What? To know what the data mean, you need a data dictionary. A data dictionary explains the key characteristics of the data. It includes:
    1. Information about each of the columns in plain language.
    2. How were the data collected? For instance, if you conducted a survey, you need the question text and the response options (if any) that were offered, along with the ‘mode’, where the question appeared in the sequence of questions, whether it was alone on the screen, etc.
    3. Data type
    4. How (if at all) are missing values generated?
    5. For integer columns, it gives the range, sd, mean, median, n_0s, and n_missing. For categorical columns, it gives the number of unique values, what each label means, and a frequency table that includes n_missing (if missing can be of multiple types, show a row for each).
    6. The number of duplicates in the data, whether they are allowed, and why you would expect to see them.
    7. Number of rows and columns
    8. Sampling
    9. For supervised models, store correlation of y with key x_vars
  • What If? What if you have a question? Who should you bug? Who ‘owns’ the ‘column’ of data?

Store these data in JSON so that you can validate future ingests against them. Produce the JSON for each update. You can then flag when data are some number of standard deviations above or below the last ingest.
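
Here is a minimal sketch of that idea: produce a JSON summary per ingest and flag numeric columns whose mean drifts more than k standard deviations from the previous ingest. The column names and the threshold are illustrative.

```python
# A sketch of a per-ingest JSON summary plus a simple drift check against the
# previous summary. Columns and threshold are illustrative.
import json
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    summary = {}
    for col in df.columns:
        s = df[col]
        entry = {"dtype": str(s.dtype), "n_missing": int(s.isna().sum())}
        if pd.api.types.is_numeric_dtype(s):
            entry.update({"mean": float(s.mean()), "sd": float(s.std()),
                          "min": float(s.min()), "max": float(s.max())})
        else:
            entry["n_unique"] = int(s.nunique())
        summary[col] = entry
    return summary

def flag_drift(old: dict, new: dict, k: float = 3.0) -> list[str]:
    """Return numeric columns whose mean moved more than k old-sd from last ingest."""
    flags = []
    for col, o in old.items():
        n = new.get(col)
        if n and "mean" in o and o["sd"] and abs(n["mean"] - o["mean"]) > k * o["sd"]:
            flags.append(col)
    return flags

old = summarize(pd.DataFrame({"age": [30, 40, 50], "city": ["a", "b", "a"]}))
new = summarize(pd.DataFrame({"age": [300, 400, 500], "city": ["a", "b", "c"]}))
print(json.dumps(old, indent=2))
print("drifted columns:", flag_drift(old, new))
```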

Store all this metadata with the data. For example, you can extend the dataframe class in Scala to make it so.

Auto-generate reports in markdown with each ingest.

In many ML applications, you are also ingesting data back from the user. So you need the same as above for the data you are getting from the user (and some of it at least needs to match the stored data). 

For any derived data, you need the scripts and the logic, ideally in a notebook. This is your translation function.

Where possible, follow the third normal form of databases. Only store translations when translation is expensive. Even then, think twice.

Lastly, some quality control. Periodically sit down with your team to check whether you should be seeing what you are seeing. For instance, if you are in the survey business, do the completion times make sense? If you are doing supervised learning, get a random sample of labels. Assess their quality. You can also assess the quality by looking at errors in classification that your supervised model makes. Are the errors because the data are mislabeled? Keep iterating. Keep improving. And keep cataloging those improvements. You should be able to ‘diff’ data collection, not just numerical summaries of data. And with the method I highlight above, you should be able to.

Automating Understanding, Not Just ML

27 Jun

Some of the most complex parts of Machine Learning are largely automated. The modal ML person types in simple commands for very complex operations and voila! Some companies, like Microsoft (Azure) and DataRobot, also provide a UI for this. And this has generally not turned out well. Why? Because this kind of system does too little for the modal ML person and expects too much from the rest. So the modal ML person doesn’t use it. And the people who do use it, generally use it badly. The black box remains the black box. But not much is needed to place a lamp in this black box. Really, just two things are needed:

1. A data summarization and visualization engine, preferably with some chatbot feature that guides people smartly through the key points, including the problems. For instance, start with univariate summaries, highlighting ranges, missing data, sparsity, and such. Then, if it is a supervised problem, give people a bunch of loess plots or explain the ‘best fitting’ parametric approximations with y in plain English, such as, “people who eat 1 more cookie live 5 minutes shorter on average.” (See the sketch after this list.)

2. An explanation engine, including what the explanations of observational predictions mean. We already have reasonable implementations of this.
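
As a minimal sketch of the first piece, the snippet below produces univariate summaries and reads out the best-fitting linear slope in plain English. The cookie and lifespan data are fabricated purely for illustration.

```python
# Univariate summaries plus a plain-English readout of a simple fitted slope.
# The data are fabricated for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"cookies_per_day": rng.poisson(2, 500)})
df["lifespan_minutes"] = 4.2e7 - 5 * df["cookies_per_day"] + rng.normal(0, 100, 500)

# Univariate summaries: ranges, missingness, and such.
print(df.describe())
print("missing per column:\n", df.isna().sum())

# 'Best fitting' linear approximation, read out in plain English.
slope, intercept = np.polyfit(df["cookies_per_day"], df["lifespan_minutes"], 1)
print(f"People who eat 1 more cookie per day live about {abs(slope):.1f} "
      f"minutes {'shorter' if slope < 0 else 'longer'} on average.")
```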

When you have both, you have automated complexity thoughtfully, in a way that empowers people, rather than creating a system that enables people to do fancy things badly.

About 85% Problematic: The Trouble With Human-In-The-Loop ML Systems

26 Apr

MIT researchers recently unveiled a system that combines machine learning with input from users to ‘predict 85% of the attacks.’ Each day, the system winnows down millions of rows to a few hundred atypical data points and passes these points on to ‘human experts’ who then label the few hundred data points. The system then uses the labels to refine the algorithm.

At first blush, using data from users in such a way to refine the algorithm seems like the right thing to do, even the obvious thing to do. And there exist a variety of systems that do precisely this. In the context of cyber data (and a broad category of similar such data), however, it may not be the right thing to do. There are two big reasons for that. First, a low false positive rate can be achieved much more easily if we do not care about the false negative rate, and there are good reasons to worry a lot about false negative rates in cyber data. Second, and perhaps more importantly, incorporating user input on complex tasks (or where data is insufficiently rich) reduces to the following: given a complex task and inadequate time, users fall back on cheap heuristics to label the data, and the supervised aspect of the algorithm reduces to learning the cheap heuristics that humans use.
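
A small numeric illustration of the first point: with millions of benign events and a handful of attacks, a system that flags only a few hundred points can report a spectacular false positive rate while missing most attacks. The numbers below are made up.

```python
# Made-up counts showing how a tiny false positive rate can coexist with a
# terrible false negative rate when almost nothing is flagged.
benign, attacks = 10_000_000, 50

flagged_benign, flagged_attacks = 200, 10   # a system that flags a few hundred points
false_positive_rate = flagged_benign / benign
false_negative_rate = (attacks - flagged_attacks) / attacks

print(f"FPR = {false_positive_rate:.5%}")   # looks spectacular
print(f"FNR = {false_negative_rate:.0%}")   # 80% of attacks are missed
```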

Clarifai(ng) the Future of Clarifai

31 Dec

Clarifai is a promising AI start-up. In a short(ish) time, it has made major progress on an important problem. And it is rapidly rolling out products with lots of business potential. But there are still some things that it could do.

As I understand it, the base version of the Clarifai API is trying to do two things at once: a) learn various recognizable patterns in images, and b) rank the patterns based on ‘appropriateness’ and probab_true. I think Clarifai will have to split these two things over time and allow people to input which abstract dimensions are ‘appropriate’ for them. As the idiom goes, an image is worth a thousand words. In an image, there can be information about social class, race, and country, but also shapes, patterns, colors, perspective, depth, time of day, etc. And Clarifai should allow people to pick the dimensions appropriate for the task. Defining dimensions would be hard, but that shouldn’t stymie the effort, and ad hoc advances may be useful. For instance, one dimension could be abstract shapes and colors. Another could be the more ‘human’ dimension, etc.

Extending the logic, Clarifai should support the building of abstract data science applications that solve a particular problem. For instance, say a user is only interested in learning whether a photo features a man or a woman, and the user wants to build a Clarifai-based classifier. (That person is me. The task is inferring the gender of first names. See here.) Clarifai could in principle allow the user to train a classifier that uses all the other information in the images, including jewelry, color, perspective, etc., and provide an out-of-sample error for that particular task. The crucial point is allowing users fuller access to what Clarifai can do and then letting the users manage it at their end. To that end again, input about user objectives needs to be built into the API. Basic hooks could be developed for classification and clustering inputs.
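
A sketch of the proposed workflow, assuming the API can return a per-image vector of tag probabilities: treat those probabilities as features and train a task-specific classifier on the user’s own labels. `get_tag_probabilities` is a hypothetical stand-in for an API client, and the labels and features below are fabricated.

```python
# Use tag probabilities from an image-tagging API as features for a user-defined
# binary task (here, man vs. woman) and report out-of-sample accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

TAGS = ["jewelry", "beard", "dress", "suit", "long hair"]

def get_tag_probabilities(image_path: str) -> np.ndarray:
    # Placeholder: a real implementation would call the tagging API here.
    rng = np.random.default_rng(abs(hash(image_path)) % (2**32))
    return rng.random(len(TAGS))

image_paths = [f"img_{i}.jpg" for i in range(200)]
labels = np.array([i % 2 for i in range(200)])          # user-provided labels
X = np.vstack([get_tag_probabilities(p) for p in image_paths])

clf = LogisticRegression(max_iter=1000)
# Cross-validated accuracy stands in for the out-of-sample error for this task.
print("cv accuracy:", cross_val_score(clf, X, labels, cv=5).mean())
```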

More generally, Clarifai should eventually support more user inputs and a greater variety of outputs. Limiting the product to tagging is a mistake.

There are three other general directions for Clarifai to go in. A product that automatically sections an image into multiple images and tags each section would be useful. This would allow one, for instance, to count the number of women in a photo. Another direction would be to provide the ‘best’ set of tags that collectively describe a set of images. (It may seem to violate the spirit of what I noted above, but it needn’t — a user could want just this.) By the same token, Clarifai could build general-purpose discrimination engines — a list of tags that distinguishes image(s) the best.

Beyond this, the obvious. Clarifai can also provide synonyms of tags to make tags easier to use. And it could allow users to specify if they want, say tags in ‘UK English’ etc.

I Recommend It: A Recommender System For Scholarship Discovery

1 Oct

The rate of production of scholarship has never been higher. And while our ability to discover relevant scholarship per unit of time has never kept pace with the production of knowledge, it too has risen sharply—most recently due to Google Scholar.

The efficiency of discovery of relevant scholarship, however, has plateaued over the last few years, even as the rate of production of scholarship has kept its steady upward climb. Part of the reason the growth has plateaued is that current ways of doing things cannot be made considerably more efficient very quickly. New growth in the rate of discovery will require knowledge discovery systems that get more data on the user’s needs and that have access to structured databases of academic production.

The easiest next step would perhaps be to build a collaborative recommendation system. In particular, think of a system that takes your .bib file, trawls a large database of citation lists, for example, JSTOR or PLoS, and produces recommendations for scholarship you may have missed. The logic of a collaborative recommender system is pretty simple: learn from articles that have cited scholarship similar to yours. If we have metadata on the scholarship, for instance, sub-field, the actual text of an article, or even the abstract, we could recommend based on the extent to which two articles cite the same kind of scholarly article. Direct elicitations (search terms are but one form) from the user can also be added to guide the recommendations. And meta-characteristics, for instance, the page rank of a piece of scholarship, could be used to order recommendations.
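
A minimal sketch of that collaborative logic: score articles in a citation database by the overlap of their reference lists with your .bib file, then recommend references that similar articles cite but you do not. The citation database below is fabricated.

```python
# Recommend missed references by weighting each candidate by the citation overlap
# (Jaccard similarity) between your reference list and the articles that cite it.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

my_refs = {"smith2001", "lee2005", "gandhi2010"}

citation_db = {                       # article id -> set of cited works
    "paperA": {"smith2001", "lee2005", "kumar2008"},
    "paperB": {"smith2001", "gandhi2010", "chen2012", "kumar2008"},
    "paperC": {"doe1999", "roe2003"},
}

scores: dict[str, float] = {}
for refs in citation_db.values():
    sim = jaccard(my_refs, refs)
    for ref in refs - my_refs:
        scores[ref] = scores.get(ref, 0.0) + sim

recommendations = sorted(scores, key=scores.get, reverse=True)
print(recommendations)                # kumar2008 ranks first here
```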

Beyond Anomaly Detection: Supervised Learning from `Bad’ Transactions

20 Sep

Nearly every time you connect to the Internet, multiple servers log a bunch of details about the request. For instance, details about the connecting IPs, the protocol being used, etc. (One popular software for collecting such data is Cisco’s Netflow.) Lots of companies analyze this data in an attempt to flag `anomalous’ transactions. Given the data are low quality—IPs do not uniquely map to people, information per transaction is vanishingly small—the chances of building a useful anomaly detection algorithm using conventional unsupervised methods are extremely low.

One way to solve the problem is to re-express it as a supervised problem. Google, various security firms, security researchers, etc. flag a bunch of IPs every day for various nefarious activities, including hosting malware (passive DNS), scanning, or active attacks. Check to see if these IPs are in the database, and learn from the transactions that include the blacklisted IPs. Using the model, flag transactions that look most similar to the transactions with blacklisted IPs. And validate the worthiness of the flagged transactions with the highest probability of involving a malicious IP by checking to see if the IPs are blacklisted at a future date or by using a subject matter expert.
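
A minimal sketch of the recasting, with illustrative column names and a made-up blacklist: label flow records by whether an endpoint is blacklisted, fit a classifier on flow features, and surface unlabeled flows with the highest predicted probability for review.

```python
# Recast anomaly detection as supervised learning using blacklisted IPs as labels.
# Flow records and the blacklist are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

blacklist = {"203.0.113.7", "198.51.100.9"}

flows = pd.DataFrame({
    "dst_ip":   ["203.0.113.7", "192.0.2.1", "198.51.100.9", "192.0.2.44"],
    "bytes":    [120, 98000, 310, 87000],
    "packets":  [3, 800, 5, 700],
    "dst_port": [445, 443, 23, 443],
})
flows["bad"] = flows["dst_ip"].isin(blacklist).astype(int)

features = ["bytes", "packets", "dst_port"]
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(flows[features], flows["bad"])

# Score flows that do not touch a known-bad IP; the top of this list goes to an
# analyst or is checked against future blacklists for validation.
unlabeled = flows[flows["bad"] == 0].copy()
unlabeled["p_bad"] = model.predict_proba(unlabeled[features])[:, 1]
print(unlabeled.sort_values("p_bad", ascending=False))
```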

Some Aspect of Online Learning Re-imagined

3 Sep

Trevor Hastie is a superb teacher. He excels at making complex things simple. Till recently, his lectures were only available to those fortunate enough to be at Stanford. But now he teaches a free introductory course on statistical learning via the Stanford online learning platform. (I browsed through the online lectures — they are superb, the quality of his teaching, undiminished.)

Online learning, pregnant with rich possibilities, has, however, met its cynics and its first failures. The proportion of students who drop off after signing up for these courses is astonishing (but then, so are the sign-up rates). And some very preliminary data suggest that compared to some face-to-face teaching, students enrolled in online learning don’t perform as well. But these are early days, and much can be done to refine the systems and to promote the interests of the thousands upon thousands of teachers who have much to contribute to the world. In that spirit, some suggestions:

Currently, Coursera, edX and similar such ventures mostly attempt to replicate an off-line course online. The model is reasonable but doesn’t do justice to the unique opportunities of teaching online:

The Online Learning Platform: For Coursera etc. to be true learning platforms, they need to provide rich APIs that allow the development of both learning and teaching tools (and easy ways for developers to monetize these applications). For instance, professors could get access to visualization applications, and students could get access to applications that put them in touch with peers interested in peer-to-peer teaching. The possibilities are endless. Integration with other social networks and other communication tools, at the discretion of students, is an obvious future step.

Learning from Peers: The single teacher model is quaint. It locks out contributions from other teachers. And it locks out teachers from potential revenues. We could turn it into a win-win. Scout ideas and materials from teachers and pay them for their contributions, ideally as a share of the revenue — so that they have a steady income.

The Online Experimentation Platform: So many researchers in education and psychology are curious and willing to contribute to this broad agenda of online learning. But no protocols and platforms have been established to exploit this willingness. The online learning platforms could release more data at one end. And at the other end, they could create a platform for researchers to propose ideas that the community and the company can vet. Winning ideas can form the basis of research that is implemented on the platform.

Beyond these ideas, there are ideas to do with courses that enhance students’ ability to succeed in these courses. Providing students ‘life skills’, either face-to-face or online, and teaching them ways to become more independent learners can allow them to better adapt to the opportunities and challenges of learning online.

Toward a Two-Sided Market on Uber

17 Jun

Uber prices rides based on the availability of Uber drivers in the area, the demand, and the destination. This price is the same for whoever is on the other end of the transaction. For instance, Uber doesn’t take into account that someone in a hurry may be willing to pay more for a quicker service. By the same token, Uber doesn’t allow drivers to distinguish themselves on price (Airbnb, for instance, allows this). It leads to a simpler interface but it produces some inefficiency—some needs go unmet, some drivers go underutilized, etc. It may make sense to try out a system that allows for bidding on both sides.
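
A minimal sketch of what two-sided bidding could look like: riders post the most they will pay, drivers post the least they will accept, and the system matches while the highest remaining bid clears the lowest remaining ask (settling, say, at the midpoint). The prices are made up, and real dispatch would add constraints like location and time.

```python
# Toy two-sided matching: pair the highest bids with the lowest asks while the
# bid clears the ask, settling at the midpoint. Prices are illustrative.
rider_bids = [22.0, 15.0, 9.0]     # willingness to pay, per ride
driver_asks = [8.0, 12.0, 18.0]    # minimum acceptable fare

matches = []
for bid, ask in zip(sorted(rider_bids, reverse=True), sorted(driver_asks)):
    if bid >= ask:
        matches.append((bid, ask, round((bid + ask) / 2, 2)))

print(matches)   # [(22.0, 8.0, 15.0), (15.0, 12.0, 13.5)]
```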

Enabling FOIA: Getting Better Access to Government Data at a Lower Cost

28 May

The Freedom of Information Act is vital for holding the government to account. But little effort has gone into building tools that enable the fulfillment of FOIA requests. As a consequence of this underinvestment, fulfilling FOIA requests often means onerous, costly work for government agencies and long delays for the requesters. Here, I discuss a couple of alternatives.

One tool that can prove particularly effective in lowering the cost of fulfilling FOIA requests is an anonymizer—a tool that detects proper nouns, addresses, telephone numbers, etc., and blurs them. This is easily achieved using modern machine learning methods. To ensure 100% accuracy, humans can quickly vet the suggestions made by the algorithm, along with ‘suspect words.’ Or a captcha-like system that asks people in developing countries to label suspect words as nouns, etc., can be created to further reduce costs.

This covers one conventional solution. Another way to solve the problem would be to create a sandbox architecture. Rather than give requesters stacks of redacted documents, often unnecessarily, one could allow people the right to run certain queries on the data. Results of these queries can be vetted internally. A classic example would be to allow people to query government emails for the total number of times porn domains are accessed via servers at government offices.
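
A minimal sketch of the anonymizer idea, assuming spaCy and its small English model are installed (python -m spacy download en_core_web_sm): use an off-the-shelf named-entity model to find people, places, and organizations, plus a regex for phone numbers, and redact them. Flagged spans would still go to a human for review.

```python
# Redact named entities and phone numbers from text before release.
import re
import spacy

nlp = spacy.load("en_core_web_sm")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
REDACT_LABELS = {"PERSON", "GPE", "LOC", "ORG", "FAC"}

def redact(text: str) -> str:
    doc = nlp(text)
    spans = [(e.start_char, e.end_char) for e in doc.ents if e.label_ in REDACT_LABELS]
    spans += [(m.start(), m.end()) for m in PHONE.finditer(text)]
    for start, end in sorted(spans, reverse=True):   # replace right-to-left
        text = text[:start] + "[REDACTED]" + text[end:]
    return text

print(redact("Contact John Smith at 202-555-0173 about the Denver field office."))
```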

The Human and the Machine: Semi-automated approaches to ML

12 Apr

For a class of problems, a combination of algorithms and human input makes for the most optimal solution. For instance, the software that won the DARPA award three years ago for recreating shredded documents used “human[s] [to] verify what the computer was recommending.” The insight is used in character recognition tasks. I have used it to create software for matching dirty data — the software was used to merge shape files with electoral returns at the precinct level.

The class of problems for which human input proves useful has one essential attribute — humans produce unbiased, if error-prone, estimates for these problems. So for instance, it would be unwise to use humans for making the ‘last mile’ of lending decisions (see also this NYT article). (And that is something you may want to verify with training data.)

Big Data Algorithms: Too Complicated to Communicate?

11 Apr

“A decision is made about you, and you have no idea why it was done,” said Rajeev Date, an investor in data-science lenders and a former deputy director of the Consumer Financial Protection Bureau.

From NYT: If Algorithms Know All, How Much Should Humans Help?

The assertion that there is no intuition behind decisions made by algorithms strikes me as silly. So does the related assertion that such intuition cannot be communicated effectively. We can back out the logic of most algorithms. Heuristic accounts of the logic — e.g., which variables were important — can be given yet more easily. For instance, for inference from seemingly complicated-to-interpret methods such as ensemble methods, intuition for which variables are important can be obtained in the same way as it is for methods like bagging. However, even when specific points are hard to convey, the meta-logic of the system can be explained to the end user.
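
As a minimal illustration, permutation importance backs out which variables matter in an ensemble model on synthetic data; the same readout can anchor a heuristic explanation for the end user.

```python
# Back out variable importance for an ensemble model via permutation importance.
# The data are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=6, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

for i in np.argsort(result.importances_mean)[::-1]:
    print(f"x{i}: importance = {result.importances_mean[i]:.3f}")
```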

What is true, however, is that it isn’t being done. For instance, WSJ covering Orion routing system at UPS reports:

“For example, some drivers don’t understand why it makes sense to deliver a package in one neighborhood in the morning, and come back to the same area later in the day for another delivery. …One driver, who declined to speak for attribution, said he has been on Orion since mid-2014 and dislikes it, because it strikes him as illogical.”

WSJ: At UPS, the Algorithm Is the Driver

Communication architecture is an essential part of all human focused systems. And what to communicate when are important questions that deserve careful thought. The default cannot be no communication.

The lack of systems that communicate intuition behind algorithms strikes me as a great opportunity. HCI people — make some money.

Idealog: Monetizing Websites using Mechanical Turk

14 Jul

Mechanical Turk, a system for crowd-sourcing complex Human Intelligence Tasks (HITs) by parceling out a large task and putting the parcels on a marketplace, has seen a recent resurgence, partly on the back of the success of easier-to-work-with clones like crowdflower.com.

Given that on the Internet people find it easier to part with labor than with money, one way to monetize websites may be to request that users help pay for the site by doing a task on Mechanical Turk, made easily available as part of the website. Websites can ask users to specify, as part of their profiles, the kinds of tasks they would like to work on.

For example, a news website may also provide its users a choice between watching ads and doing a small number of tasks for every X number of articles that they read. Similarly, a news website may ask its users to proofread a sentence or two of its own articles, thereby reducing costs of its production.

Some Notable Ideas

30 Jun

Education of a senator

While top colleges have been known to buck their admissions criteria for the wrong reasons, they do on average admit smarter students, and the academic training they provide is easily above average. So it may be instructive to know what percentage of Senators, the most elite of the political brass, have graced the hallowed corridors of top academic institutions, and whether the proportion attending top colleges varies by party.

Whereas 5 of the 42 Republican senators have attended top colleges (in the top 20), 22 of the 57 Democratic Senators have done the same. The skew in numbers may be because New England is home to both top schools and a good many Democratic Senators. But privilege accorded by accident of geography is no less consequential than one afforded by some more equitable regime. That is, if people of fundamentally similar caliber sit in the Senate in either party (a belief a cursory viewing of C-SPAN would disabuse one of), then some, due to the happenstance of geography, have been trained better than others.

Discounting elite education, one may simply want to look at the extent of education. There one finds that whereas 73% of the Democratic Senators have an advanced degree, only 64% of Republican Senators do. While, again, it is likely that going to a top school increases admission to good advanced degree programs, thereby perhaps making advanced education more attractive, one can again conjecture that, whatever the reason, some groups of people, perhaps due to mere privilege, have better skills than others.

Why do telephones have numbers?
Unique alphabetical addresses could always be uniquely coded using numbers or translated into signals. However, technological limitations meant that such coding wasn’t widely practiced. This meant people were forced to memorize multiple long strings of numbers, a skill most people never prided themselves on, and when forced to learn it, they took to it like fish to land. Freedom from such bondage would surely be welcomed by many. And now the unsparing hands of technological change, which have made such coding ubiquitous, promise to deliver a comeuppance to telephone numbers. Providing each phone with an alphabetical ID that maps to its IP address would be one smart way of implementing this change.

Misinformation and Agenda Setting
People actively learn from the environment, though they do so inexpertly. Given that an average American consumes over 60 hours of media each week, one prominent place from which they learn is the media. If one were to watch a great many crime shows – merely as a result of the great many crime shows on television – one may wrongly infer that the crime rate has been increasing. A similar, though more potent, example of this may be the 70% of Americans who believe breast cancer is the number one cause of female mortality.

Idealog: Opening Interfaces to Websites and Software

20 Mar

Imagine the following scenario: You go to NYTimes.com, and are offered a choice between a variety of interfaces, not just skins or font adjustments, built for a variety of purposes by whoever wants to put in the effort. You get to pick the interface that is right for you – carries the stories you like, presented in the fashion you prefer. Wouldn’t that be something?

Currently, we are made to choose between weak customization options and building some ad hoc interface using RSS readers. What is missing is an open-sourced or internally produced selection of interfaces that caters to the diverse needs and wants of the public.

We need to separate data from the interface. Applying it to the example at hand – NY Times is in the data business, not the interface business. So like Google Maps, it can make data available, with some stipulations, including monetization requirements, and let others take charge of creatively thinking of ways that data can be presented. If this seems too revolutionary, more middle of the road options exist. NY Times can allow people to build interfaces which are then made available on the New York Times site. For monetization, NY Times can reserve areas of real estate, or charge users for using some ad-free interfaces.
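
A minimal sketch of the separation, with a hard-coded dictionary standing in for a hypothetical JSON endpoint: the publisher exposes stories as structured data, and any number of third-party interfaces decide how to filter and render them.

```python
# Data and interface separated: the feed is structured data; the interface is
# just one of many possible renderers. The feed is a stand-in for a JSON API.
feed = {
    "stories": [
        {"section": "Science", "headline": "New exoplanet found", "words": 900},
        {"section": "Politics", "headline": "Budget vote delayed", "words": 1400},
        {"section": "Science", "headline": "CRISPR trial results", "words": 2100},
    ]
}

def terse_science_interface(data: dict) -> None:
    """One of many possible interfaces: science-only, shortest stories first."""
    stories = [s for s in data["stories"] if s["section"] == "Science"]
    for s in sorted(stories, key=lambda s: s["words"]):
        print(f"- {s['headline']} ({s['words']} words)")

terse_science_interface(feed)
```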

This trick can be replicated across websites and easily extended to software. For example, MS-Excel could have a variety of interfaces, all easily searchable, downloadable, and deployable, that cater to the specific needs of, say, chemical engineers, microbiologists, or programmers. The logic remains the same – MS needn’t be in the interface business, or, more limitedly, doesn’t need to control it completely or inefficiently (for it does allow tedious customization), but can be a platform on which people build, and share, innovative ways to exploit the underlying engine.

An adjacent, broader, and more useful idea is to come up with a rich interface-development toolkit that provides access to processed open data.

Idealog: Internet Panel + Media Monitoring

4 Jan

Media scholars have for long complained about the lack of good measures of media use. Survey self-reports have been shown to be notoriously unreliable, especially for news, where there is significant over-reporting, and without good measures, research lags. The same is true for most research in marketing.

Until recently, the state-of-the-art aggregate media use measures were Nielsen ratings, which put a `meter’ in a few households or asked people to keep a diary of what they saw. In short, the aggregate measures were pretty bad as well. Digital media, which allow for effortless tracking, and the rise of Internet polling, however, for the first time provide an opportunity to create `panels’ of respondents for whom we have near-perfect measures of media use. The proposal is quite simple: create a hybrid of Nielsen on steroids and YouGov/Polimetrix or Knowledge Network kind of recruiting of individuals.

Logistics: Give people free cable and Internet (~ 80/month) in return for 2 hours of their time per month and monitoring of their media consumption. Pay people who already have cable (~100/month) for installing a device and software. Recording channel information is enough for TV, but the Internet equivalent of a channel—the domain—clearly isn’t, as people can self-select within websites. So we only need to monitor the channel for TV but more than that for the Internet.

While the number of devices on which people browse the Internet and watch TV has multiplied, there generally remains only one `pipe’ per house. We can install a monitoring device at the central hub for cable, automatically install software for anyone who connects to the Internet router, or do passive monitoring on the router itself. Monitoring can also be done through applications on mobile devices.

Monetizability: Consumer companies (say, Kellogg’s, Ford), communication researchers, and political hacks (e.g., how many watched campaign ads?) will all pay for it. The crucial (if modest) innovation is the added possibility of surveying people on a broad range of topics, in addition to getting great media use measures.

Addressing privacy concerns:

  1. Limit recording information to certain channels/websites, ones on which customers advertise, etc. This changing list can be made subject to approval by the individual.
  2. Provide for a web-interface where people can look/suppress the data before it is sent out. Of course, reconfirm that all data is anonymous to deter such censoring.

Ensuring privacy may lead to some data censoring, and we can try to prorate the data we get in a couple of ways (see the sketch after the list):

  • Survey people on media use
  • Use Television Rating Points (TRP) by sociodemographics to weight data.
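
A minimal sketch of the second option: weight each panelist by the ratio of their demographic cell’s population share (e.g., taken from TRP sociodemographics) to the cell’s share in the panel. The shares below are made up.

```python
# Simple post-stratification weights: population share of each demographic cell
# divided by the cell's share in the panel. Shares and counts are illustrative.
population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}
panel_counts     = {"18-34": 500,  "35-54": 300,  "55+": 200}

n_panel = sum(panel_counts.values())
weights = {cell: population_share[cell] / (panel_counts[cell] / n_panel)
           for cell in panel_counts}
print(weights)   # under-represented cells (here, 55+) get up-weighted
```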

Marginal value of ‘Breaking News’

12 Mar

Media organizations spend millions of dollars each year trying to arrange the logistics of providing ‘breaking news’. They send overzealous reporters and camera crews to far-off counties (and countries, though not as often in the US), pay for live satellite uplinks, and pay for numerous other little logistical details to make ‘live news’. They do so primarily to compete with and outdo each other, but when queried, they may regale you with the mythical benefits of reporting the death of a soldier in Iraq a few minutes earlier to a chronically bored, apathetic US citizen. The fact is that, for a large range of events, there is little or no value to a citizen in breaking news. Breaking news is provided primarily as a way to introduce drama into the newscast, and it is done in a style that exaggerates the importance of the minuscule and the irrelevant.

The more insidious element of breaking news is that repeated news stories about marginal events, which most breaking news events are – for example, a small bomb blast in Iraq or a murder in some small town in Michigan – provide little or no information to a citizen consumer about the relative gravity of the event or its relative importance. In doing so, they make a citizen consumer think either that all news (and issues) is peripheral or that these minor events are of critical importance. Either way, they do a disservice to society at large.

This doesn’t quite end the laundry list of the deleterious effects of breaking news. The focus on breaking news ensures that most attention is given to an issue when the journalists on the ground typically know the least about it. To take this a step further – oftentimes the ‘sources’ for reporting during the initial few minutes of an event are ‘official sources’. The breaking news format thereby legitimizes the official version of the news, which then gets corrected a week or a month later in the back pages of a newspaper.

While there is little hope that the contagion of ‘breaking news’ will ever stop (and it stands to reason that the web, radio, and television will continue to be afflicted by the malaise), it is possible for people to opt for longer, better-reported articles in good magazines or to learn about an issue or an event through Wikipedia, as Chaste suggested earlier in his column for this site.

Interview with Bill Thompson: Fragmented Information

10 Mar

This is the fourth and concluding part of the interview with BBC technology columnist, Mr. Bill Thompson.

part 1, part 2, part 3

This kind of completes two of the major questions that I had. I would now move on to digital literacy and the fragmented informational landscape. Google has made facts accessible to people – too accessible, some might say. What Google has done is allow people to pick up little facts, disembodied and without the contextual information. It may lead to a consumer who has a very particularistic trajectory of information and opinions. Do you see that as a possibility, or does the fundamentally interlinked nature of the Internet somehow manage to make information accessible in a more complete way? On a related point, do you see that while we are becoming information-rich, we are also simultaneously becoming knowledge-poor?

That is such a big question. In fact, I share your concerns. I think there is a real danger – it’s not even just that there is sort of a surfeit of facts and a lack of knowledge, it’s that the range of facts which we have available to us becomes defined by what is accessible through Google. And as we know, even Google, or any other search engine, only indexes a small portion of the sum of human knowledge, of the sum of what is available. And we see that this effect also becomes self-reinforcing, so that somebody is researching something and they search on Google, find some information, they then reproduce that information and link to its source, and it becomes therefore even more dominant, it becomes more likely to be the thing people will find next time they search, and as a result alternative points of view, more obscure references, the more complex stuff which is harder to simplify and express drop down the Google ranking and essentially then become invisible.

There is much to be said for hard research that takes time, that is careful, that uncovers this sort of deeper information and makes it available to other people. We see in the world of non-fiction publishing, particularly I think with history every year or two we see a radical revisionist biography of some major historical figure based on a close reading of the archives or access to information which was previously unavailable. So all the biographies of Einstein are having to be rewritten at the moment because his letters from the 1950s have just become available and they give us a very different view of the man and particularly of his politics. Now if our view of Einstein was one defined by what Google finds out about Einstein we would know remarkably little. So we need scholars, we need the people who are always going to delve a little more deeply and there is danger in the Google world – it becomes harder to do that and fewer people will even have access to the products of their [careful researcher’s] work because what they write will not itself make it high up the ranking, will not have a sufficient ‘page rank’.

So I actually do think Google and the model of information access which it presents us is one that should be challenged and it should only ever be one part of the system. It is a bit like Wikipedia. I teach a journalism class and I say to my students that Wikipedia may be a good place to start your research but it must never be the place to finish it. Similarly, with Google, anybody who only uses the Google search engine knows too little about the world.

You bring up an important point. Search engine design, and other web usage patterns are increasingly channeling users to a small set of sites with a particular set of knowledge and viewpoints. But hasn’t that always been the case? An epidemiological study of how knowledge has traditionally spread in the world would probably show that at any one time only a small amount of knowledge is available to most people while most other knowledge withers into oblivion. So has Google really fundamentally changed the dynamics?

You are trying to do that to me again and I won’t let you.

This is not a fundamental shift in what it means to be human. None of this is a fundamental shift in what it means to be a human. Things may be faster, we may have more access or whatever, but we have always had these problems and we have always found solutions to them. And I am not a sort of a millennialist about this; I don’t think this is the end of civilization. I think we face short-term issues, and we historically have found a way around them and we will again. Google’s current dominance is a blip, in a sense – it will go, I don’t know how. Ok, here’s a good way in which Google’s dominance could go. At the moment we have worries in the world about H5N1 avian flu mutating into a form which infects humans. Let’s just suppose that this happens and that somebody somewhere writes an obscure academic paper which describes how basically to cure it and how to prevent infection in your household. Well, all the people who rely on Google won’t find this paper and will die, and all the people who go to their library and look up the paper version will live, and therefore the Google world will be over. How about that? There is something, perhaps not quite on that scale, something will happen which will force us to question our dependence on Google, and that would be a good thing. We shouldn’t ever depend on anyone like that.

You know Mr. Thompson, even libraries have sort of shifted. They are increasingly interested in providing Internet access.

Yeah, it is, and it is search rather than structure. And you know, the fact is that search tools make it easy to be lazy, and we are a lazy species, and therefore we will be lazy and we will carry on being lazy until something bad happens because of our laziness, at which point we will mend our ways.

That’s why I had brought up the question of fragmented knowledge earlier. One of my close friends is blind, and he generally has to read through a book to reach the information that he wants. He tends to have a much fuller idea of the context, and the kind of corroboration that he presents is much different from the casual kind of scattered, anecdotal argumentation that others present. Of course, part of that is a function of his being a conscientious arguer, but certainly part of it stems from his not having as many shortcuts to knowledge and actually having a fuller contextual understanding of the topic at hand. The fact is that most users can now parachute in and out of information, and Google has helped make it easier.

I don’t think we see what’s really going on. There is a lot more information and there is a lot more to cope with and this superficial skimming is a very effective strategy. Skim reading is something we know how to do, we teach our children how to do, we value in ourselves and indeed in them, and skim surfing is just as valuable. You know I monitor thirty-forty blogs, news sites and stuff like that and when I am doing it, I don’t look too closely at things. That doesn’t mean that I don’t have the ability or the facility to do something which is a lot deeper and a lot more involved.

I have a fifteen-year-old daughter. She is doing her GCSE exams this year. And I have watched over the last 18 months or so how she has developed her ability to focus, her research skills, her reading around; she is surrounded by a pile of books, she has stopped using the computer as the way to find things quickly because she now needs to know stuff in depth, and she is doing all of that. So I suspect that, observing children from the outside, we see them in a certain way because we only see part of what they do, and we have to look in more detail. It is too easy to get the wrong idea, and actually I am a lot more hopeful about this, having seen it with my daughter, and I think I will start to see it with my son, who is fourteen at the moment. And again, I see his application to the things he cares about and the way he searches. He is a big fan of Oblivion, the Xbox game; his engagement and the depth of his understanding are immense. So we shouldn’t let the fact that we look at some domain of activity where they are purely superficial make us lose sight of other areas where it is not superficial at all, where they have developed exactly those skills we would want them to have.

——–

Bill Thompson’s blog