Bad Service: Missing Functionality in Web Services

30 Sep

A host of cloud services are missing the core functionality needed to build businesses on top of them. Even powerful services on mature platforms, like Google Vision, share a common set of deficiencies—they do not allow clients to send information about preferred latency and throughput (for a price), and they do not allow clients to programmatically define SLAs (again, for a price). (If you read the documentation of Google Vision, there is no mention of how quickly the API will return the answer for a document of a particular size.) One price for ~all requesters is the norm. Not only that, in the era of ‘endless scalable compute,’ throttling is ubiquitous.

There are two separate ideas here. The first is about how to solve one-off needs around throughput and latency. For a range of services, we can easily provide a few options that price in bandwidth and server costs. For a certain volume of requests, the services may require that the customer send a request outlining the need with enough lead time to boot new servers. The second idea is about programmatically signing SLAs. Rather than asking customers to go back and forth with Sales around custom pricing for custom needs, providing a few options for a set of standard use cases may be more expedient.
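To make the idea concrete, here is a minimal sketch of what a programmatic SLA request could look like. The endpoint, field names, and response shape are all hypothetical, invented for illustration; no current vision API exposes anything like this, which is the point.

```python
import requests

# Hypothetical request body: what a client could send to price a standard tier.
SLA_REQUEST = {
    "service": "document-annotation",
    "expected_volume": 2_000_000,     # requests over the contract window
    "max_latency_ms_p99": 800,        # desired 99th-percentile latency
    "min_throughput_rps": 500,        # sustained requests per second
    "lead_time_hours": 48,            # notice given so servers can be provisioned
}

def request_sla_quote(api_base: str, payload: dict) -> dict:
    """Ask the (hypothetical) provider to price a standard SLA tier."""
    resp = requests.post(f"{api_base}/v1/sla/quotes", json=payload, timeout=30)
    resp.raise_for_status()
    # Hypothetical response: {"tier": ..., "price_per_1k_requests": ..., "sla_id": ...}
    return resp.json()

# Usage (against a provider that offered such an endpoint):
# quote = request_sla_quote("https://api.example-vision.com", SLA_REQUEST)
```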

Some low-level services like S3 work almost like that today. But the move to abstract this paradigm out to higher-level services has largely not begun. I believe it is time.

Reducing Friction in Selling Data Products: Protecting IP and Data

15 Sep

Traditionally, software has been distributed as a binary. The customer “grants” the binary a broad set of rights on the machine and expects the application to behave, e.g., not snoop on personal data, not add the computer to a botnet, etc. Most SaaS can be delivered with minor alterations to the above—finer access control and usage logging. Such systems work on trust—the customer trusts that the vendor will do the right thing. It is a fine model but does not work for the long tail. For the long tail, you need a system that grants limited rights to the application and restricts what data can be sent back. This kind of model is increasingly common on mobile OSes but absent on many other “platforms.”

The other big change over time in software has been how much data is sent back to the application maker. In a typical case, the SaaS application is delivered via a REST API, and nearly all the data is posted to the application’s servers. This brings up issues about privacy and security, especially for businesses. Let me give an example. Say there is an app that can summarize documents. And say that a business has a few million documents in a Dropbox folder on which it would like to run this application. Let’s assume that the app is delivered via a REST API, as many SaaS apps are. And let’s assume that the business doesn’t want the application maker to ‘keep’ the data. What’s the recourse? Here are a few options:

  • Trust me. Large vendors like Google can credibly commit to models where they don’t store customer data. To the extent that storing customer data is valuable to the application developer, the application developer can also use price discrimination, providing separate pricing tiers for cases where the data is logged and where it isn’t. For example, see the Google speech-to-text API.
  • Trust but verify. The application developer claims to follow certain policies, but the customer is able to verify, e.g., by auditing access policies and logs. (A weaker version of this model relies on industry associations that ‘certify’ certain data handling standards, e.g., SOC 2.)
  • Trusted third-party. The customer and application developer give some rights to a third party that implements a solution that ensures privacy and protects the application developer’s IP. For instance, AWS provides a model where the customer data and algorithm are copied over to an air-gapped server and the outputs written back to the customer’s disk. 

Of the three options, the last likely reduces friction the most for long-tail applications. But there are two issues. First, such models are unavailable on a wide variety of “platforms,” e.g., Dropbox (or easy integrations with the AWS offering are uncommon). Second, air-gapped copying is but one model. A neutral third party can provide other interesting architectures, including strong port observability and customer-in-the-loop “data emission” auditing.

Back to the Future: Engineering Without Data (Models)

18 Jul

Software engineering has changed dramatically in the last few decades. The rise of AWS, high-level languages, powerful libraries, and frameworks increasingly allows engineers to focus on business logic. Today, software engineers spend much of their time writing code that reasons over data to show something or do something. But how engineering is done has not caught up in some crucial ways:

  1.  Software Development Tools. Most data scientists today work in a notebook on a server where they heavily interact with the data as they refine the code (algorithm). Most engineers still work locally without access to production data. Part of the reason engineers don’t have access to the data is that they work locally—for security and compliance reasons, access to production data from the local machine is banned in most places. Plausibly, a bigger reason is that engineers are stuck in a paradigm where they don’t think access to production data is foundational to faster, higher-quality software development. This belief is reflected in the ad hoc solutions to the problem being tried across the industry, e.g., synthetic data (which is hard to create, maintain, and scale).
  2. Data Modeling. The focus on data modeling has sharply decreased over time in many companies. There are at least four underlying forces behind this trend. First, the volume of data being generated, combined with the rise of cheap blob storage (and the fact that computing power is comparatively far more expensive today), incentivizes the storage of unstructured data. Second, agile development, which prioritizes customer-facing progress over short time units, may cause underinvestment in costly, foundational work (see here). Third, engineering organizations are changing in that the producers of data are no longer seen as its owners. The fourth and last point is perhaps the most crucial—the surfeit of data has led to some magical thinking about the ease with which data can be used to power insights. Outside a small minority of cases, e.g., search, our ability to derive business insights from unstructured and dirty data doesn’t exist. The only thing the surfeit of data has done is widen and deepen the pool of insights that can be delivered. It hasn’t made it any easier to derive those insights, which continue to rely on good old-fashioned manual work to understand the use case and curate and structure the data appropriately. (It also then becomes an opportunity for building software.)

    Engineers pay the price of not investing in data modeling by making the code more complex (and hence, harder to maintain) and by allocating time to fix “bugs.” (The reason I put the word bugs in air quotes is that obvious consequences of a bad system should not be called bugs.)
  3. Data Drift. Machine Learning Engineers (MLEs) obsess about it. Most other engineers haven’t ever heard of the term. Everyone should worry. Technically, the only difference between using ML and conventional engineering for rule creation is that ML auto-creates rules while conventional engineering relies on handcrafting them. Both systems test the efficacy of their rules on the current data. Both systems assume that the data will not drift. Only MLEs monitor the data, thinking hard about what data the rules work for and how to monitor drift (see the sketch below). Other engineers need to sign up.
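Here is a minimal sketch of the kind of monitoring MLEs already do and everyone else could adopt: compare the current window of a numeric input against a reference window with a two-sample Kolmogorov-Smirnov test. The column, thresholds, and data below are illustrative.

```python
# A minimal sketch of data-drift monitoring: compare this week's values of a
# numeric input against a reference window. Threshold choices are illustrative.
import numpy as np
from scipy.stats import ks_2samp

def drift_report(reference: np.ndarray, current: np.ndarray, alpha: float = 0.01) -> dict:
    """Flag a column if its distribution has shifted relative to the reference data."""
    stat, p_value = ks_2samp(reference, current)
    return {
        "ks_statistic": float(stat),
        "p_value": float(p_value),
        "drifted": bool(p_value < alpha),
        "reference_mean": float(np.mean(reference)),
        "current_mean": float(np.mean(current)),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(loc=100, scale=10, size=5_000)   # e.g., historical order values
    cur = rng.normal(loc=108, scale=10, size=1_000)   # this week's values; the mean has shifted
    print(drift_report(ref, cur))
```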

The solutions are as simple as the problems are immense: invest in data quality, data monitoring, and data models. To achieve that, we need to change how organizations are structured, how they are run, and what engineers think the hard problems are.

ML (O)Ops! Keeping Track of Changes (Part 2)

22 Mar

The first part of the series, “Improving and Deploying On-Device Models With Confidence”, is posted here.

With Atul Dhingra

One way to automate classification is to compare new instances to a known list and plug in the majority class of the exact match. For such instance-based learning, you often don’t need to version data; you just need a hash table. When you are not relying on an exact match—most machine learning—you often need to version data to reproduce the behavior.
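A minimal sketch of the exact-match case, where the “model” is just a hash table from instance key to majority class (the keys and labels are toy examples):

```python
# Exact-match, instance-based classification: store the majority class per key.
from collections import Counter, defaultdict

def build_lookup(labeled_instances):
    """labeled_instances: iterable of (key, label). Returns key -> majority label."""
    votes = defaultdict(Counter)
    for key, label in labeled_instances:
        votes[key][label] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

def classify(lookup, key, default=None):
    """Plug in the majority class of the exact match; fall back to a default."""
    return lookup.get(key, default)

if __name__ == "__main__":
    training = [("ACME CORP", "business"), ("ACME CORP", "business"),
                ("JOHN SMITH", "person"), ("ACME CORP", "person")]
    table = build_lookup(training)
    print(classify(table, "ACME CORP"))   # -> "business"
    print(classify(table, "UNSEEN KEY"))  # -> None
```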

Reproducibility is the bedrock of mature software engineering. It is fundamental because it allows you to diagnose issues. You can reproduce the behavior of a ‘version.’ With that power, you can correlate changes in inputs with changes in outputs. Systems that enable reproducibility, like version control, have another vital purpose—reducing the risk stemming from changes and allowing regression testing in systems that depend on data, such as ML. They reduce risk by allowing changes to be rolled back.

To reproduce outputs from machine learning models, we need to do more than store data. We also need to store hyperparameters, details about the OS, programming language, and packages, among other things. But given that the primary value of reproducibility is instrumental—diagnosis—we want not just the ability to reproduce but also the ability to understand changes and correlate them. Current solutions miss the mark.

Current Solutions and Problems

One way to version data is to treat it as a binary blob. Store all the data you learned a model on to a server and store a reference to the data in your repository. If the data changes, store the new version and create a new pointer. One downside of using a git-lfs-like mechanism is that your storage blows up. Another is that build times can be large if the local cache is small or, more generally, if access costs are large. Yet another problem is the lack of a neat interface that allows you to track more than source data.
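For concreteness, here is a minimal sketch of the blob-plus-pointer scheme, assuming a local directory stands in for the remote blob store: hash the data file, copy it into a content-addressed cache, and commit only a small pointer file.

```python
# Blob-plus-pointer versioning: the data lives in a content-addressed cache;
# the repository stores only a small pointer file. Paths are illustrative.
import hashlib
import json
import shutil
from pathlib import Path

CACHE_DIR = Path(".data_cache")  # stand-in for a remote blob store

def version_data(data_path: Path, pointer_path: Path) -> str:
    """Copy the data into the cache under its hash and write a pointer file."""
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    CACHE_DIR.mkdir(exist_ok=True)
    shutil.copy2(data_path, CACHE_DIR / digest)
    pointer_path.write_text(json.dumps({"sha256": digest, "source": data_path.name}))
    return digest

def checkout_data(pointer_path: Path, dest: Path) -> None:
    """Materialize the data referenced by a pointer file."""
    digest = json.loads(pointer_path.read_text())["sha256"]
    shutil.copy2(CACHE_DIR / digest, dest)
```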

DVC purports to solve all three problems. It solves the first by providing a way to not treat the data as a blob. For instance, in a computer vision workflow, the source data is image files with some elementary tags—labels, assignments to train and test, etc. The differences between data versions are 1) changes in images (additions mostly) and 2) changes in the mapping to labels and assignments. DVC allows you to store the differences in corpora of images as a list of additional hashes to files. DVC is silent on the second point—efficient storage of changes in mappings. We come to it later. DVC purports to solve the second problem by allowing you to save to local or cloud storage. But it can still be time-consuming to download data from cloud storage buckets. The reason is as follows. Each time you want to work on an experiment, you need to clone the entire cache to check out the appropriate files. And if not handled properly, the cloning time often significantly exceeds typical training times. Worse, any optimizations you make to alleviate these cache-download times lock you into a cloud provider. DVC purports to solve the last problem by using YAML, tags, etc. But anarchy prevails.

Future Solutions

Interpretable Changes

One of the big problems with data versioning is that the diffs are not human-readable, much less comprehensible. The diffs are usually very long, and the changes in the diff are hashes, which means that to review an MR/PR/Diff, the reviewer has to check out the change and pull the data with the updated hashes. The process can be easily improved by adding an extra layer that auto-summarizes the changes into a human-readable form. We can, of course, easily do more. We can provide ways to understand how changes to inputs correlate with changes in outputs.
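A minimal sketch of such a layer, assuming the data version is described by a manifest mapping each file to its hash and label: summarize additions, removals, and label changes instead of showing a wall of hashes.

```python
# Auto-summarize a hash-based data diff into a human-readable form.
# Manifests map file -> {"hash": ..., "label": ...}; the examples are toys.
def summarize_diff(old: dict, new: dict) -> str:
    added = set(new) - set(old)
    removed = set(old) - set(new)
    common = set(old) & set(new)
    relabeled = [f for f in common if old[f]["label"] != new[f]["label"]]
    modified = [f for f in common
                if old[f]["hash"] != new[f]["hash"] and f not in relabeled]
    return (f"{len(added)} files added, {len(removed)} removed, "
            f"{len(relabeled)} relabeled, {len(modified)} contents changed.\n"
            f"Relabeled examples: {relabeled[:5]}")

if __name__ == "__main__":
    old = {"img_001.png": {"hash": "aa11", "label": "cat"},
           "img_002.png": {"hash": "bb22", "label": "dog"}}
    new = {"img_001.png": {"hash": "aa11", "label": "dog"},
           "img_003.png": {"hash": "cc33", "label": "cat"}}
    print(summarize_diff(old, new))
```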

Diff. Tables

The standard method of treating data as an opaque blob seems uniquely bad. For conventional rectangular databases, changes can be understood as changes in functional transformations of core data lake tables. For instance, say we store the label assignments of images in a table. And say we revise the labels of 100 images. (The core data lake tables are immutable, so the changes are executed in the downstream tables.) One conventional way of storing the changes is to record them in a separate table. Another is to write an update statement that is run whenever “the v2” table is generated. This means the differences across data versions are now tied to a data transformation computation graph. When data transformation is inexpensive, we can delay running the transformations till the table is requested. In other cases, we can cache the tables.
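A minimal sketch of the changes-table idea using pandas; the tables and column names are illustrative. The base table stays immutable, and the “v2” table is a function of the base table and the recorded changes.

```python
# The base label-assignment table is immutable; revisions live in a separate
# changes table, and v2 is produced by a transformation over the two.
import pandas as pd

base = pd.DataFrame({
    "image_id": ["img_001", "img_002", "img_003"],
    "label": ["cat", "dog", "cat"],
})

changes = pd.DataFrame({
    "image_id": ["img_002"],
    "label": ["wolf"],             # revised label
    "changed_at": ["2021-03-22"],
})

def build_v2(base_table: pd.DataFrame, changes_table: pd.DataFrame) -> pd.DataFrame:
    """Apply the latest recorded change for each image on top of the base table."""
    latest = (changes_table.sort_values("changed_at")
              .drop_duplicates("image_id", keep="last")
              .set_index("image_id")["label"])
    v2 = base_table.set_index("image_id").copy()
    v2.loc[latest.index, "label"] = latest
    return v2.reset_index()

print(build_v2(base, changes))
```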

ML (O)Ops! Improving and Deploying On-Device Models With Confidence (Part 1)

21 Feb

With Atul Dhingra.

Part 1 of a multi-part series.

It is well known that ML Engineers today spend most of their time doing things that do not have a lot to do with machine learning. They spend time working on technically unsophisticated but important things like deployment of models, keeping track of experiments, etc.—operations. Atul and I dive into the reasons behind the status quo and propose solutions, starting with issues to do with on-device deployments. 

Performance on Device

The deployment of on-device models is complicated by the fact that the infrastructure used for training is different from what is used for production. This leads to many tedious rollbacks. 

The underlying problem is missing data. We are missing data on prediction latency, which is a function of I/O latency and the time taken to compute. One way to impute the missing data is to build a model that predicts latency based on various features of the deployed model. Given that many companies have gone through thousands of deployments and rollbacks, there is rich data to learn from. Another is to directly measure the time with ‘shadow deployments’—performance on redundant chips colocated with the production chip and getting exactly the same data at about the same time (a small lag in passing on the data to the redundant chips is just fine, as we can start the clock at a different time).
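A minimal sketch of the first approach, learning latency from past deployments; the features (model size, batch size, chip) and the synthetic data are purely illustrative stand-ins for what a company’s deployment logs would contain.

```python
# Learn prediction latency from (synthetic) past deployments and report error.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

deployments = pd.DataFrame({
    "model_size_mb": np.random.uniform(1, 200, 500),
    "batch_size": np.random.choice([1, 4, 8, 16], 500),
    "chip": np.random.choice(["cpu_a", "npu_b", "gpu_c"], 500),
})
# Synthetic target for illustration only; real data would come from logs.
deployments["latency_ms"] = (
    0.4 * deployments["model_size_mb"]
    + 2.0 * deployments["batch_size"]
    + np.random.normal(0, 5, len(deployments))
)

X = pd.get_dummies(deployments.drop(columns="latency_ms"), columns=["chip"])
y = deployments["latency_ms"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("MAE (ms):", mean_absolute_error(y_test, model.predict(X_test)))
```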

Predicting latency given a model and deployment architecture solves the problem of deploying reliably. It doesn’t solve the problem of how to improve the performance of the system given a model. To improve the production performance of ML systems, companies need to analyze the data, e.g., compute the correlation between load on the edge server and latency, and generate additional data by experimenting with various easily modifiable parts of the system, e.g., increasing capacity of the edge server, etc. (If you are a cloud service provider like AWS, you can learn from all the combinations of infrastructure that exist to predict latency for various architectures given a model and propose solutions to the customer.)

There is plausibly also a need for a service that helps companies decide which chip is optimal for deployment. One solution to the problem is MLPerf.org as a service—a service that provides data on the latency of a model on different chips.

What’s the Next Best Thing to Learn?

10 Oct

With Gaurav Gandhi

Recommendation engines are everywhere. These systems recommend what shows to watch on Netflix and what products to buy on Amazon. Since at least the Netflix Prize, the conventional wisdom has been that recommendation engines have become very good. Except that they are not. Some of the deficiencies are deliberate. Netflix has made a huge bet on its shows, and it makes every effort to highlight its Originals over everything else. Some other deficiencies are a result of a lack of content. The fact is easily proved. How often have you done a futile extended search for something “good” to watch?

Take the above two explanations out, and still, the quality of recommendations is poor. For instance, YouTube struggles to recommend high-quality, relevant videos on machine learning. It fails on relevance because it recommends videos that are either too difficult or too easy. And it fails on quality—the opaqueness of the explanations makes most points hard to understand. When I look back, most of the high-quality content on machine learning that I have come across is a result of subscribing to the right channels—human curation. Take another painful aspect of most recommendations: a narrow understanding of our interests. You watch a few food travel shows, and YouTube will recommend twenty others.

Problems

What is the next best thing to learn? It is an important question to ask. To answer it, we need to know the objective function. But the objective function is too hard to formalize and harder yet to estimate. Is it money we want, or is it knowledge, or is it happiness? Say we decide it’s money. For most people, after accounting for risk, the answer may be: learn to program. But what would the equilibrium effects be if everyone did that? Not great. So we ask a simpler question: what is the next reasonable unit of information to learn?

Meno’s paradox states that we cannot be curious about something that we know. Pair it with a Rumsfeld-ism: we cannot be curious about things we don’t know we don’t know. The domain of things we can be curious about is hence things we know that we don’t know. For instance, I know that I don’t know enough about dark matter. But the complete set of things we could learn also includes things we don’t know we don’t know.

The set of things that we don’t know is very large. But that is not the set of information we can learn next. The set of relevant information is described by the frontier of our knowledge. The unit of information we are ready to consume is not a random unit from the set of things we don’t know but one from the set of things we can learn given what we know. As I note above, a bunch of the ML lectures that YouTube recommends are beyond me.

There is a further constraint on ‘relevance.’ Of the relevant set, we are only curious about things we are interested in. But it is an open question how we entice people to learn about things that they will find interesting. It is the same challenge Netflix faces when trying to educate people about movies they haven’t heard of or seen.

Conditional on knowing the next best substantive unit, we care about quality.  People want content that will help them learn what they are interested in most efficiently. So we need to solve for the best source to learn the information.

Solutions

Known-Known

For things we know, the big question is how to optimally retain them. It may be through flashcards or what have you.

Exploring the Unknown

  1. Learn What a Person Knows (Doesn’t Know). The first step in learning the set of information a person doesn’t know is to learn what the person knows. The best way to learn what a person knows is to build a system that monitors all the information we consume on the Internet.
  2. Classify. Next, use ML to split the information into broad areas.
  3. Estimate the Frontier of Knowledge. To establish the frontier of knowledge, we need to put what people know on a scale. We can derive that scale by exploiting syllabi and class structure (101, 102, etc.) and their associated content and then scaling all the content (YouTube videos, books, etc.) by estimating similarity with the relevant level of content (see the sketch after this list). (We can plausibly also establish a scale by following the paths people follow–videos that they start but don’t finish are good indications of being too easy or too hard, for instance.)

    We can also use tools like quizzes to establish the frontier, but the quizzes will need to be built from a system that understands how information is stacked.
  4. Estimate Quality of Content. Rank content within each topic and each level by quality. Infer quality through both explicit and implicit measures. Use this to build the relevant set.
  5. Recommend from the Relevant Set. Recommend a wide variety of content from the relevant set.
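Here is a minimal sketch of step 3, assuming we have reference material already tagged with levels (101, 102, …): score new content by TF-IDF similarity to each level and assign the level of the closest match. The reference snippets are toy examples.

```python
# Scale new content by similarity to level-tagged reference material.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Reference material already tagged with a level (toy snippets).
reference_docs = {
    "ml_101": "what is a feature, training data, fitting a line, overfitting basics",
    "ml_102": "regularization, cross validation, bias variance tradeoff, logistic regression",
    "ml_201": "backpropagation, convolutional networks, stochastic gradient descent",
}

def estimate_level(new_text: str) -> str:
    """Assign the level of the most similar reference document."""
    levels = list(reference_docs)
    docs = list(reference_docs.values())
    vectorizer = TfidfVectorizer().fit(docs + [new_text])
    sims = cosine_similarity(vectorizer.transform([new_text]), vectorizer.transform(docs))[0]
    return levels[int(sims.argmax())]

print(estimate_level("a gentle introduction to overfitting and training data"))  # likely ml_101
```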

Everywhere: Meeting Consumers Where They Are

1 Sep

Content delivery is not optimized for the technical stack used by an overwhelming majority of people. The technical stack of people who aren’t particularly tech-savvy, especially those who are older (over ~60 years), is often a messaging application like FB Messenger or WhatsApp. They currently do not have a way to ‘subscribe’ to Substack newsletters or podcasts or YouTube videos in the messaging application that they use (see below for an illustration of how this may look on the iPhone messaging app). They miss content. And content producers have an audience hole.

[Illustration of subscribing to content within a messaging app. Credit: Gaurav Gandhi]

A lot of content is distributed only via email or within a specific application. There are good strategic reasons for that—you get to monitor consumption, recommend accordingly, control monetization, etc. But the reason why platforms like Substack, which enable independent content producers, limit distribution to email is not as immediately clear. It is unlikely to be a deliberate decision. It is more likely a consequence of the lack of infrastructure connecting publishing to various messaging platforms. The future of messaging platforms is Slack—a platform that integrates as many applications as possible. As WhatsApp rolls out its business API, there is a potential to build an integration that allows producers to deliver premium content, leverage other kinds of monetization, like ads, and even build a recommendation stack. Eventually, it would be great to build that kind of integration for each of the messaging platforms, including iMessage, FB Messenger, etc.

Let me end by noting that there is something special about WhatsApp. No one has replicated the mobile-phone-based messaging platform. And the idea of enabling a larger stack based on phone numbers remains unplumbed. Duo and FaceTime are great examples, but there is potential for so much more. For instance, a calendar app that runs on the mobile phone ID architecture.

Growth Funnels

24 Sep

You spend a ton of time building a product to solve a particular problem. You launch. Then, either the kinds of people whose problem you are solving never arrive or they arrive and then leave. You want to know why because if you know the reason, you can work to address the underlying causes.

Often, however, businesses only have access to observational data. And since we can’t answer the why with observational data, people have swapped the why with the where. Answering the where can give us a window into the why (though only a window). And that can be useful.

The traditional way of posing the ‘where’ is called a funnel. Conventionally, funnels start when the customer arrives on the website. We will forgo convention and start at the top.

Conditional on the product you have built, there are only three things you should work to do optimally:

  1. Help people discover your product
  2. Effectively convey the relevant value of the product to those who have discovered your product
  3. Help people effectively use the product

p.s. When the product changes, you have to do all three all over again.

One way funnels can potentially help is triage. How big is the ‘leak’ at each ‘stage’? The funnel at the top, spanning the first two steps, is: of the people who discovered the product, how many did we successfully communicate the relevant value of the product to? Posing the problem in such a way makes it seem more powerful than it is. To come up with a number, you generally only have noisy behavioral signals. For instance, the proportion of people who visit the site who sign up. But low proportions could be a result of large denominators—lots of people are visiting the site, but the product is not relevant for most of them. (If bringing people to the site costs nothing, there is nothing to do.) Or it could be because you have a super kludgy sign-up process. You could drill down to try to get at such clues, but the number of potential locations for drilling remains large.

That brings us to the next point. Macro-triaging is useful, but only for resource-allocation kinds of decisions, and only under some assumptions. It doesn’t provide a way to get concrete answers to real issues. For concrete answers, you need funnels around concrete workflows. For instance, for a referral ‘product’ for Airbnb, one way to define the steps (based on Gustaf Alstromer) is as follows:

  1. how many people saw the link asking people to refer
  2. of the people who saw the link, how many clicked on it
  3. of the people who clicked on it, how many invited people
  4. of the people who were invited, how many signed up (as a user, guest, host)
  5. of the people who signed up, how many made the first booking

Such a funnel allows you to optimize the workflow by experimenting with the user interface at each stage (see the sketch below). It allows you to analyze what users were trying to do but failed to do, or took a long time doing. It also allows you to analyze how optimally (from the company’s perspective) users are taking an action. For instance, people may want to invite all their friends, but the UI doesn’t have a convenient way to import contacts from email.
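A minimal sketch of computing step-wise conversion for such a funnel; the counts are made up for illustration.

```python
# Step-wise conversion for a workflow funnel like the referral funnel above.
funnel = [
    ("saw referral link", 100_000),
    ("clicked link", 18_000),
    ("sent invites", 6_000),
    ("invitee signed up", 2_400),
    ("invitee made first booking", 600),
]

def conversion_rates(steps):
    """Return (step, count, conversion from the previous step) for each stage."""
    rows = []
    for i, (name, count) in enumerate(steps):
        prev = steps[i - 1][1] if i else count
        rows.append((name, count, count / prev))
    return rows

for name, count, rate in conversion_rates(funnel):
    print(f"{name:30s} {count:8d}   {rate:6.1%} of previous step")
```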

Sometimes just writing out the steps in a workflow can be useful. It allows people to think about whether some steps are actually needed. For instance, is signing in needed before we allow a user to understand the value of the product?

Quality Data: Plumbing ML Data Pipelines

6 Aug

What’s the difference between a scientist and a data scientist? Scientists often collect their own data, and data scientists often use data collected by other people. That is part jest but speaks to an important point. Good scientists know their data. Good data scientists must know their data too. To help data scientists learn about the data they use, we need to build systems that give them good data about the data. But what is good data about the data? And how do we build systems that deliver that? Here’s some advice (tailored toward rectangular data for convenience):

  • From Where, How Much, and Such
    • Provenance: how were each of the columns in the data created (obtained)? If the data are derivative, find out the provenance of the original data. Be as concrete as possible, linking to scripts, related teams, and such.
    • How frequently are the data updated?
    • Cost per unit of data, e.g., a cell in rectangular data.
    Both the frequency with which data are updated and the cost per unit of data may change over time. Provenance may change as well: a new team (person) may start managing the data. So the person who ‘owns’ the data must come back to these questions every so often. Come up with a plan.
  • What? To know what the data mean, you need a data dictionary. A data dictionary explains the key characteristics of the data. It includes:
    1. Information about each of the columns in plain language.
    2. How were the data collected? For instance, if you conducted a survey, you need the question text and the response options (if any) that were offered, along with the ‘mode’, where in the sequence of questions it appeared, whether it was alone on the screen, etc.
    3. Data type
    4. How (if at all) are missing values generated?
    5. For integer columns, it gives the range, sd, mean, median, n_0s, and n_missing. For categorical, it gives the number of unique values, what each label means, and a frequency table that includes n_missing (if missing can be of multiple types, show a row for each).
    6. The number of duplicates in the data, whether they are allowed, and a reason why you would see them.
    7. Number of rows and columns
    8. Sampling
    9. For supervised models, store correlation of y with key x_vars
  • What If? What if you have a question? Who should you bug? Who ‘owns’ the ‘column’ of data?

Store these data in JSON so that you can use this information to validate against. Produce the JSON for each update. You can then flag when data are some s.d. above or below the last ingest (see the sketch below).
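A minimal sketch of that flow, assuming rectangular data in pandas: summarize each column into a dictionary you can serialize to JSON, and flag numeric columns whose mean moved more than k standard deviations from the previous ingest.

```python
# Summarize each column per ingest and flag large shifts against the last ingest.
import json
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    out = {}
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            out[col] = {"type": "numeric", "mean": float(s.mean()), "sd": float(s.std()),
                        "min": float(s.min()), "max": float(s.max()),
                        "n_missing": int(s.isna().sum())}
        else:
            out[col] = {"type": "categorical", "n_unique": int(s.nunique()),
                        "n_missing": int(s.isna().sum()),
                        "top_values": s.value_counts().head(5).to_dict()}
    return out

def flag_drift(previous: dict, current: dict, k: float = 3.0) -> list:
    """Columns whose mean moved more than k s.d. from the previous ingest."""
    flagged = []
    for col, stats in current.items():
        prev = previous.get(col)
        if prev and stats["type"] == "numeric" and prev.get("sd", 0) > 0:
            if abs(stats["mean"] - prev["mean"]) > k * prev["sd"]:
                flagged.append(col)
    return flagged

# Usage: summary = summarize(df); json.dump(summary, open("ingest_2021_08_06.json", "w"))
```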

Store all this metadata with the data. For example, you can extend the DataFrame class in Scala to make it so.

Auto-generate reports in markdown with each ingest.

In many ML applications, you are also ingesting data back from the user. So you need the same as above for the data you are getting from the user (and some of it at least needs to match the stored data). 

For any derived data, you need the scripts and the logic, ideally in a notebook. This is your translation function.

Where possible, follow the third normal form of databases. Only store translations when translation is expensive. Even then, think twice.

Lastly, some quality control. Periodically sit down with your team to see if you should see what you are seeing. For instance, if you are in the survey business, do the completion times make sense? If you are doing supervised learning, get a random sample of labels. Assess their quality. You can also assess the quality by looking at errors in classification that your supervised model makes. Are the errors because the data are mislabeled? Keep iterating. Keep improving. And keep cataloging those improvements. You should be able to ‘diff’ data collection, not just numerical summaries of data. And with the method I highlight above, you should be.

Automating Understanding, Not Just ML

27 Jun

Some of the most complex parts of Machine Learning are largely automated. The modal ML person types in simple commands for very complex operations and voila! Some companies, like Microsoft (Azure) and DataRobot, also provide a UI for this. And this has generally not turned out well. Why? Because this kind of system does too little for the modal ML person and expects too much from the rest. So the modal ML person doesn’t use it. And the people who do use it, generally use it badly. The black box remains the black box. But not much is needed to place a lamp in this black box. Really, just two things are needed:

1. A data summarization and visualization engine, preferably with some chatbot feature that guides people smartly through the key points, including the problems. For instance, start with univariate summaries, highlighting ranges, missing data, sparsity, and such. Then, if it is a supervised problem, give people a bunch of loess plots or explain the ‘best fitting’ parametric approximations of y in plain English, such as, “people who eat 1 more cookie live 5 minutes shorter on average.” (A sketch of this follows the list.)

2. An explanation engine, including what the explanations of observational predictions mean. We already have reasonable implementations of this.
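A minimal sketch of the plain-English piece from point 1, using a toy dataset and a simple linear approximation; the variables and numbers are made up.

```python
# Fit a simple linear approximation and state the slope as a sentence.
import numpy as np

rng = np.random.default_rng(1)
cookies_per_day = rng.poisson(2, size=1_000)
minutes_lived = 1_000_000 - 5 * cookies_per_day + rng.normal(0, 50, size=1_000)

# np.polyfit with deg=1 returns [slope, intercept].
slope, intercept = np.polyfit(cookies_per_day, minutes_lived, deg=1)
direction = "fewer" if slope < 0 else "more"
print(f"On average, eating 1 more cookie per day is associated with "
      f"{abs(slope):.1f} {direction} minutes lived.")
```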

When you have both, you have automated complexity thoughtfully, in a way that empowers people, rather than creating a system that enables people to do fancy things badly.

About 85% Problematic: The Trouble With Human-In-The-Loop ML Systems

26 Apr

MIT researchers recently unveiled a system that combines machine learning with input from users to ‘predict 85% of the attacks.’ Each day, the system winnows down millions of rows to a few hundred atypical data points and passes these points on to ‘human experts’ who then label the few hundred data points. The system then uses the labels to refine the algorithm.

At first blush, using data from users in such a way to refine the algorithm seems like the right thing to do, even the obvious thing to do. And there exist a variety of systems that do precisely this. In the context of cyber data (and a broad category of similar such data), however, it may not be the right thing to do. There are two big reasons for that. First, a low false positive rate can be much more easily achieved if we do not care about the false negative rate, and there are good reasons to worry a lot about false negative rates in cyber data. Second, and perhaps more importantly, incorporating user input on complex tasks (or where data is insufficiently rich) reduces to the following: given a complex task with inadequate time, the users fall back on cheap heuristics to label the data, and the supervised aspect of the algorithm reduces to learning the cheap heuristics that humans use.
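For concreteness, here is a minimal sketch of the loop described above: score rows with an unsupervised detector, pass the few hundred most atypical ones to an “expert,” and fold the labels into a supervised model. The data are toys, and the stand-in expert is a cheap heuristic, which is incidentally the failure mode this piece worries about.

```python
# Winnow with an unsupervised detector, get "expert" labels, refine a classifier.
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))            # stand-in for a day of transactions

# 1. Winnow: rank everything by atypicality, keep the most atypical few hundred.
scores = IsolationForest(random_state=0).fit(X).score_samples(X)
candidate_idx = np.argsort(scores)[:300]

# 2. "Expert" labels the candidates (here, a cheap stand-in heuristic).
expert_labels = (np.abs(X[candidate_idx, 0]) > 1).astype(int)

# 3. Refine: train a supervised model on the expert-labeled slice and rescore everything.
clf = RandomForestClassifier(random_state=0).fit(X[candidate_idx], expert_labels)
attack_probability = clf.predict_proba(X)[:, 1]
print("rows flagged above 0.9:", int((attack_probability > 0.9).sum()))
```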

Clarifai(ng) the Future of Clarifai

31 Dec

Clarifai is a promising AI start-up. In a short(ish) time, it has made major progress on an important problem. And it is rapidly rolling out products with lots of business potential. But there are still some things that it could do.

As I understand it, the base version of the Clarifai API is trying to do two things at once: a) learn various recognizable patterns in images, and b) rank the patterns based on ‘appropriateness’ and probab_true. I think Clarifai will have to split these two things over time and allow people to input what abstract dimensions are ‘appropriate’ for them. As the idiom goes, an image is worth a thousand words. In an image, there can be information about social class, race, and country, but also shapes, patterns, colors, perspective, depth, time of day, etc. And Clarifai should allow people to pick dimensions appropriate for the task. Defining dimensions would be hard, though. But that shouldn’t stymie the effort. And ad hoc advances may be useful. For instance, one dimension could be abstract shapes and colors. Another could be the more ‘human’ dimension, etc.

Extending the logic, Clarifai should support the building of abstract data science applications that solve a particular problem. For instance, say a user is only interested in learning whether a photo features a man or a woman. And the user wants to build a Clarifai-based classifier. (That person is me. The task is inferring the gender of first names. See here.) Clarifai could in principle allow the user to train a classifier that uses all the other information in the images, including jewelry, color, perspective, etc., and provide an out-of-sample error for that particular task. The crucial point is allowing users fuller access to what Clarifai can do and then letting the users manage it at their end. To that end again, input about user objectives needs to be built into the API. Basic hooks could be developed for classification and clustering inputs.

More generally, Clarifai should eventually support more user inputs and a greater variety of outputs. Limiting the product to tagging is a mistake.

There are three other general directions for Clarifai to go in. A product that automatically sections an image into multiple images and tags each section would be useful. This would allow, for instance, counting the number of women in a photo. Another direction would be to provide the ‘best’ set of tags that collectively describe a set of images. (It may seem to violate the spirit of what I noted above, but it needn’t — a user could want just this.) By the same token, Clarifai could build general-purpose discrimination engines — a list of tags that best distinguishes image(s).

Beyond this, the obvious. Clarifai can also provide synonyms of tags to make tags easier to use. And it could allow users to specify if they want, say, tags in ‘UK English,’ etc.

I Recommend It: A Recommender System For Scholarship Discovery

1 Oct

The rate of production of scholarship has never been higher. And while our ability to discover relevant scholarship per unit of time has not kept pace with the production of knowledge, it has also risen sharply—most recently, due to Google Scholar.

The efficiency of discovering relevant scholarship, however, has plateaued over the last few years, even as the rate of production of scholarship has kept its steady upward climb. Part of the reason the growth has plateaued is that current ways of doing things cannot be made considerably more efficient very quickly. New growth in the rate of discovery will require knowledge discovery systems to get more data on the user’s needs, and access to structured databases of academic production.

The easiest next step would perhaps be to build a collaborative recommendation system. In particular, think of a system that takes your .bib file, trawls a large database of citation lists, for example, JSTOR or PLoS, and produces recommendations for scholarship you may have missed. The logic of a collaborative recommender system is pretty simple: learn from articles that have cited similar scholarship to yours. If we have metadata on the scholarship, for instance, sub-field, the actual text of an article, or even the abstract, we could also recommend based on the extent to which two articles cite the same kind of scholarly article. Direct elicitations (search terms are but one form) from the user can also be added to guide the recommendations. And meta-characteristics, for instance, the page rank of a piece of scholarship, could be used to order recommendations.
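A minimal sketch of that collaborative logic, assuming we have each article’s citation list: represent articles as binary vectors over cited works, find the articles most similar to your .bib, and recommend what they cite that you don’t. The corpus here is a toy example.

```python
# Collaborative recommendation from citation lists: recommend what your
# nearest neighbors (by shared citations) cite that you do not.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus_citations = {
    "paper_A": ["smith2001", "jones2005", "li2010"],
    "paper_B": ["smith2001", "li2010", "khan2012"],
    "paper_C": ["brown1999", "garcia2003"],
}
my_bib = ["smith2001", "li2010"]

def recommend(my_refs, corpus, top_k=2, n_neighbors=2):
    docs = [" ".join(refs) for refs in corpus.values()]
    vectorizer = CountVectorizer(binary=True).fit(docs + [" ".join(my_refs)])
    sims = cosine_similarity(vectorizer.transform([" ".join(my_refs)]),
                             vectorizer.transform(docs))[0]
    neighbors = sims.argsort()[::-1][:n_neighbors]
    scores = {}
    for idx in neighbors:
        for ref in list(corpus.values())[idx]:
            if ref not in my_refs:
                scores[ref] = scores.get(ref, 0) + sims[idx]
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(recommend(my_bib, corpus_citations))  # e.g., ['jones2005', 'khan2012']
```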

Beyond Anomaly Detection: Supervised Learning from ‘Bad’ Transactions

20 Sep

Nearly every time you connect to the Internet, multiple servers log a bunch of details about the request. For instance, details about the connecting IPs, the protocol being used, etc. (One popular software for collecting such data is Cisco’s NetFlow.) Lots of companies analyze this data in an attempt to flag ‘anomalous’ transactions. Given the data are low quality—IPs do not uniquely map to people, information per transaction is vanishingly small—the chances of building a useful anomaly detection algorithm using conventional unsupervised methods are extremely low.

One way to solve the problem is to re-express it as a supervised problem. Google, various security firms, security researchers, etc., flag a bunch of IPs every day for various nefarious activities, including hosting malware (passive DNS), scanning, and other active attacks. Check to see if these IPs are in the database, and learn from the transactions that include the blacklisted IPs. Using the model, flag transactions that look most similar to the transactions with blacklisted IPs. And validate the worthiness of the flagged transactions with the highest probability of involving a malicious IP by checking whether the IPs are blacklisted at a future date or by using a subject matter expert.
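A minimal sketch of the supervised re-expression; the flow fields, the blacklist, and the data are illustrative stand-ins for NetFlow-style records.

```python
# Label flows by whether an endpoint IP is blacklisted, train on flow features,
# and rank the remaining flows for review.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

blacklist = {"203.0.113.7", "198.51.100.23"}

rng = np.random.default_rng(0)
flows = pd.DataFrame({
    "dst_ip": rng.choice(list(blacklist) + ["192.0.2.1", "192.0.2.2", "192.0.2.3"], 5_000),
    "bytes": rng.lognormal(8, 1, 5_000),
    "duration_s": rng.exponential(10, 5_000),
    "dst_port": rng.choice([80, 443, 22, 6667], 5_000),
})
flows["label"] = flows["dst_ip"].isin(blacklist).astype(int)

features = pd.get_dummies(flows[["bytes", "duration_s", "dst_port"]], columns=["dst_port"])
model = GradientBoostingClassifier(random_state=0).fit(features, flows["label"])

# Score flows whose IPs are not (yet) blacklisted; review the highest-scoring ones.
unlabeled = flows["label"] == 0
suspicion = model.predict_proba(features[unlabeled])[:, 1]
print(flows[unlabeled].assign(suspicion=suspicion).nlargest(5, "suspicion"))
```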

Some Aspects of Online Learning Re-imagined

3 Sep

Trevor Hastie is a superb teacher. He excels at making complex things simple. Till recently, his lectures were only available to those fortunate enough to be at Stanford. But now, he teaches a free introductory course on statistical learning via the Stanford online learning platform. (I browsed through the online lectures — they are superb.)

Online learning, pregnant with rich possibilities, however, has met its cynics and its first failures. The proportion of students who drop off after signing up for these courses is astonishing (but so are the sign-up rates). Some preliminary data suggest that, compared to some face-to-face teaching, students enrolled in online learning don’t perform as well. But these are early days, and much can be done to refine the systems and promote the interests of thousands upon thousands of teachers who have much to contribute to the world. In that spirit, I offer some suggestions.

Currently, Coursera, edX, and similar such ventures mostly attempt to replicate an off-line course online. The model is reasonable but doesn’t do justice to the unique opportunities of teaching online. 

The Online Learning Platform: For Coursera, etc., to be true learning platforms, they need to provide rich APIs that allow the development of both learning and teaching tools (and easy ways for developers to monetize these applications). For instance, professors could get access to visualization applications, and students could get access to applications that put them in touch with peers interested in peer-to-peer teaching. The possibilities are endless. Integration with other social networks and other communication tools, at the discretion of students, is an obvious future step.

Learning from Peers: The single teacher model is quaint. It locks out contributions from other teachers. And it locks out teachers from potential revenues. We could turn it into a win-win. Scout ideas and materials from teachers and pay them for their contributions, ideally as a share of the revenue — so that they have a steady income.

The Online Experimentation Platform: So many researchers in education and psychology are curious and willing to contribute to this broad agenda of online learning. But no protocols and platforms have been established to exploit this willingness. The online learning platforms could release more data at one end. And at the other end, they could create a platform for researchers to option ideas that the community and company can vet. Winning ideas can form the basis of research that is implemented on the platform.

Beyond these ideas, there are ideas to do with courses that enhance students’ ability to succeed in online courses. Providing students ‘life skills’ either face-to-face or online, teaching them ways to become more independent, etc., can allow them to better adapt to the opportunities and challenges of learning online.

Toward a Two-Sided Market on Uber

17 Jun

Uber prices rides based on the availability of Uber drivers in the area, the demand, and the destination. This price is the same for whoever is on the other end of the transaction. For instance, Uber doesn’t take into account that someone in a hurry may be willing to pay more for a quicker service. By the same token, Uber doesn’t allow drivers to distinguish themselves on price (Airbnb, for instance, allows this). It leads to a simpler interface but it produces some inefficiency—some needs go unmet, some drivers go underutilized, etc. It may make sense to try out a system that allows for bidding on both sides.

Enabling FOIA: Getting Better Access to Government Data at a Lower Cost

28 May

The Freedom of Information Act is vital for holding the government to account. But little effort has gone into building tools that enable the fulfillment of FOIA requests. As a consequence of this underinvestment, fulfilling FOIA requests often means onerous, costly work for government agencies and long delays for the requesters. Here, I discuss a couple of solutions. One tool that can prove particularly effective in lowering the cost of fulfilling FOIA requests is an anonymizer—a tool that detects proper nouns, addresses, telephone numbers, etc., and blurs them. This is easily achieved using modern machine learning methods. To ensure 100% accuracy, humans can quickly vet the suggestions made by the algorithm along with ‘suspect words.’ Or a CAPTCHA-like system that asks people in developing countries to label suspect words as proper nouns, etc., can be created to further reduce costs. This covers one conventional solution. Another way to solve the problem would be to create a sandbox architecture. Rather than giving requesters stacks of redacted documents, often unnecessary, one can give people the right to run certain queries on the data. Results of these queries can be vetted internally. A classic example would be to allow people to query government emails for the total number of times porn domains are accessed via servers at government offices.
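A minimal sketch of the anonymizer, assuming spaCy and its small English model (en_core_web_sm) are installed: regexes catch phone numbers and emails, an off-the-shelf NER model catches names, places, and organizations, and the flagged spans would then go to a human reviewer rather than being trusted blindly.

```python
# Flag and blur likely identifying spans; humans vet the suggestions.
import re
import spacy

nlp = spacy.load("en_core_web_sm")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def redact(text: str) -> str:
    spans = [(m.start(), m.end()) for m in PHONE.finditer(text)]
    spans += [(m.start(), m.end()) for m in EMAIL.finditer(text)]
    spans += [(ent.start_char, ent.end_char) for ent in nlp(text).ents
              if ent.label_ in {"PERSON", "GPE", "ORG", "LOC"}]
    redacted = text
    for start, end in sorted(spans, reverse=True):   # replace right-to-left
        redacted = redacted[:start] + "[REDACTED]" + redacted[end:]
    return redacted

print(redact("Contact John Smith at 202-555-0173 or j.smith@agency.gov in Washington."))
```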

The Human and the Machine: Semi-automated approaches to ML

12 Apr

For a class of problems, a combination of algorithms and human input makes for the most optimal solution. For instance, three years ago, the software that won the DARPA award for recreating shredded documents used “human[s] [to] verify what the computer was recommending.” The insight is used in character recognition tasks. I have used it to create software for matching dirty data — the software was used to merge shapefiles with electoral returns at the precinct level.
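A minimal sketch of that division of labor for matching dirty data: auto-accept high-confidence fuzzy matches, queue the uncertain middle for a human, and leave the rest unmatched. The thresholds and records are illustrative.

```python
# Fuzzy matching with a human-review queue for the uncertain middle.
from difflib import SequenceMatcher

def best_match(name: str, candidates: list) -> tuple:
    scored = [(c, SequenceMatcher(None, name.lower(), c.lower()).ratio()) for c in candidates]
    return max(scored, key=lambda x: x[1])

def triage(names, candidates, accept=0.92, review=0.70):
    auto, human_queue = {}, []
    for name in names:
        match, score = best_match(name, candidates)
        if score >= accept:
            auto[name] = match                        # machine is confident
        elif score >= review:
            human_queue.append((name, match, score))  # human verifies
    return auto, human_queue

precincts = ["Ward 3 Precinct 12", "WARD-3 PCT 12", "Ward 7 Precinct 1"]
shapefile_names = ["Ward 3 Precinct 12", "Ward 7 Precinct 01"]
auto, queue = triage(precincts, shapefile_names)
print(auto)
print(queue)
```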

The class of problems for which human input proves useful has one essential attribute — humans produce unbiased, if error-prone, estimates for these problems. So for instance, it would be unwise to use humans for making the ‘last mile’ of lending decisions (see also this NYT article). (And that is something you may want to verify with training data.)

Big Data Algorithms: Too Complicated to Communicate?

11 Apr

“A decision is made about you, and you have no idea why it was done,” said Rajeev Date, an investor in data-science lenders and a former deputy director of the Consumer Financial Protection Bureau.

From NYT: If Algorithms Know All, How Much Should Humans Help?

The assertion that there is no intuition behind decisions made by algorithms strikes me as silly. So does the related assertion that such intuition cannot be communicated effectively. We can back out the logic of most algorithms. Heuristic accounts of the logic — e.g., which variables were important — can be given even more easily. For instance, for inference from seemingly complicated-to-interpret methods such as ensembles, intuition for which variables are important can be gotten in the same way as it is for methods like bagging (see the sketch below). However, even when specific points are hard to convey, the meta-logic of the system can be explained to the end user.
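A minimal sketch of backing out variable importance from an ensemble via permutation importance; the data are toy and the feature names are invented labels.

```python
# Even a "black box" ensemble yields a heuristic account of what mattered.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 2_000) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Invented labels for the toy columns, for readability.
for name, importance in zip(["income", "debt_ratio", "age", "zip_noise"],
                            result.importances_mean):
    print(f"{name:12s} importance: {importance:.3f}")
```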

What is true, however, is that it isn’t being done. For instance, the WSJ, covering the Orion routing system at UPS, reports:

“For example, some drivers don’t understand why it makes sense to deliver a package in one neighborhood in the morning, and come back to the same area later in the day for another delivery. …One driver, who declined to speak for attribution, said he has been on Orion since mid-2014 and dislikes it, because it strikes him as illogical.”

WSJ: At UPS, the Algorithm Is the Driver

Communication architecture is an essential part of all human-focused systems. And what to communicate and when are important questions that deserve careful thought. The default cannot be no communication.

The lack of systems that communicate intuition behind algorithms strikes me as a great opportunity. HCI people — make some money.

Idealog: Monetizing Websites using Mechanical Turk

14 Jul

Mechanical Turk, a system for crowd-sourcing complex Human Intelligence Tasks (HITs) by parceling out a large task and putting the parcels on a marketplace, has seen a recent resurgence, partly on the back of the success of easier-to-work-with clones like crowdflower.com.

Given that on the Internet people find it easier to part with labor than with money, one way to monetize websites may be to request that users help pay for the site by doing a task on Mechanical Turk, made easily available as part of the website. Websites can ask users to specify, as part of their profiles, the kinds of tasks they would like to work on.

For example, a news website may also provide its users a choice between watching ads and doing a small number of tasks for every X number of articles that they read. Similarly, a news website may ask its users to proofread a sentence or two of its own articles, thereby reducing its production costs.