American presidential political campaigns, big construction projects, and big-budget moviemaking have a lot in common. They are all complex enterprises with lots of moving parts, they all bring together lots of people for a short period, and they all need people to hit the ground running and execute in lockstep to succeed. Success in these activities relies a lot on great software and the ability to hire competent people quickly. It remains an open opportunity to build great software for these industries, software that allows people to plan and execute together.
A lot of people have their lives cut short because they eat too much and exercise too little. Worse, the quality of their shortened lives is typically much lower as a result of
avoidable' illnesses that stem frombad behavior.’ And that isn’t all. People who are not feeling great are unlikely to be as productive as those who are. Ill-health also imposes a significant psychological cost on loved ones. The net social cost is likely enormous.
One way to reduce such costly avoidable misery is to invest upfront. Teach people good habits and psychological skills early on, and they will be less likely to self-harm.
So why do we invest so little up front? Especially when we know that people are ill-informed (about the consequences of their actions) and myopic.
Part of the answer is that there are few incentives for anyone else to care. Health insurance companies don’t make their profits by caring. They make them by investing wisely. And by minimizing ‘avoidable’ short-term costs. If a member is unlikely to stick with a health plan for life, why invest in their long-term welfare? Or work to minimize negative externalities that may affect the next generation?
One way to make health insurance care is to rate them on estimated quality-adjusted years saved due to interventions they sponsored. That needs good interventions and good data science. And that is an opportunity. Another way is to get the government to invest heavily early on to address this market failure. Another version would be to get the government to subsidize care that reduces long-term costs.
Clarifai is a promising AI start-up. In a short(ish) time, it has made major progress on an important problem. And it is rapidly rolling out products with lots of business potential. But there are still some things that it could do.
As I understand it, the base version of Clarifai API is trying to do two things at once: a) learn various recognizable patterns in images b) rank the patterns based on ‘appropriateness’ and probab_true. I think Clarifai would have to split these two things over time and allow people to input what abstract dimensions are ‘appropriate’ for them. As the idiom goes, an image is a thousand words. In an image, there can be information about social class, race, and country, but also shapes, patterns, colors, perspective, depth, time of the day etc. And Clarifai should allow people to pick dimensions appropriate for the task. Though, defining dimensions would be hard. But that shouldn’t stymie the efforts. And ad hoc advances may be useful. For instance, one dimension could be abstract shapes and colors. Another could be the more ‘human’ dimension etc.
Extending the logic, Clarifai should support the building of abstract data science applications that solve a particular problem. For instance, say a user is only interested in learning about whether the photo features a man or a woman. And the user wants to build a Clarifai based classifier. (That person is me. The task is inferring gender of first names. See here.) Clarifai could in principle allow the user to train a classifier that uses all other information in the images, including jewelry, color, perspective, etc. and provide an out of sample error for that particular task. The crucial point is allowing users fuller access to what Clarifai can do and then letting the users manage it at their ends. To that end again, input about user objectives needs to be built into the API. Basic hooks could be developed for classification and clustering inputs.
More generally, Clarifai should eventually support more user inputs and a greater variety of outputs. Limiting the product to tagging is a mistake.
There are three other general directions for Clarifai to go into. A product that automatically sections an image into multiple images and tags each section would be useful. This would allow, for instance, to count the number of women in a photo. Another direction to go would be to provide the ‘best’ set of tags that collectively describe a set of images. (It may seem like violating the spirit of what I noted above but it needn’t — a user could want just this.) By the same token, Clarifai could build general-purpose discrimination engines — a list of tags that distinguishes image(s) the best.
Beyond this, the obvious. Clarifai can also provide synonyms of tags to make tags easier to use. And it could allow users to specify if they want, say tags in ‘UK English’ etc.
The rate of production of scholarship has never been higher. And while our ability to discover relevant scholarship per unit of time has never kept pace with the production of knowledge, it has also risen sharply—most recently, due to Google scholar.
The efficiency of discovery of relevant scholarship, however, has plateaued over the last few years, even as the rate of production of scholarship has kept its steady upward climb. Part of the reason why the growth has plateaued is because current ways of doing things cannot be made considerably more efficient very quickly. New growth in the rate of discovery will need knowledge discovery systems to get more data on the user’s needs, and access to structured databases of academic production.
The easiest next step would perhaps be to build a collaborative recommendation system. In particular, think of a system that takes your .bib file, trawls a large database of citation lists, for example, JSTOR or PLoS, and produces recommendations for scholarship you may have missed. The logic of a collaborative recommender system is pretty simple: learn from articles which have cited similar scholarship as you. If we have metadata on scholarship, for instance, sub-field, the actual text of an article or even the abstract, we could recommend based on the extent to which two articles cite the same kind of scholarly article. Direct elicitations (search terms are but one form) from the user can also be added to guide the recommendations. And meta-characteristics, for instance, page rank of a piece of scholarship, could be used to order recommendations.
Trevor Hastie is a superb teacher. He excels at making complex things simple. Till recently, his lectures were only available to those fortunate enough to be at Stanford. But now he teaches a free introductory course on statistical learning via the Stanford online learning platform. (I browsed through the online lectures — they are superb, the quality of his teaching, undiminished.)
Online learning, pregnant with rich possibilities, however, has met its cynics and its first failures. The proportion of students who drop-off after signing up for these courses is astonishing (but then so are the sign-up rates). And some very preliminary data suggest that compared to some face-to-face teaching, students enrolled in online learning don’t perform as well. But these are early days and much can be done to refine the systems and to also promote the interests of thousands upon thousands of teachers who have much to contribute to the world. In that spirit, some suggestions:
Currently, Coursera, edX and similar such ventures mostly attempt to replicate an off-line course online. The model is reasonable but doesn’t do justice to the unique opportunities of teaching online:
The Online Learning Platform: For Coursera etc. to be true learning platforms, they need to provide rich APIs for allowing development of both learning and teaching tools (and easy ways for developers to monetize these applications). For instance, professors could access to visualization applications and students access to applications that puts them in touch with peers interested in peer-to-peer teaching. The possibilities are endless. Integration with other social networks and other communication tools at the discretion of students are obvious future steps.
Learning from Peers: The single teacher model is quaint. It locks out contributions from other teachers. And it locks out teachers from potential revenues. We could turn it into a win-win. Scout ideas and materials from teachers and pay them for their contributions, ideally as a share of the revenue — so that they have a steady income.
The Online Experimentation Platform: So many researchers in education and psychology are curious and willing to contribute to this broad agenda of online learning. But no protocols and platforms have been established to exploit this willingness. The online learning platforms could release more data at one end. And at the other end, they could create a platform for researchers to option ideas that the community and company can vet. Winning ideas can form the basis of research that is implemented on the platform.
Beyond these ideas, there are ideas to do with courses that enhance student’s ability to succeed in these courses. Providing students ‘life skills’ either face-to-face or online, teaching them ways to become more independent students, can allow them to better adapt to the opportunities and challenges of learning online.
Uber prices rides based on the availability of Uber drivers in the area, the demand, and the destination. This price is the same for whoever is on the other end of the transaction. For instance, Uber doesn’t take into account that someone in a hurry may be willing to pay more for a quicker service. By the same token, Uber doesn’t allow drivers to distinguish themselves on price (Airbnb, for instance, allows this). It leads to a simpler interface but it produces some inefficiency—some needs go unmet, some drivers go underutilized, etc. It may make sense to try out a system that allows for bidding on both sides.
Freedom of Information Act is vital for holding the government to account. But little effort has gone into building tools that enable fulfillment of FOIA requests. As a consequence of this underinvestment, fulfilling FOIA requests often means onerous, costly work for government agencies and long delays for the requesters. Here, I discuss a couple of alternatives. One tool that can prove particularly effective in lowering the cost of fulfilling FOIA requests is an anonymizer—a tool that detects proper nouns, addresses, telephone numbers, etc. and blurs them. This is easily achieved using modern machine learning methods. To ensure 100% accuracy, humans can quickly vet the suggestions by the algorithm along with ‘suspect words.’ Or a captcha like system that asks people in developing countries to label suspect words as nouns etc. can be created to further reduce costs. This covers one conventional solution. Another way to solve the problem would be to create a sandbox architecture. Rather than give requesters stacks of redacted documents, often unnecessary, one can allow people the right to run certain queries on the data. Results of these queries can be vetted internally. A classic example would be to allow people to query government emails for the total number of times porn domains are accessed via servers at government offices.
Recent Wikileaks episode has highlighted the immense control national governments and private companies have on what content can be hosted. Within days of being identified by the U.S. government as a problem, private companies in charge of hosting and providing banking services to Wikileaks withdrew support, largely neutering organization’s ability to raise funds, and host content.
Successful attempts to cut Internet in Egypt and Libya also pose questions of a similar nature.
So two questions follow. Should anything be done about it? And if so, what? The answer to the first is not as clear, but on balance, perhaps such absolute discretionary control over the fate of ‘hostile’ information/or technology should not be the allowed. As to the second question, given many of the hosting, banking companies, etc. essential to disseminating content are privately held, and susceptible to both government and market pressures, dissemination engine ought to be independent of those as much as possible (bottlenecks remain: most pipes are owned by governments or corporations). Here are three ideas:
- Create an international server farm on which content can be hosted by anyone but only removed after due process, set internationally. (NGO supported farms may work as well.)
- We already have ways to disseminate content without centralized hosting—P2P. But these systems lack a browser that collates torrents and builds a webpage in live time. Such a torrent based browser can vastly improve the ability of P2P networks to host content.
- For Libya/Egypt etc. the problem is of a different nature. We need applications like Twitter to continue to function even if the artery to central servers goes down. This can be handled by building applications in a manner that they can be run on edge servers with local data. I believe this kind of redundancy can also be useful for businesses.
Recruiting diverse samples on the Internet is a tough business. Even if one uses probability sampling, one must still think of creative ways of incentivizing response. For example, Knowledge Networks samples uses RDD (one can use ABS) and then entices selected respondents with free Internet access (or money) in exchange for filling out surveys each month.
One can extend that concept in the following manner – there are a variety of places where people must wait or have time to spare, where they also increasingly have their laptops and their cell-phones, for example, the airport, the airplane, the dentist’s office, the DMV, hotel room, etc., and where they look to browse the Internet without paying any money. There are other places where people just want access to the Internet without paying (say cafes). Hence, one way to recruit people for surveys would be to give free temporary Internet access or local coupon(s) in return for filling out survey(s).
One can extend this method to ‘buying’ temporary Internet access to buying any sort of product (e.g. content) or service.
While many of the articles and judicial opinions are never cited, a system can be established to build summaries of those that are. The idea is straightforward- harvest sentences just before the citation in academic articles citing the article of interest. One can learn from the summaries of the articles citing the articles, or from the summaries of other co-cited articles. The framework can be applied to judicial opinions, and to tally positions by politicians/institutions over a topic, etc.
This form of “summarization” utilizes intellectual work done by others, and allows for tracking major rebuttals and weaknesses identified by others. On the downside, summarization is necessarily limited to points made by others. Either way, the method is likely to produce superior results than methods that harvest ‘topic sentences’ etc.
Update: In 2015, I released the first draft of autosum.