American PII: Lapses in Securing Confidential Data

23 Sep

At least 83% of Americans have had confidential data they shared with a company exposed in a breach (see here and here). The list of companies most frequently implicated in the loss of confidential data makes for sobering reading. Reputable companies like LinkedIn (Microsoft), Adobe, and Dropbox are among the top 20 worst offenders.

Source: Pwned: The Risk of Exposure From Data Breaches

There are two other facts that are hard to reconcile. First, many of the companies that have failed to safeguard confidential data hold a highly regarded security certification like SOC 2 (see, e.g., here). Second, many data breaches are caused by elementary errors, e.g., “the password cryptography was poorly done and many were quickly resolved back to plain text” (here).

The explanation for why companies with highly regarded security certifications fail to protect data is probably mundane. Supporters may rightly claim that these certifications dramatically reduce the chances of a breach without eliminating them. And given how many companies hold data on a typical American, even a 1% failure rate can easily produce the statistic we started with.

So, how do we secure data? Before discussing solutions, let me describe the current state. In many companies, PII is spread across multiple databases. Data protection rests on processes set up to control access to the data. The data may also be encrypted, but generally it isn’t. Many of these processes are auditable, and certifications are granted based on audits.

Rather than relying on adherence to processes, a better bet may be to prevent PII from percolating across the system in the first place. The primary options for prevention are customer-side PII removal and ingestion-time PII removal. (Methods like differential privacy can be used at either end and in how automated data collection services are set up.) Beyond these systems, you need a way to handle cases where PII is shown in the product. One option is to hash the PII during ingest and look it up right before serving from a system that is yet more tightly access controlled, as in the sketch below. All of this is well known. The lack of adoption is partly due to the fact that these services have yet to be abstracted out enough that adding them is as easy as editing a YAML file. And therein lies an opportunity.
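To make the hash-and-lookup idea concrete, here is a minimal Python sketch. The `PIIVault` class, the in-memory store, and the hard-coded HMAC key are all illustrative assumptions; a real deployment would use a secrets manager, a durable store, and its own authentication and audit layer.

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # illustrative; in practice, from a secrets manager


class PIIVault:
    """Stand-in for the more tightly access-controlled lookup service."""

    def __init__(self):
        self._store = {}  # token -> raw value; only the vault holds raw PII

    def tokenize(self, value: str) -> str:
        """Called at ingest: replace raw PII with an opaque token."""
        token = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
        self._store[token] = value
        return token

    def resolve(self, token: str) -> str:
        """Called right before serving, behind the vault's own access controls."""
        return self._store[token]


vault = PIIVault()

# Ingest: the event that percolates through the rest of the system
# carries only the token, never the raw email.
event = {"user": vault.tokenize("jane@example.com"), "action": "login"}

# Serving: resolve the token at the last moment, only for authorized callers.
print(vault.resolve(event["user"]))  # jane@example.com
```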

Build Software for the Lay User

14 Feb

Most word-processing software helpfully points out grammatical errors and spelling mistakes. Some programs even autocorrect. And some, like Grammarly, even give style advice.

Now consider software used for business statistics. Say you want to compute the correlation between two vectors: [100, 200, 300, 400, 500, 600] and [1, 2, 3, 4, 5, 17000]. Most (all?) software will output .65. (The software assumes you want Pearson’s correlation.) Experts know that the one relatively large value in the second vector has an outsized influence on the correlation. For instance, switching it to -17000 reverses the correlation coefficient to -.65. And if you remove the last observation, the correlation is 1. But a lay user would be none the wiser. Common software, e.g., Excel, R, Stata, and Google Sheets, does not warn the user about the outlier and its potential impact on the result. It should.
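The sensitivity is easy to verify. A short check in Python with NumPy (any of the packages above would show the same thing):

```python
import numpy as np

x = np.array([100, 200, 300, 400, 500, 600])
y = np.array([1, 2, 3, 4, 5, 17000])

# Pearson's correlation, which most software silently assumes.
print(round(np.corrcoef(x, y)[0, 1], 2))  # 0.65

# Flip the sign of the one extreme value: the correlation reverses.
y_flipped = y.copy()
y_flipped[-1] = -17000
print(round(np.corrcoef(x, y_flipped)[0, 1], 2))  # -0.65

# Drop the extreme observation: the remaining points are perfectly linear.
print(round(np.corrcoef(x[:-1], y[:-1])[0, 1], 2))  # 1.0
```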

Take another example: the interpretation of AUC is fickle when you have binary predictors (see here), as much depends on how you treat ties. It is an obvious but subtle point. Commonly used statistical software, however, does not warn people about the issue.
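A small sketch shows how much the tie convention matters. The data below are made up; the three conventions (half credit, ties-as-wins, ties-as-losses) bracket the range of rank-sum AUCs you can report for the same binary predictor:

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0, 1, 0, 1, 1, 0])  # a binary predictor: mostly tied pairs

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairs = [(p, n) for p in pos for n in neg]  # all positive-negative pairs

wins = sum(p > n for p, n in pairs)
ties = sum(p == n for p, n in pairs)

print((wins + 0.5 * ties) / len(pairs))  # 0.67: the usual half-credit rule
print((wins + ties) / len(pairs))        # 0.89: ties credited to the model
print(wins / len(pairs))                 # 0.44: ties counted against it
```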

Given the rate at which knowledge is produced, increasingly everyone is a lay user. For instance, in 2013, Lin showed that estimating the ATE using OLS with a full set of treatment-covariate interactions improves its precision. But such analyses are uncommon in economics papers. The analysis could be absent for a variety of reasons: 1. ignorance, 2. difficulty in estimating the model, 3. disbelief in the result, etc. However, only ignorance survives scrutiny. The model is easy to estimate (see the sketch below), so the second explanation is unlikely to explain much. The third also seems unlikely, given that the result was published in a prominent statistical journal and experts use it.
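For the curious, a minimal sketch of Lin’s estimator on simulated data. The data-generating process is invented, and HC2 robust standard errors are one common choice, not the only one:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)                      # pre-treatment covariate
d = rng.integers(0, 2, size=n)              # randomized binary treatment
y = 1 + 2 * d + 3 * x + rng.normal(size=n)  # outcome

# Difference in means: regress y on the treatment alone.
diff = sm.OLS(y, sm.add_constant(d)).fit(cov_type="HC2")

# Lin (2013): center the covariates, then include them and their
# interactions with treatment; the coefficient on d estimates the ATE.
xc = x - x.mean()
X = sm.add_constant(np.column_stack([d, xc, d * xc]))
lin = sm.OLS(y, X).fit(cov_type="HC2")

print("difference in means:", diff.params[1].round(3), diff.bse[1].round(3))
print("Lin (2013):         ", lin.params[1].round(3), lin.bse[1].round(3))
```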

If ignorance is the primary explanation, should the onus of staying informed about the latest useful methodological discoveries fall on researchers working in a substantive area? Plausibly. But that is clearly not working very well. One way to accelerate the dissemination of useful discoveries is via software, which can deliver such guidance as ‘warnings.’
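As a sketch of what such a warning might look like, here is a wrapper around Pearson’s correlation that flags extreme values using a median/MAD rule of thumb. The function name, heuristic, and threshold are all illustrative assumptions, not recommendations:

```python
import warnings

import numpy as np


def cor(x, y, z_thresh=3.5):
    """Pearson's correlation with a lay-user warning about extreme values."""
    for name, v in (("x", np.asarray(x, float)), ("y", np.asarray(y, float))):
        med = np.median(v)
        mad = np.median(np.abs(v - med))
        if mad > 0:
            # Robust z-scores: unlike ordinary z-scores, these are not
            # masked by the very outliers they are meant to detect.
            z = 0.6745 * np.abs(v - med) / mad
            if (z > z_thresh).any():
                warnings.warn(
                    f"'{name}' contains extreme values that may dominate "
                    "Pearson's correlation; inspect them or consider a "
                    "robust alternative."
                )
    return np.corrcoef(x, y)[0, 1]


print(cor([100, 200, 300, 400, 500, 600], [1, 2, 3, 4, 5, 17000]))
```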

The guidance can be authored manually. Or we can use machine learning, borrowing the strategy used by Grammarly, which has expert editors edit lay users’ sentences and uses those edits as training data.

We can improve science by building software that provides better guidance. The worst case for such software is probably business as usual, in which some researchers get bad advice and many get none.