At least 83% of Americans have had their confidential data shared with a company breached (see here and here). The list of most frequently implicated companies in the loss of confidential data makes for sobering reading. Reputable companies like Linkedin (Microsoft), Adobe, Dropbox, etc., are among the top 20 worst offenders.
There are two other seemingly contradictory facts. First, many of the companies that haven’t been able to safeguard confidential data have some kind of highly regarded security certification like SOC-2 (see, e.g., here). The second is that many data breaches are caused by elementary errors, e.g., “the password cryptography was poorly done and many were quickly resolved back to plain text” (here).
The explanation for why companies with highly regarded security certifications fail to protect the data is probably mundane. Supporters of these certifications may rightly claim that these certifications dramatically reduce the chances of a breach without eliminating it. And a 1% error rate can easily lead to the observation we started with.
So, how do we secure data? Before discussing solutions, let me describe the current state. In many companies, PII data is spread across multiple databases. Data protection is based on processes set up for controlling access to data. The data may also be encrypted, but it generally isn’t. Many of these processes to secure the data are also auditable and certifications are granted based on audits.
Rather than relying on adherence to processes, a better bet might be to not let PII data percolate across the system. The primary options for prevention are customer-side PII removal and ingestion-time PII removal. (Methods like differential privacy can be used at either end and in how automated data collection services are setup.) Beyond these systems, you need a system for cases where PII data is shown in the product. One way to handle such cases is to build a system where the PII is hashed during ingest and looked up right before serving from a system that is yet more tightly access controlled. All of these things are well known. Their lack of adoption is partly due to the fact that these services have yet to be abstracted out enough that adding them is as easy as editing a YAML file. And there lies an opportunity.