Software engineering has changed dramatically in the last few decades. The rise of AWS, high-level languages, powerful libraries, and frameworks increasingly allows engineers to focus on business logic. Today, software engineers spend much of their time writing code that reasons over data to show something or do something. But how engineering is done has not caught up in some crucial ways:
- Software Development Tools. Most data scientists today work in a notebook on a server, interacting heavily with the data as they refine their code (the algorithm). Most engineers still work locally, without access to production data. Part of the reason is precisely that they work locally: for security and compliance reasons, most companies ban access to production data from local machines. Plausibly, a bigger reason is that engineers are stuck in a paradigm where they don't see access to production data as foundational to faster, higher-quality software development. This belief is reflected in the ad-hoc workarounds being tried across the industry, e.g., synthetic data (which is hard to create, maintain, and scale).
- Data Modeling. The focus on data modeling has sharply decreased over time in many companies. At least four forces underlie this trend. First, the sheer volume of data being generated, combined with the rise of cheap blob storage (and the comparatively much higher cost of compute), incentivizes storing data unstructured. Second, agile development, which prioritizes customer-facing progress over short time units, can cause underinvestment in costly, foundational work (see here). Third, engineering organizations are changing: the producers of data are no longer seen as its owners. The fourth point is perhaps the most crucial: the surfeit of data has led to magical thinking about how easily data can be turned into insights. Except for a small minority of cases, e.g., search, the ability to derive business insights from unstructured, dirty data simply doesn't exist. The surfeit of data has widened and deepened the pool of insights that can be delivered; it hasn't made those insights any easier to derive. They continue to rely on good old-fashioned manual work: understanding the use case, then curating and structuring the data appropriately. (It also then becomes an opportunity for building software.)
Engineers pay the price of not investing in data modeling by making the code more complex (and hence harder to maintain) and by allocating time to fix “bugs.” (I put bugs in air quotes because the obvious consequences of a bad system should not be called bugs.)
- Data Drift. Machine Learning Engineers (MLEs) obsess about it. Most other engineers have never heard of the term. Everyone should worry. Technically, the only difference between using ML and conventional engineering for rule creation is that ML auto-creates the rules while conventional engineering handcrafts them. Both systems test the efficacy of their rules on the current data. Both systems assume that the data will not drift. But only MLEs monitor the data, thinking hard about which data their rules work for and how to detect drift. Other engineers need to sign up.
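The monitoring that MLEs do transfers directly to handcrafted rules. As a minimal sketch, here is one common drift check, the Population Stability Index (PSI), comparing the distribution a rule was tuned on against today's traffic. All names, samples, and thresholds below are illustrative assumptions, not any particular library's API.

```python
import math
from collections import Counter

def psi(expected, observed, bins=10):
    """Population Stability Index between two numeric samples.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    lo = min(min(expected), min(observed))
    hi = max(max(expected), max(observed))
    width = (hi - lo) / bins or 1.0
    def dist(xs):
        counts = Counter(min(int((x - lo) / width), bins - 1) for x in xs)
        n = len(xs)
        # A small floor avoids log(0) for empty bins.
        return [max(counts.get(b, 0) / n, 1e-6) for b in range(bins)]
    e, o = dist(expected), dist(observed)
    return sum((oi - ei) * math.log(oi / ei) for ei, oi in zip(e, o))

# Hypothetical example: a rule tuned on last quarter's values...
training_sample = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
# ...while today's traffic has shifted. The check flags it before the rule degrades.
todays_sample = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
assert psi(training_sample, training_sample) < 0.1   # identical data: no drift
assert psi(training_sample, todays_sample) > 0.25    # shifted data: major drift
```

Run as a scheduled job per input feature, a check like this gives conventional rules the same early warning that MLEs build for their models.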
The solutions are as simple as the problems are immense: invest in data quality, data monitoring, and data models. To achieve that, we need to change how organizations are structured, how they are run, and what engineers think the hard problems are.
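As a sketch of what investing in data quality and data models can look like at the code level, here is a minimal ingestion-time gate that enforces a data model at the boundary, so downstream code never absorbs the complexity of dirty records. The schema and field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OrderEvent:
    # Hypothetical data model for an incoming order record.
    order_id: str
    amount_cents: int
    currency: str

def validate(raw: dict) -> OrderEvent:
    """Reject records that violate the data model at ingestion,
    instead of letting every consumer re-handle dirty data."""
    if not raw.get("order_id"):
        raise ValueError("order_id is required")
    amount = raw.get("amount_cents")
    if not isinstance(amount, int) or amount < 0:
        raise ValueError(f"amount_cents must be a non-negative int, got {amount!r}")
    currency = raw.get("currency")
    if currency not in {"USD", "EUR", "GBP"}:
        raise ValueError(f"unknown currency {currency!r}")
    return OrderEvent(raw["order_id"], amount, currency)

clean = validate({"order_id": "o-1", "amount_cents": 1999, "currency": "USD"})
try:
    validate({"order_id": "o-2", "amount_cents": -5, "currency": "USD"})
except ValueError:
    pass  # the bad record is rejected at the boundary, not deep in business logic
```

The design choice is the point: the check lives where the data is produced, which is exactly the ownership question the organizational changes above need to settle.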