Coding at a Distance
Senior professors have long lived the future of coding. Professors ask. Underlings code. Professors critique the output, e.g., "I prefer statistical significance stars for the betas but not the constant term." Underlings revise the code. The cycle continues.
Something similar can play out at companies.
Now, thanks to LLMs, we can all adopt this working style.*
The style’s success depends on a reasonably competent underling and on the user’s ability to ask the right questions. Data suggest that LLMs are reasonably competent at standard coding tasks (though their performance on econometric tasks needs more validation). The user’s ability to ask the right questions is the bottleneck, which means experts benefit more reliably from the system. For instance, think of an analyst who asks the LLM to regress Y on X. The agent duly complies. Except the outcome variable is heavily skewed, and the interpretation of the results hinges on knowing that. If the user doesn’t ask for the distribution of the outcome variable or request an alternative specification that sheds light on the skew, the user is none the wiser. (Most statistical software, which LLMs would call on, doesn’t interrogate assumptions or produce artefacts that aid interpretation.)
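To make the regression example concrete, here is a minimal Python sketch with simulated data and hypothetical variable names (`x`, `y`). It shows the diagnostic the user would have to know to request: checking the outcome’s skew and comparing a log-transformed specification.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew
import statsmodels.api as sm

# Hypothetical data: a heavily right-skewed (log-normal) outcome.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=500)})
df["y"] = np.exp(1.0 + 0.5 * df["x"] + rng.normal(size=500))

# What the analyst asked for: regress Y on X. The agent duly complies.
naive = sm.OLS(df["y"], sm.add_constant(df["x"])).fit()

# What the analyst has to think to ask for: the outcome's distribution.
print(f"skewness of y: {skew(df['y']):.2f}")  # large positive skew

# An alternative specification that respects the skew.
logged = sm.OLS(np.log(df["y"]), sm.add_constant(df["x"])).fit()
print(naive.params, logged.params, sep="\n")
```

Nothing in the naive fit flags the skew; the user has to ask.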
There are at least three solutions to the quandary. The first is to use SFT or RLHF to build a facsimile of a defensive data scientist or developer that interrogates assumptions as part of its ‘reasoning.’ The second is to change the statistical software itself to perform defensive analysis. For instance, a Pearson’s correlation calculator could be coded to warn the user about skewed data, emitting a message that the LLM sees when running the program in the sandbox (see the sketch below). The third is to use RLHF to design LLMs that highlight the assumptions and common failure modes as part of their analytical outputs. Designing agents that produce a comprehensive test suite may be another way to help matters, though we first need good benchmark data on models’ ability to provide comprehensive test coverage.
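Here is a minimal sketch of what the second solution could look like, assuming a SciPy-based stack; `defensive_pearsonr` and the skew threshold of 2.0 are illustrative choices, not an existing API.

```python
import warnings
import numpy as np
from scipy.stats import pearsonr, skew

def defensive_pearsonr(x, y, skew_threshold=2.0):
    """Pearson's r that also warns about skewed inputs.

    The warning is emitted on the same channel the calling agent
    reads, so an LLM running this in a sandbox sees it too.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    for name, arr in (("x", x), ("y", y)):
        s = skew(arr)
        if abs(s) > skew_threshold:
            warnings.warn(
                f"{name} is heavily skewed (skewness={s:.2f}); "
                "Pearson's r may be driven by a few extreme values. "
                "Consider a rank correlation (Spearman) or a transform.",
                stacklevel=2,
            )
    return pearsonr(x, y)
```

The point of the design is that the warning travels with the computation: any agent that calls the routine gets the caveat for free, whether or not it thought to ask.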
*You don’t need to resort to the senior professor’s working style. For a class of problems where success is well-defined, as in supervised ML, you can hook the propose-and-refine loop up to an evaluation framework (a sketch follows). This kind of architecture goes back at least to AutoML.
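A minimal sketch of that propose-evaluate-refine architecture; the `llm_propose` and `evaluate` callables are hypothetical stand-ins for whatever model API and scoring harness are in use.

```python
from typing import Callable

def propose_and_refine(
    llm_propose: Callable[[str, str], str],  # (task, feedback) -> candidate code
    evaluate: Callable[[str], float],        # candidate code -> validation score
    task: str,
    n_rounds: int = 5,
) -> str:
    """Loop an LLM proposer against an automatic evaluator.

    Works only when success is well-defined, e.g., held-out
    accuracy in supervised ML.
    """
    best_code, best_score, feedback = "", float("-inf"), ""
    for _ in range(n_rounds):
        candidate = llm_propose(task, feedback)
        score = evaluate(candidate)
        if score > best_score:
            best_code, best_score = candidate, score
        feedback = f"Previous attempt scored {score:.3f}; improve on it."
    return best_code
```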