Digest: Machine Learning: The High-Interest Credit Card of Technical Debt Paper

By: D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young

Coming from a software background, I was very interested in this paper because it applies the concept of technical debt, often discussed in the software world, to machine learning. The gist is that machine learning systems, like any other software, take on technical debt. Not all technical debt is bad, but being able to analyze where you’re taking on debt and what it costs is a great skill to have.

…imagine we have a system that uses features $x_1$, …, $x_n$ in a model. If we change the input distribution of values in $x_{1}$, the importance, weights, or use of the remaining $n−1$ features may all change—this is true whether the model is retrained fully in a batch style or allowed to adapt in an online fashion. Adding a new feature $x_{n+1}$ can cause similar changes, as can removing any feature $x_j$. No inputs are ever really independent. We refer to this as the CACE principle: “Changing Anything Changes Everything”

I’d heard of this principle before, but I really like how simply they present it, especially where they point out that all the weights of your model can change when you add or remove a feature.
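To make CACE concrete for myself, here is a minimal sketch (my own toy example, not from the paper). It assumes a plain least-squares model on synthetic data; the features `x1` and `x2`, the coefficients, and the helper `fit_weights` are all made up for illustration. The point is only that the weight learned for `x1` shifts once a correlated feature is added.

```python
import numpy as np

def fit_weights(X, y):
    """Ordinary least-squares weights for y ~ X @ w (no intercept)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # x2 is correlated with x1
y = 1.0 * x1 + 1.0 * x2 + 0.1 * rng.normal(size=n)

# Model A only sees x1, so its weight also absorbs the part of x2
# that x1 can explain.
w_before = fit_weights(x1[:, None], y)

# Model B sees both features; the weight on x1 moves even though
# nothing about x1 itself changed.
w_after = fit_weights(np.column_stack([x1, x2]), y)

print("weight on x1 alone:       ", w_before[0])
print("weight on x1 with x2 added:", w_after[0])
```

Nothing about `x1` was touched between the two fits, yet its weight is noticeably different, which is exactly the kind of entanglement the quote describes.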

If this font-size module starts consuming CTR as an input signal, and font-size has an effect on user propensity to click, then the inclusion of CTR in font-size adds a new hidden feedback loop. It’s easy to imagine a case where such a system would gradually and endlessly increase the size of all headlines.

This is an interesting scenario in which a hidden feedback loop has real-life consequences. I imagine someone could receive reports that their website is entirely in size $72$ font and not have the faintest idea why.
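A toy simulation helped me see how quickly this could run away. Everything here is an assumption for illustration: the user model in `click_through_rate`, the greedy update rule in `next_font_size`, and the starting size of 12 are invented, not taken from the paper.

```python
def click_through_rate(font_size):
    """Hypothetical user model: larger headlines get clicked slightly more."""
    return min(0.5, 0.01 + 0.001 * font_size)

def next_font_size(font_size, ctr, previous_ctr):
    """Hypothetical font-size module that consumes CTR as an input signal.

    It grows the headline whenever CTR did not drop since the last step.
    """
    return font_size + 1 if ctr >= previous_ctr else font_size - 1

def simulate(steps, font_size=12):
    """Run the loop and return the font size observed at every step."""
    history = [font_size]
    previous_ctr = click_through_rate(font_size)
    for _ in range(steps):
        ctr = click_through_rate(font_size)
        font_size = next_font_size(font_size, ctr, previous_ctr)
        previous_ctr = ctr
        history.append(font_size)
    return history

history = simulate(60)
print(history[0], "->", history[-1])
```

Because a bigger font never lowers CTR in this made-up model, the module never sees a reason to stop, and the headline size only ever goes up.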

One common mitigation strategy for unstable data dependencies is to create a versioned copy of a given signal. For example, rather than allowing a semantic mapping of words to topic clusters to change over time, it might be reasonable to create a frozen version of this mapping and use it until such a time as an updated version has been fully vetted. Versioning carries its own costs, however, such as potential staleness. And the requirement to maintain multiple versions of the same signal over time is a contributor to technical debt in its own right.

This poses an interesting question, and I don’t think there’s a blatant right or wrong answer here. Depending on the size of your signals, the storage cost of the copies may be significant. With storage costs being low today that may not be an issue, but it is worth noting. I think the real payoff of a versioned copy of a signal would be for development and analysis, as you could learn about the versioned data and see whether what you’ve learned still applies to later versions.
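Here is a rough sketch of what a frozen, versioned signal could look like in code. The `VersionedSignal` class, the version labels, and the word-to-topic entries are all hypothetical; a real system would more likely store snapshots in a database or feature store rather than in memory.

```python
from types import MappingProxyType

class VersionedSignal:
    """Hypothetical registry that stores immutable snapshots of a signal."""

    def __init__(self):
        self._versions = {}

    def publish(self, version, mapping):
        """Freeze a copy of `mapping` under `version`; versions are write-once."""
        if version in self._versions:
            raise ValueError(f"version {version!r} already published")
        # Copy first, then wrap read-only, so later upstream edits cannot leak in.
        self._versions[version] = MappingProxyType(dict(mapping))

    def get(self, version):
        """Return the read-only snapshot a consumer has pinned."""
        return self._versions[version]

# Upstream word -> topic mapping that keeps changing over time.
live_topics = {"goal": "sports", "senate": "politics"}

signal = VersionedSignal()
signal.publish("v1", live_topics)

live_topics["goal"] = "productivity"   # upstream drifts
signal.publish("v2", live_topics)      # vetted later, as a new version

pinned = signal.get("v1")
print(pinned["goal"])             # the model pinned to v1 is unaffected
print(signal.get("v2")["goal"])   # v2 carries the updated meaning
```

The staleness cost from the quote shows up directly: anything pinned to `v1` keeps seeing the old meaning of "goal" until someone deliberately moves it to `v2`.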

It may be surprising to the academic community to know that only a tiny fraction of the code in many machine learning systems is actually doing “machine learning”. When we recognize that a mature system might end up being (at most) 5% machine learning code and (at least) 95% glue code, re-implementation rather than reuse of a clumsy API looks like a much better strategy.

This is my favorite quote from the paper, as it dispels the notion that it’s common for someone to spend most of their day doing pure machine learning. There’s data capture, analysis, cleaning, infrastructure, writing code to expose your model (such as a REST API), testing, monitoring, etc.

Also, in a mature system which is being actively developed, the number of lines of configuration can far exceed the number of lines of the code that actually does machine learning. Each line has a potential for mistakes, and configurations are by their nature ephemeral and less well tested.

This quote is very close to my heart due to experiences I had at a prior employer. I’ve seen configuration files thousands of lines long where no single person understands the entire file. The paper goes into a real example of how configuration item A could disable the next five items, while configuration B could modify how C is interpreted. These sorts of side effects really need to be documented rather than relying on the creator to remember them, or even to still be around.
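One way I could imagine taming this is to write the interactions down as data and check them automatically. The sketch below is only an assumption about how that might look: the key names in `DISABLES` and the helper `find_inert_settings` are invented for illustration and do not come from the paper.

```python
# Hypothetical rules: each switch lists the settings it silently overrides.
DISABLES = {
    "use_cached_features": ["feature_window_days", "feature_source"],
}

def find_inert_settings(config, disables=DISABLES):
    """Return (setting, disabled_by) pairs for settings that are set but ignored."""
    inert = []
    for switch, targets in disables.items():
        if config.get(switch):              # the switch is turned on
            for target in targets:
                if target in config:        # ...and a disabled setting is present
                    inert.append((target, switch))
    return inert

config = {
    "use_cached_features": True,
    "feature_window_days": 7,   # has no effect while the cache switch is on
    "learning_rate": 0.1,
}

for setting, switch in find_inert_settings(config):
    print(f"warning: {setting!r} is ignored because {switch!r} is enabled")
```

Even a small check like this turns tribal knowledge into something that can run in code review or CI, so the side effect is visible to whoever edits the file next.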

Overall I really enjoyed the paper and look forward to more papers that analyze the use of machine learning models in production and the problems one may encounter.