{"title": "Hidden Technical Debt in Machine Learning Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 2503, "page_last": 2511, "abstract": "Machine learning offers a fantastically powerful toolkit for building useful complexprediction systems quickly. This paper argues it is dangerous to think ofthese quick wins as coming for free. Using the software engineering frameworkof technical debt, we find it is common to incur massive ongoing maintenancecosts in real-world ML systems. We explore several ML-specific risk factors toaccount for in system design. These include boundary erosion, entanglement,hidden feedback loops, undeclared consumers, data dependencies, configurationissues, changes in the external world, and a variety of system-level anti-patterns.", "full_text": "Hidden Technical Debt in Machine Learning Systems\n\nD. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips\n\n{dsculley,gholt,dgg,edavydov,toddphillips}@google.com\n\nGoogle, Inc.\n\nDietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Franc\u00b8ois Crespo, Dan Dennison\n\n{ebner,vchaudhary,mwyoung,jfcrespo,dennison}@google.com\n\nGoogle, Inc.\n\nAbstract\n\nMachine learning offers a fantastically powerful toolkit for building useful com-\nplex prediction systems quickly. This paper argues it is dangerous to think of\nthese quick wins as coming for free. Using the software engineering framework\nof technical debt, we \ufb01nd it is common to incur massive ongoing maintenance\ncosts in real-world ML systems. We explore several ML-speci\ufb01c risk factors to\naccount for in system design. These include boundary erosion, entanglement,\nhidden feedback loops, undeclared consumers, data dependencies, con\ufb01guration\nissues, changes in the external world, and a variety of system-level anti-patterns.\n\n1\n\nIntroduction\n\nAs the machine learning (ML) community continues to accumulate years of experience with live\nsystems, a wide-spread and uncomfortable trend has emerged: developing and deploying ML sys-\ntems is relatively fast and cheap, but maintaining them over time is dif\ufb01cult and expensive.\n\nThis dichotomy can be understood through the lens of technical debt, a metaphor introduced by\nWard Cunningham in 1992 to help reason about the long term costs incurred by moving quickly in\nsoftware engineering. As with \ufb01scal debt, there are often sound strategic reasons to take on technical\ndebt. Not all debt is bad, but all debt needs to be serviced. Technical debt may be paid down\nby refactoring code, improving unit tests, deleting dead code, reducing dependencies, tightening\nAPIs, and improving documentation [8]. The goal is not to add new functionality, but to enable\nfuture improvements, reduce errors, and improve maintainability. Deferring such payments results\nin compounding costs. Hidden debt is dangerous because it compounds silently.\n\nIn this paper, we argue that ML systems have a special capacity for incurring technical debt, because\nthey have all of the maintenance problems of traditional code plus an additional set of ML-speci\ufb01c\nissues. This debt may be dif\ufb01cult to detect because it exists at the system level rather than the code\nlevel. Traditional abstractions and boundaries may be subtly corrupted or invalidated by the fact that\ndata in\ufb02uences ML system behavior. 
\n\nThis paper does not offer novel ML algorithms, but instead seeks to increase the community\u2019s awareness of the difficult tradeoffs that must be considered in practice over the long term. We focus on system-level interactions and interfaces as an area where ML technical debt may rapidly accumulate. At the system level, an ML model may silently erode abstraction boundaries. The tempting re-use or chaining of input signals may unintentionally couple otherwise disjoint systems. ML packages may be treated as black boxes, resulting in large masses of \u201cglue code\u201d or calibration layers that can lock in assumptions. Changes in the external world may influence system behavior in unintended ways. Even monitoring ML system behavior may prove difficult without careful design.\n\n2 Complex Models Erode Boundaries\n\nTraditional software engineering practice has shown that strong abstraction boundaries using encapsulation and modular design help create maintainable code in which it is easy to make isolated changes and improvements. Strict abstraction boundaries help express the invariants and logical consistency of the information inputs and outputs of a given component [8].\n\nUnfortunately, it is difficult to enforce strict abstraction boundaries for machine learning systems by prescribing specific intended behavior. Indeed, ML is required in exactly those cases when the desired behavior cannot be effectively expressed in software logic without dependency on external data. The real world does not fit into tidy encapsulation. Here we examine several ways that the resulting erosion of boundaries may significantly increase technical debt in ML systems.\n\nEntanglement. Machine learning systems mix signals together, entangling them and making isolation of improvements impossible. For instance, consider a system that uses features x_1, ..., x_n in a model. If we change the input distribution of values in x_1, the importance, weights, or use of the remaining n \u2212 1 features may all change. This is true whether the model is retrained fully in a batch style or allowed to adapt in an online fashion. Adding a new feature x_{n+1} can cause similar changes, as can removing any feature x_j. No inputs are ever really independent. We refer to this here as the CACE principle: Changing Anything Changes Everything. CACE applies not only to input signals, but also to hyper-parameters, learning settings, sampling methods, convergence thresholds, data selection, and essentially every other possible tweak.
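\n\nAs a minimal sketch of CACE, consider the following Python snippet (assuming only numpy; the features and coefficients are invented for illustration). Degrading just one input signal shifts the learned weight onto a correlated feature, even though nothing else in the system changed:\n\nimport numpy as np\n\nrng = np.random.default_rng(0)\nn = 100000\nx1 = rng.normal(size=n)\nx2 = 0.8 * x1 + 0.6 * rng.normal(size=n)  # x2 is correlated with x1\nx3 = rng.normal(size=n)\ny = x1 + x2 + x3 + 0.1 * rng.normal(size=n)\n\ndef fit(cols):\n    X = np.column_stack(cols)\n    w, *_ = np.linalg.lstsq(X, y, rcond=None)\n    return w\n\nw_before = fit([x1, x2, x3])\n# Degrade only x1, e.g., an upstream signal becomes noisier.\nx1_noisy = x1 + rng.normal(size=n)\nw_after = fit([x1_noisy, x2, x3])\nprint(np.round(w_before, 2))  # approximately [1.0, 1.0, 1.0]\nprint(np.round(w_after, 2))   # weight shifts from x1 onto the correlated x2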
\n\nOne possible mitigation strategy is to isolate models and serve ensembles. This approach is useful in situations in which sub-problems decompose naturally, such as in disjoint multi-class settings like [14]. However, in many cases ensembles work well because the errors in the component models are uncorrelated. Relying on the combination creates a strong entanglement: improving an individual component model may actually make the system accuracy worse if the remaining errors are more strongly correlated with the other components.\n\nA second possible strategy is to focus on detecting changes in prediction behavior as they occur. One such method was proposed in [12], in which a high-dimensional visualization tool was used to allow researchers to quickly see effects across many dimensions and slicings. Metrics that operate on a slice-by-slice basis may also be extremely useful.\n\nCorrection Cascades. There are often situations in which a model m_a for problem A exists, but a solution for a slightly different problem A\u2032 is required. In this case, it can be tempting to learn a model m\u2032_a that takes m_a as input and learns a small correction as a fast way to solve the problem.\n\nHowever, this correction model has created a new system dependency on m_a, making it significantly more expensive to analyze improvements to that model in the future. The cost increases when correction models are cascaded, with a model for problem A\u2032\u2032 learned on top of m\u2032_a, and so on, for several slightly different test distributions. Once in place, a correction cascade can create an improvement deadlock, as improving the accuracy of any individual component actually leads to system-level detriments. Mitigation strategies are to augment m_a to learn the corrections directly within the same model by adding features to distinguish among the cases, or to accept the cost of creating a separate model for A\u2032.
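\n\nThe coupling is easy to see in code. The following is a hypothetical sketch (the class names and the constant scores are placeholders, not the paper\u2019s method):\n\nclass BaseModel:  # stands in for m_a, trained for problem A\n    def predict(self, x):\n        return 0.7  # placeholder score\n\nclass CorrectionModel:  # stands in for m'_a, trained for problem A'\n    def __init__(self, base, offset):\n        self.base = base      # hidden system dependency on m_a\n        self.offset = offset  # the small learned correction\n    def predict(self, x):\n        return self.base.predict(x) + self.offset\n\nm_a = BaseModel()\nm_prime_a = CorrectionModel(m_a, offset=-0.1)\nprint(m_prime_a.predict(None))  # 0.6\n# Improving m_a in isolation can now degrade m_prime_a, because the\n# correction was fit to m_a's old errors.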
\n\nUndeclared Consumers. Oftentimes, a prediction from a machine learning model m_a is made widely accessible, either at runtime or by writing to files or logs that may later be consumed by other systems. Without access controls, some of these consumers may be undeclared, silently using the output of a given model as an input to another system. In more classical software engineering, these issues are referred to as visibility debt [13].\n\nUndeclared consumers are expensive at best and dangerous at worst, because they create a hidden tight coupling of model m_a to other parts of the stack. Changes to m_a will very likely impact these other parts, potentially in ways that are unintended, poorly understood, and detrimental. In practice, this tight coupling can radically increase the cost and difficulty of making any changes to m_a at all, even if they are improvements. Furthermore, undeclared consumers may create hidden feedback loops, which are described in more detail in Section 4.\n\nUndeclared consumers may be difficult to detect unless the system is specifically designed to guard against this case, for example with access restrictions or strict service-level agreements (SLAs). In the absence of barriers, engineers will naturally use the most convenient signal at hand, especially when working against deadline pressures.\n\n3 Data Dependencies Cost More than Code Dependencies\n\nIn [13], dependency debt is noted as a key contributor to code complexity and technical debt in classical software engineering settings. We have found that data dependencies in ML systems carry a similar capacity for building debt, but may be more difficult to detect. Code dependencies can be identified via static analysis by compilers and linkers. Without similar tooling for data dependencies, it can be inappropriately easy to build large data dependency chains that can be difficult to untangle.\n\nUnstable Data Dependencies. To move quickly, it is often convenient to consume signals as input features that are produced by other systems. However, some input signals are unstable, meaning that they qualitatively or quantitatively change behavior over time. This can happen implicitly, when the input signal comes from another machine learning model that updates itself over time, or from a data-dependent lookup table, such as one for computing TF/IDF scores or semantic mappings. It can also happen explicitly, when the engineering ownership of the input signal is separate from the engineering ownership of the model that consumes it. In such cases, updates to the input signal may be made at any time. This is dangerous because even \u201cimprovements\u201d to input signals may have arbitrary detrimental effects in the consuming system that are costly to diagnose and address. For example, consider the case in which an input signal was previously mis-calibrated. The model consuming it likely fit to these mis-calibrations, and a silent update that corrects the signal will have sudden ramifications for the model.\n\nOne common mitigation strategy for unstable data dependencies is to create a versioned copy of a given signal. For example, rather than allowing a semantic mapping of words to topic clusters to change over time, it might be reasonable to create a frozen version of this mapping and use it until such a time as an updated version has been fully vetted. Versioning carries its own costs, however, such as potential staleness and the cost to maintain multiple versions of the same signal over time.
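\n\nA minimal sketch of signal versioning follows, assuming a simple in-process registry; the signal name and the toy topic mapping are invented for illustration:\n\nSIGNALS = {\n    ('word_topics', 'v1'): {'mouse': 'hardware'},\n    ('word_topics', 'v2'): {'mouse': 'animal'},  # upstream 'improvement'\n}\n\ndef lookup(name, version, key):\n    # Consumers pin an explicit version instead of 'latest', so an\n    # upstream update cannot silently change the model's inputs.\n    return SIGNALS[(name, version)].get(key)\n\nPINNED_VERSION = 'v1'  # bump only after the new version is fully vetted\nprint(lookup('word_topics', PINNED_VERSION, 'mouse'))  # 'hardware'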
\n\nUnderutilized Data Dependencies. In code, underutilized dependencies are packages that are mostly unneeded [13]. Similarly, underutilized data dependencies are input signals that provide little incremental modeling benefit. These can make an ML system unnecessarily vulnerable to change, sometimes catastrophically so, even though they could be removed with no detriment.\n\nAs an example, suppose that to ease the transition from an old product numbering scheme to new product numbers, both schemes are left in the system as features. New products get only a new number, but old products may have both and the model continues to rely on the old numbers for some products. A year later, the code that stops populating the database with the old numbers is deleted. This will not be a good day for the maintainers of the ML system.\n\nUnderutilized data dependencies can creep into a model in several ways.\n\n\u2022 Legacy Features. The most common case is that a feature F is included in a model early in its development. Over time, F is made redundant by new features, but this goes undetected.\n\n\u2022 Bundled Features. Sometimes, a group of features is evaluated and found to be beneficial. Because of deadline pressures or similar effects, all the features in the bundle are added to the model together, possibly including features that add little or no value.\n\n\u2022 \u03b5-Features. For machine learning researchers, it is tempting to improve model accuracy even when the accuracy gain is very small or when the complexity overhead might be high.\n\n\u2022 Correlated Features. Often two features are strongly correlated, but one is more directly causal. Many ML methods have difficulty detecting this and credit the two features equally, or may even pick the non-causal one. This results in brittleness if world behavior later changes the correlations.\n\nUnderutilized dependencies can be detected via exhaustive leave-one-feature-out evaluations. These should be run regularly to identify and remove unnecessary features.
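\n\nSuch an evaluation can be sketched in a few lines, assuming scikit-learn; the model and synthetic data here are placeholders for a real training pipeline:\n\nimport numpy as np\nfrom sklearn.datasets import make_classification\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import cross_val_score\n\nX, y = make_classification(n_samples=2000, n_features=8, random_state=0)\nmodel = LogisticRegression(max_iter=1000)\nbaseline = cross_val_score(model, X, y, cv=5).mean()\n\nfor j in range(X.shape[1]):\n    X_drop = np.delete(X, j, axis=1)  # leave feature j out\n    score = cross_val_score(model, X_drop, y, cv=5).mean()\n    if score >= baseline - 0.001:\n        # Dropping feature j costs essentially nothing: a candidate\n        # underutilized dependency to remove.\n        print('feature', j, 'looks underutilized:', round(score, 3))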
\n\nFigure 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex.\n\nStatic Analysis of Data Dependencies. In traditional code, compilers and build systems perform static analysis of dependency graphs. Tools for static analysis of data dependencies are far less common, but are essential for error checking, tracking down consumers, and enforcing migration and updates. One such tool is the automated feature management system described in [12], which enables data sources and features to be annotated. Automated checks can then be run to ensure that all dependencies have the appropriate annotations, and dependency trees can be fully resolved. This kind of tooling can make migration and deletion much safer in practice.\n\n4 Feedback Loops\n\nOne of the key features of live ML systems is that they often end up influencing their own behavior if they update over time. This leads to a form of analysis debt, in which it is difficult to predict the behavior of a given model before it is released. These feedback loops can take different forms, but they are all more difficult to detect and address if they occur gradually over time, as may be the case when models are updated infrequently.\n\nDirect Feedback Loops. A model may directly influence the selection of its own future training data. It is common practice to use standard supervised algorithms, although the theoretically correct solution would be to use bandit algorithms. The problem here is that bandit algorithms (such as contextual bandits [9]) do not necessarily scale well to the size of action spaces typically required for real-world problems. It is possible to mitigate these effects by using some amount of randomization [3], or by isolating certain parts of data from being influenced by a given model.\n\nHidden Feedback Loops. Direct feedback loops are costly to analyze, but at least they pose a statistical challenge that ML researchers may find natural to investigate [3]. A more difficult case is hidden feedback loops, in which two systems influence each other indirectly through the world.\n\nOne example of this may be if two systems independently determine facets of a web page, such as one selecting products to show and another selecting related reviews. Improving one system may lead to changes in behavior in the other, as users begin clicking more or less on the other components in reaction to the changes. Note that these hidden loops may exist between completely disjoint systems. Consider the case of two stock-market prediction models from two different investment companies. Improvements (or, more scarily, bugs) in one may influence the bidding and buying behavior of the other.\n\n5 ML-System Anti-Patterns\n\nIt may be surprising to the academic community to know that only a tiny fraction of the code in many ML systems is actually devoted to learning or prediction \u2013 see Figure 1. In the language of Lin and Ryaboy, much of the remainder may be described as \u201cplumbing\u201d [11].\n\nIt is unfortunately common for systems that incorporate machine learning methods to end up with high-debt design patterns. In this section, we examine several system-design anti-patterns [4] that can surface in machine learning systems and which should be avoided or refactored where possible.\n\nGlue Code. ML researchers tend to develop general-purpose solutions as self-contained packages. A wide variety of these are available as open-source packages at places like mloss.org, or from in-house code, proprietary packages, and cloud-based platforms.\n\nUsing generic packages often results in a glue code system design pattern, in which a massive amount of supporting code is written to get data into and out of general-purpose packages. Glue code is costly in the long term because it tends to freeze a system to the peculiarities of a specific package; testing alternatives may become prohibitively expensive. In this way, using a generic package can inhibit improvements, because it makes it harder to take advantage of domain-specific properties or to tweak the objective function to achieve a domain-specific goal. Because a mature system might end up being (at most) 5% machine learning code and (at least) 95% glue code, it may be less costly to create a clean native solution rather than re-use a generic package.\n\nAn important strategy for combating glue code is to wrap black-box packages into common APIs. This allows supporting infrastructure to be more reusable and reduces the cost of changing packages.
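\n\nAs a rough sketch of such a common API, the interface below is one plausible shape, not a prescribed design; the wrapped \u201cpackage\u201d is a stand-in, not a real library:\n\nfrom abc import ABC, abstractmethod\n\nclass Model(ABC):\n    @abstractmethod\n    def fit(self, X, y): ...\n    @abstractmethod\n    def predict(self, X): ...\n\nclass PackageAModel(Model):\n    # Adapter around one black-box package; only this class knows its quirks.\n    def fit(self, X, y):\n        self.mean = sum(y) / len(y)  # trivial placeholder 'training'\n    def predict(self, X):\n        return [self.mean for _ in X]\n\ndef train_and_serve(model, X, y):\n    # Infrastructure depends only on the Model interface, so swapping\n    # packages means writing one new adapter, not rewriting glue code.\n    model.fit(X, y)\n    return model.predict(X)\n\nprint(train_and_serve(PackageAModel(), [[0], [1]], [0.0, 1.0]))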
\n\nPipeline Jungles. As a special case of glue code, pipeline jungles often appear in data preparation. These can evolve organically, as new signals are identified and new information sources are added incrementally. Without care, the resulting system for preparing data in an ML-friendly format may become a jungle of scrapes, joins, and sampling steps, often with intermediate file outputs. Managing these pipelines, detecting errors, and recovering from failures are all difficult and costly [1]. Testing such pipelines often requires expensive end-to-end integration tests. All of this adds to the technical debt of a system and makes further innovation more costly.\n\nPipeline jungles can only be avoided by thinking holistically about data collection and feature extraction. The clean-slate approach of scrapping a pipeline jungle and redesigning from the ground up is indeed a major investment of engineering effort, but one that can dramatically reduce ongoing costs and speed further innovation.\n\nGlue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated \u201cresearch\u201d and \u201cengineering\u201d roles. When ML packages are developed in an ivory-tower setting, the result may appear like black boxes to the teams that employ them in practice. A hybrid research approach where engineers and researchers are embedded together on the same teams (and indeed, are often the same people) can help reduce this source of friction significantly [16].\n\nDead Experimental Codepaths. A common consequence of glue code or pipeline jungles is that it becomes increasingly attractive in the short term to perform experiments with alternative methods by implementing experimental codepaths as conditional branches within the main production code. For any individual change, the cost of experimenting in this manner is relatively low; none of the surrounding infrastructure needs to be reworked. However, over time, these accumulated codepaths can create a growing debt due to the increasing difficulties of maintaining backward compatibility and an exponential increase in cyclomatic complexity. Testing all possible interactions between codepaths becomes difficult or impossible. A famous example of the dangers here was Knight Capital\u2019s system losing $465 million in 45 minutes, apparently because of unexpected behavior from obsolete experimental codepaths [15].\n\nAs with the case of dead flags in traditional software [13], it is often beneficial to periodically re-examine each experimental branch to see what can be ripped out. Often only a small subset of the possible branches is actually used; many others may have been tested once and abandoned.\n\nAbstraction Debt. The above issues highlight the fact that there is a distinct lack of strong abstractions to support ML systems. Zheng recently made a compelling comparison of the state of ML abstractions to the state of database technology [17], making the point that nothing in the machine learning literature comes close to the success of the relational database as a basic abstraction. What is the right interface to describe a stream of data, or a model, or a prediction?\n\nFor distributed learning in particular, there remains a lack of widely accepted abstractions. It could be argued that the widespread use of Map-Reduce in machine learning was driven by the void of strong distributed learning abstractions. Indeed, one of the few areas of broad agreement in recent years appears to be that Map-Reduce is a poor abstraction for iterative ML algorithms.\n\nThe parameter-server abstraction seems much more robust, but there are multiple competing specifications of this basic idea [5, 10]. The lack of standard abstractions makes it all too easy to blur the lines between components.\n\nCommon Smells. In software engineering, a design smell may indicate an underlying problem in a component or system [7]. We identify a few ML system smells, not as hard-and-fast rules, but as subjective indicators.\n\n\u2022 Plain-Old-Data Type Smell. The rich information used and produced by ML systems is all too often encoded with plain data types like raw floats and integers. In a robust system, a model parameter should know if it is a log-odds multiplier or a decision threshold, and a prediction should know various pieces of information about the model that produced it and how it should be consumed (see the sketch after this list).\n\n\u2022 Multiple-Language Smell. It is often tempting to write a particular piece of a system in a given language, especially when that language has a convenient library or syntax for the task at hand. However, using multiple languages often increases the cost of effective testing and can increase the difficulty of transferring ownership to other individuals.\n\n\u2022 Prototype Smell. It is convenient to test new ideas at small scale via prototypes. However, regularly relying on a prototyping environment may be an indicator that the full-scale system is brittle, difficult to change, or could benefit from improved abstractions and interfaces. Maintaining a prototyping environment carries its own cost, and there is a significant danger that time pressures may encourage a prototyping system to be used as a production solution. Additionally, results found at small scale rarely reflect the reality at full scale.
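\n\nOne way to address the plain-old-data smell is a small structured type; the following dataclass is an illustrative sketch, not a prescribed schema:\n\nfrom dataclasses import dataclass\n\n@dataclass(frozen=True)\nclass Prediction:\n    score: float        # e.g., a probability, not a bare float\n    kind: str           # what the number means: 'probability', 'log_odds', ...\n    model_id: str       # which model produced it\n    model_version: str  # and which version of that model\n    threshold: float    # the decision threshold it was calibrated against\n\np = Prediction(score=0.92, kind='probability', model_id='spam_filter',\n               model_version='v7', threshold=0.8)\nprint(p.score >= p.threshold)  # consumers need not guess the semantics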
\n\n6 Configuration Debt\n\nAnother potentially surprising area where debt can accumulate is in the configuration of machine learning systems. Any large system has a wide range of configurable options, including which features are used, how data is selected, a wide variety of algorithm-specific learning settings, potential pre- or post-processing, verification methods, etc. We have observed that both researchers and engineers may treat configuration (and extension of configuration) as an afterthought. Indeed, verification or testing of configurations may not even be seen as important. In a mature system that is being actively developed, the number of lines of configuration can far exceed the number of lines of traditional code. Each configuration line has a potential for mistakes.\n\nConsider the following examples. Feature A was incorrectly logged from 9/14 to 9/17. Feature B is not available on data before 10/7. The code used to compute feature C has to change for data before and after 11/1 because of changes to the logging format. Feature D is not available in production, so substitute features D\u2032 and D\u2032\u2032 must be used when querying the model in a live setting. If feature Z is used, then jobs for training must be given extra memory due to lookup tables or they will train inefficiently. Feature Q precludes the use of feature R because of latency constraints.\n\nAll this messiness makes configuration hard to modify correctly, and hard to reason about. However, mistakes in configuration can be costly, leading to serious loss of time, waste of computing resources, or production issues. This leads us to articulate the following principles of good configuration systems (illustrated in the sketch after this list):\n\n\u2022 It should be easy to specify a configuration as a small change from a previous configuration.\n\n\u2022 It should be hard to make manual errors, omissions, or oversights.\n\n\u2022 It should be easy to see, visually, the difference in configuration between two models.\n\n\u2022 It should be easy to automatically assert and verify basic facts about the configuration: number of features used, transitive closure of data dependencies, etc.\n\n\u2022 It should be possible to detect unused or redundant settings.\n\n\u2022 Configurations should undergo a full code review and be checked into a repository.
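\n\nA minimal sketch of several of these principles, with invented option names, might look as follows:\n\nBASE = {'features': ['a', 'b', 'c'], 'learning_rate': 0.1, 'l2': 0.0001}\n\ndef derive(base, **overrides):\n    # A new configuration is a small, explicit delta from a previous one.\n    cfg = dict(base)\n    cfg.update(overrides)\n    # Cheap automated assertions catch manual errors and omissions early.\n    assert cfg['learning_rate'] > 0\n    assert len(set(cfg['features'])) == len(cfg['features'])\n    return cfg\n\nexperiment = derive(BASE, learning_rate=0.05)\n# The difference between two configurations is trivial to inspect visually.\ndiff = {k: (BASE[k], experiment[k]) for k in BASE if BASE[k] != experiment[k]}\nprint(diff)  # {'learning_rate': (0.1, 0.05)}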
\n\n7 Dealing with Changes in the External World\n\nOne of the things that makes ML systems so fascinating is that they often interact directly with the external world. Experience has shown that the external world is rarely stable. This background rate of change creates ongoing maintenance cost.\n\nFixed Thresholds in Dynamic Systems. It is often necessary to pick a decision threshold for a given model to perform some action: to predict true or false, to mark an email as spam or not spam, to show or not show a given ad. One classic approach in machine learning is to choose a threshold from a set of possible thresholds, in order to get good tradeoffs on certain metrics, such as precision and recall. However, such thresholds are often manually set. Thus if a model updates on new data, the old manually set threshold may be invalid. Manually updating many thresholds across many models is time-consuming and brittle. One mitigation strategy for this kind of problem appears in [14], in which thresholds are learned via simple evaluation on heldout validation data.\n\nMonitoring and Testing. Unit testing of individual components and end-to-end tests of running systems are valuable, but in the face of a changing world such tests are not sufficient to provide evidence that a system is working as intended. Comprehensive live monitoring of system behavior in real time combined with automated response is critical for long-term system reliability.\n\nThe key question is: what to monitor? Testable invariants are not always obvious given that many ML systems are intended to adapt over time. We offer the following starting points.\n\n\u2022 Prediction Bias. In a system that is working as intended, it should usually be the case that the distribution of predicted labels is equal to the distribution of observed labels. This is by no means a comprehensive test, as it can be met by a null model that simply predicts average values of label occurrences without regard to the input features. However, it is a surprisingly useful diagnostic, and changes in metrics such as this are often indicative of an issue that requires attention. For example, this method can help to detect cases in which the world behavior suddenly changes, making training distributions drawn from historical data no longer reflective of current reality. Slicing prediction bias by various dimensions isolates issues quickly, and can also be used for automated alerting (a minimal version of this check is sketched after this list).\n\n\u2022 Action Limits. In systems that are used to take actions in the real world, such as bidding on items or marking messages as spam, it can be useful to set and enforce action limits as a sanity check. These limits should be broad enough not to trigger spuriously. If the system hits a limit for a given action, automated alerts should fire and trigger manual intervention or investigation.\n\n\u2022 Up-Stream Producers. Data is often fed through to a learning system from various up-stream producers. These up-stream processes should be thoroughly monitored and tested, and should routinely meet a service level objective that takes the downstream ML system\u2019s needs into account. Further, any up-stream alerts must be propagated to the control plane of an ML system to ensure its accuracy. Similarly, any failure of the ML system to meet established service level objectives should also be propagated down-stream to all consumers, and directly to their control planes if at all possible.\n\nBecause external changes occur in real time, responses must also occur in real time. Relying on human intervention in response to alert pages is one strategy, but can be brittle for time-sensitive issues. Creating systems that allow automated response without direct human intervention is often well worth the investment.
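\n\nA minimal sliced prediction-bias check might look like the following sketch; the tolerance and slice key are illustrative choices, not recommendations:\n\ndef prediction_bias_alerts(rows, tolerance=0.05):\n    # rows: iterable of (slice_key, predicted_probability, observed_label)\n    by_slice = {}\n    for key, pred, label in rows:\n        s = by_slice.setdefault(key, [0.0, 0.0, 0])\n        s[0] += pred\n        s[1] += label\n        s[2] += 1\n    alerts = []\n    for key, (pred_sum, label_sum, n) in by_slice.items():\n        bias = pred_sum / n - label_sum / n  # mean predicted minus mean observed\n        if abs(bias) > tolerance:\n            alerts.append((key, round(bias, 3)))\n    return alerts  # feed into automated alerting\n\nrows = [('US', 0.9, 1), ('US', 0.1, 0), ('BR', 0.9, 0), ('BR', 0.8, 0)]\nprint(prediction_bias_alerts(rows))  # [('BR', 0.85)]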
\n\n8 Other Areas of ML-related Debt\n\nWe now briefly highlight some additional areas where ML-related technical debt may accrue.\n\nData Testing Debt. If data replaces code in ML systems, and code should be tested, then it seems clear that some amount of testing of input data is critical to a well-functioning system. Basic sanity checks are useful, as are more sophisticated tests that monitor changes in input distributions.
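\n\nAs a sketch of such a test, the following combines a basic sanity check with a crude drift check against a reference window; the feature name, reference statistics, and thresholds are all invented for illustration:\n\nimport math\n\ndef check_feature(name, values, ref_mean, ref_std, max_shift=3.0):\n    # Basic sanity check: no missing values slipped through the pipeline.\n    assert all(not math.isnan(v) for v in values), name + ': NaN values'\n    # Crude distribution check: alert if the mean drifts too far from a\n    # vetted historical reference window.\n    mean = sum(values) / len(values)\n    shift = abs(mean - ref_mean) / max(ref_std, 1e-9)\n    if shift > max_shift:\n        raise ValueError(name + ': input distribution drifted')\n\ncheck_feature('query_length', [4.1, 3.8, 4.4], ref_mean=4.0, ref_std=0.5)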
\n\nReproducibility Debt. As scientists, it is important that we can re-run experiments and get similar results, but designing real-world systems to allow for strict reproducibility is a task made difficult by randomized algorithms, non-determinism inherent in parallel learning, reliance on initial conditions, and interactions with the external world.\n\nProcess Management Debt. Most of the use cases described in this paper have talked about the cost of maintaining a single model, but mature systems may have dozens or hundreds of models running simultaneously [14, 6]. This raises a wide range of important problems, including the problem of updating many configurations for many similar models safely and automatically, how to manage and assign resources among models with different business priorities, and how to visualize and detect blockages in the flow of data in a production pipeline. Developing tooling to aid recovery from production incidents is also critical. An important system-level smell to avoid is a common process with many manual steps.\n\nCultural Debt. There is sometimes a hard line between ML research and engineering, but this can be counter-productive for long-term system health. It is important to create team cultures that reward deletion of features, reduction of complexity, improvements in reproducibility, stability, and monitoring to the same degree that improvements in accuracy are valued. In our experience, this is most likely to occur within heterogeneous teams with strengths in both ML research and engineering.\n\n9 Conclusions: Measuring Debt and Paying it Off\n\nTechnical debt is a useful metaphor, but it unfortunately does not provide a strict metric that can be tracked over time. How are we to measure technical debt in a system, or to assess the full cost of this debt? Simply noting that a team is still able to move quickly is not in itself evidence of low debt or good practices, since the full cost of debt becomes apparent only over time. Indeed, moving quickly often introduces technical debt. A few useful questions to consider are:\n\n\u2022 How easily can an entirely new algorithmic approach be tested at full scale?\n\n\u2022 What is the transitive closure of all data dependencies?\n\n\u2022 How precisely can the impact of a new change to the system be measured?\n\n\u2022 Does improving one model or signal degrade others?\n\n\u2022 How quickly can new members of the team be brought up to speed?\n\nWe hope that this paper may serve to encourage additional development in the areas of maintainable ML, including better abstractions, testing methodologies, and design patterns. Perhaps the most important insight to be gained is that technical debt is an issue that engineers and researchers both need to be aware of. Research solutions that provide a tiny accuracy benefit at the cost of massive increases in system complexity are rarely wise practice. Even the addition of one or two seemingly innocuous data dependencies can slow further progress.\n\nPaying down ML-related technical debt requires a specific commitment, which can often only be achieved by a shift in team culture. Recognizing, prioritizing, and rewarding this effort is important for the long-term health of successful ML teams.\n\nAcknowledgments\n\nThis paper owes much to the important lessons learned day to day in a culture that values both innovative ML research and strong engineering practice. Many colleagues have helped shape our thoughts here, and the benefit of accumulated folk wisdom cannot be overstated. We would like to specifically recognize the following: Roberto Bayardo, Luis Cobo, Sharat Chikkerur, Jeff Dean, Philip Henderson, Arnar Mar Hrafnkelsson, Ankur Jain, Joe Kovac, Jeremy Kubica, H. Brendan McMahan, Satyaki Mahalanabis, Lan Nie, Michael Pohl, Abdul Salem, Sajid Siddiqi, Ricky Shan, Alan Skelly, Cory Williams, and Andrew Young.\n\nA short version of this paper was presented at the SE4ML workshop in 2014 in Montreal, Canada.\n\nReferences\n\n[1] R. Ananthanarayanan, V. Basker, S. Das, A. Gupta, H. Jiang, T. Qiu, A. Reznichenko, D. Ryabkov, M. Singh, and S. Venkataraman. Photon: Fault-tolerant and scalable joining of continuous data streams. In SIGMOD \u201913: Proceedings of the 2013 International Conference on Management of Data, pages 577\u2013588, New York, NY, USA, 2013.\n\n[2] A. Anonymous. Machine learning: The high-interest credit card of technical debt. SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).\n\n[3] L. Bottou, J. Peters, J. Qui\u00f1onero Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14(Nov), 2013.\n\n[4] W. J. Brown, H. W. McCormick, T. J. Mowbray, and R. C. Malveau. AntiPatterns: Refactoring Software, Architectures, and Projects in Crisis. 1998.\n\n[5] T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI \u201914, Broomfield, CO, USA, October 6-8, 2014, pages 571\u2013582, 2014.\n\n[6] B. Dalessandro, D. Chen, T. Raeder, C. Perlich, M. Han Williams, and F. Provost. Scalable hands-free transfer learning for online advertising. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1573\u20131582. ACM, 2014.\n\n[7] M. Fowler. Code smells. http://martinfowler.com/bliki/CodeSmell.html.\n\n[8] M. Fowler. Refactoring: Improving the Design of Existing Code. Pearson Education India, 1999.\n\n[9] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817\u2013824, 2008.\n\n[10] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI \u201914, Broomfield, CO, USA, October 6-8, 2014, pages 583\u2013598, 2014.\n\n[11] J. Lin and D. Ryaboy. Scaling big data mining infrastructure: The Twitter experience. ACM SIGKDD Explorations Newsletter, 14(2):6\u201319, 2013.
\n\n[12] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos, and J. Kubica. Ad click prediction: A view from the trenches. In The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, Chicago, IL, USA, August 11-14, 2013, 2013.\n\n[13] J. D. Morgenthaler, M. Gridnev, R. Sauciuc, and S. Bhansali. Searching for build debt: Experiences managing technical debt at Google. In Proceedings of the Third International Workshop on Managing Technical Debt, 2012.\n\n[14] D. Sculley, M. E. Otey, M. Pohl, B. Spitznagel, J. Hainsworth, and Y. Zhou. Detecting adversarial advertisements in the wild. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011, 2011.\n\n[15] U.S. Securities and Exchange Commission. SEC charges Knight Capital with violations of market access rule, 2013.\n\n[16] A. Spector, P. Norvig, and S. Petrov. Google\u2019s hybrid approach to research. Communications of the ACM, 55(7), 2012.\n\n[17] A. Zheng. The challenges of building machine learning tools for the masses. SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).", "award": [], "sourceid": 1486, "authors": [{"given_name": "D.", "family_name": "Sculley", "institution": "Google Research"}, {"given_name": "Gary", "family_name": "Holt", "institution": null}, {"given_name": "Daniel", "family_name": "Golovin", "institution": "Google, Inc."}, {"given_name": "Eugene", "family_name": "Davydov", "institution": "Google, Inc."}, {"given_name": "Todd", "family_name": "Phillips", "institution": "Google, Inc."}, {"given_name": "Dietmar", "family_name": "Ebner", "institution": null}, {"given_name": "Vinay", "family_name": "Chaudhary", "institution": "Google, Inc."}, {"given_name": "Michael", "family_name": "Young", "institution": "Google, Inc."}, {"given_name": "Jean-Fran\u00e7ois", "family_name": "Crespo", "institution": "Google, Inc."}, {"given_name": "Dan", "family_name": "Dennison", "institution": "Google, Inc."}]}