{"title": "Data Cleansing for Models Trained with SGD", "book": "Advances in Neural Information Processing Systems", "page_first": 4213, "page_last": 4222, "abstract": "Data cleansing is a typical approach used to improve the accuracy of machine learning models, which, however, requires extensive domain knowledge to identify the influential instances that affect the models. In this paper, we propose an algorithm that can identify influential instances without using any domain knowledge. The proposed algorithm automatically cleans the data, which does not require any of the users' knowledge. Hence, even non-experts can improve the models. The existing methods require the loss function to be convex and an optimal model to be obtained, which is not always the case in modern machine learning. To overcome these limitations, we propose a novel approach specifically designed for the models trained with stochastic gradient descent (SGD). The proposed method infers the influential instances by retracing the steps of the SGD while incorporating intermediate models computed in each step. Through experiments, we demonstrate that the proposed method can accurately infer the influential instances. Moreover, we used MNIST and CIFAR10 to show that the models can be effectively improved by removing the influential instances suggested by the proposed method.", "full_text": "Data Cleansing for Models Trained with SGD\n\nSatoshi Hara\u21e4\n\nAtsushi Nitanda\u2020\n\nTakanori Maehara\u2021\n\nAbstract\n\nData cleansing is a typical approach used to improve the accuracy of machine\nlearning models, which, however, requires extensive domain knowledge to identify\nthe in\ufb02uential instances that affect the models. In this paper, we propose an algo-\nrithm that can identify in\ufb02uential instances without using any domain knowledge.\nThe proposed algorithm automatically cleans the data, which does not require any\nof the users\u2019 knowledge. 
Hence, even non-experts can improve the models. The\nexisting methods require the loss function to be convex and an optimal model to be\nobtained, which is not always the case in modern machine learning. To overcome\nthese limitations, we propose a novel approach speci\ufb01cally designed for the models\ntrained with stochastic gradient descent (SGD). The proposed method infers the\nin\ufb02uential instances by retracing the steps of the SGD while incorporating interme-\ndiate models computed in each step. Through experiments, we demonstrate that\nthe proposed method can accurately infer the in\ufb02uential instances. Moreover, we\nused MNIST and CIFAR10 to show that the models can be effectively improved\nby removing the in\ufb02uential instances suggested by the proposed method.\n\n1\n\nIntroduction\n\nBuilding accurate models is one of the fundamental goals in machine learning. If the obtained model\nis not satisfactory, users try to improve the model in several ways such as by modifying input features,\ncleansing data, or even by gathering additional data. Error analysis [Ng, 2017] is a typical approach\nfor this purpose. In this analysis, the users hypothesize the cause of model\u2019s failure by investigating\nimportant features or examining the misclassi\ufb01ed instances. However, a good hypothesis requires\nexperience and domain knowledge. Therefore, it is dif\ufb01cult for non-domain experts or non-machine\nlearning specialists to build accurate models.\nHow can we help non-experts to build accurate machine learning models? In this study, we focus on\nthe following data cleansing problem that removes \u201charmful\u201d instances from the training set.\nProblem 1 (Data Cleansing). 
Find a subset of the training instances such that the trained model\nobtained after removing the subset has a better accuracy.\n\nCurrently, the users hypothesize the training instances that can have certain in\ufb02uences on the resulting\nmodels by inspecting instances based on the domain knowledge. Our aim is to develop an algorithm\nthat can identify in\ufb02uential instances without using any domain knowledge. With such an algorithm,\nthe users do not need to hypothesize in\ufb02uential instances. Instead, the algorithm automatically\ncleans the data, which does not require any of the users\u2019 knowledge. Hence, with this process, even\nnon-experts can improve the models.\nFor data cleansing, we need to determine the training instances that affect the model. In the literature\nof statistics, an in\ufb02uential instance is de\ufb01ned as the instance that leads to a distinct model from the\ncurrent model if the corresponding instance is absent [Cook, 1977]. A naive approach to determine\n\n\u21e4satohara@ar.sanken.osaka-u.ac.jp, Osaka University, Japan\n\u2020nitanda@mist.i.u-tokyo.ac.jp, The University of Tokyo, Japan\n\u2021takanori.maehara@riken.jp, RIKEN AIP, Japan\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fthese in\ufb02uential instances is, therefore, to retrain the model by leaving every one instance out of the\ntraining set, which can be computationally very demanding. To ef\ufb01ciently infer an in\ufb02uential instance\nwithout retraining, the convexity of the loss function plays an important role. Pioneering studies by\nBeckman and Trussell [1974], Cook [1977], and Pregibon [1981] have shown that, for some convex\nloss functions, the in\ufb02uential instances can be inferred without model retraining by utilizing the\noptimality condition on the training loss, given that an optimal model is obtained. 
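As a point of reference, the naive leave-one-out retraining described above can be sketched in a few lines (a toy illustration of my own with ordinary least squares standing in for the model; this is not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 40, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

def fit(A, b):
    # ordinary least squares as a stand-in for "training the model"
    return np.linalg.lstsq(A, b, rcond=None)[0]

theta_full = fit(X, y)

# naive influence in the sense of Cook [1977]: retrain N times,
# each time leaving one instance out, and measure the parameter change
influences = np.array([
    np.linalg.norm(fit(np.delete(X, j, axis=0), np.delete(y, j)) - theta_full)
    for j in range(N)
])
most_influential = int(np.argmax(influences))
```

This costs N retrainings; the methods discussed next are precisely about avoiding that cost.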
A recent study\nby Koh and Liang [2017] further generalized these approaches to any smooth and strongly convex\nloss functions by incorporating the idea of in\ufb02uence function [Cook and Weisberg, 1980] in robust\nstatistics (see Section 6).\nThe focus of this study is to go beyond the convexity and optimality. We aim to develop an algorithm\nthat can infer in\ufb02uential instances even for non-convex objectives such as deep neural networks.\nTo this end, we propose a completely different approach to infer the in\ufb02uential instances. The\nproposed approach is based on the stochastic gradient descent (SGD). Modern machine learning\nmodels including deep neural networks are trained using SGD and its variants. Our idea is to rede\ufb01ne\nthe notion of in\ufb02uence for the models trained with SGD, which we named SGD-in\ufb02uence. Based on\nSGD-in\ufb02uence, we propose a method that infers the in\ufb02uential instances without model retraining.\nThe proposed method is based solely on the analysis of SGD. Different from the existing methods,\nthe proposed method does not require the optimality conditions to hold true on the obtained models.\nThe proposed method is therefore suitable to the SGD context where we no longer look for the exact\noptimum of the training loss. In SGD, we instead look for the minimum error on the validation set,\nwhich leads to early stopping of the optimization that can violate the optimality condition.\nIn summary, the contribution of this study is threefold.\n\u2022 We propose a new de\ufb01nition of the in\ufb02uence, which we name as SGD-in\ufb02uence, for the models\ntrained with SGD. SGD-in\ufb02uence is de\ufb01ned based on the counterfactual effect: what if an instance\nis absent in SGD, how largely will the resulting model change?\n\n\u2022 We propose a novel estimator of SGD-in\ufb02uence based on the analysis of SGD. We then construct\na proposed in\ufb02uence estimation algorithm based on this estimator. 
We also study the estimation error of the proposed estimator on both convex and non-convex loss functions.
• Through experiments, we demonstrate that the proposed method can accurately infer the influential instances. Moreover, we used MNIST and CIFAR10 to show that the models can be effectively improved by removing the influential instances suggested by the proposed method.

2 Preliminaries

Notations For vectors a, b ∈ R^p, we denote the inner product by ⟨a, b⟩ = Σ_{i=1}^p a_i b_i, and the norm by ‖a‖ = √⟨a, a⟩. For a function f(θ) with θ ∈ R^p, we denote its derivative by ∇_θ f(θ).
Supervised Learning Let z = (x, y) ∈ R^d × Y be an observation, which is a pair of a d-dimensional input feature vector x and an output y in a certain domain Y (e.g., Y = R for regression, and Y = {−1, 1} for binary classification). The objective of learning is to find a model f(x; θ) that well approximates the output as y ≈ f(x; θ). Here, θ ∈ R^p is a parameter of the model.
Let D := {z_n = (x_n, y_n)}_{n=1}^N be a training set with independent and identically distributed observations. We denote the loss function for an instance z with the parameter θ by ℓ(z; θ). The learning problem is then denoted as

  θ̂ = argmin_{θ ∈ R^p} (1/N) Σ_{n=1}^N ℓ(z_n; θ).    (1)

SGD Let g(z; θ) := ∇_θ ℓ(z; θ). SGD starts the optimization from the initial parameter θ^[1]. The update rule of mini-batch SGD at the t-th step for the problem (1) is given by θ^[t+1] ← θ^[t] − (η_t/|S_t|) Σ_{i ∈ S_t} g(z_i; θ^[t]), where S_t denotes the set of instance indices used in the t-th step, and η_t > 0 is the learning rate. We denote the total number of SGD steps by T.

3 SGD-Influence

We propose a novel notion of influence for the models trained with SGD, which we name SGD-influence.
We then formalize the influence estimation problem we consider in this paper.
We define SGD-influence based on the following counterfactual SGD where one instance is absent.
Definition 2 (Counterfactual SGD). The counterfactual SGD starts the optimization from the same initial parameter as the ordinary SGD, θ^[1]_{-j} = θ^[1]. The t-th step of the counterfactual SGD with the j-th instance z_j absent is defined by θ^[t+1]_{-j} ← θ^[t]_{-j} − (η_t/|S_t|) Σ_{i ∈ S_t\{j}} g(z_i; θ^[t]_{-j}).
Definition 3 (SGD-Influence). We refer to the parameter difference θ^[t]_{-j} − θ^[t] as the SGD-influence of the instance z_j ∈ D at step t.
It should be noted that SGD-influence can be defined at every step of SGD, even for non-optimal models. Thus, SGD-influence is a suitable notion of influence for the cases where we no longer look for the exact optimum of (1). In this study, we specifically focus on estimating an inner product of a query vector u ∈ R^p and the SGD-influence after T SGD steps, as follows.
Problem 4 (Linear Influence Estimation (LIE)). For a given query vector u ∈ R^p, estimate the linear influence L^[T]_{-j}(u) := ⟨u, θ^[T]_{-j} − θ^[T]⟩.
LIE includes several important applications (see [Koh and Liang, 2017]). One important application is influence estimation on the loss. If we take u = ∇_θ ℓ(x; θ^[T]) for an input x, LIE amounts to estimating the change in loss: L^[T]_{-j}(∇_θ ℓ(x; θ^[T])) ≈ ℓ(x; θ^[T]_{-j}) − ℓ(x; θ^[T]). A negative L^[T]_{-j}(∇_θ ℓ(x; θ^[T])) indicates that the loss on the input x can be decreased by removing z_j.
Note that the SGD-influence, as well as the linear influence, can be computed exactly by running the counterfactual SGD for all z_j ∈ D. However, this requires running SGD N times, which is computationally demanding even for N ≈ 100. Therefore, our goal is to develop an estimation algorithm for LIE which does not require running SGD multiple times.

4 Estimating SGD-Influence

In this section, we present our proposed estimator of SGD-influence and show its theoretical properties. We then derive an algorithm for LIE based on the estimator in the next section.

4.1 Proposed Estimator

We estimate SGD-influence using the first-order Taylor approximation of the gradient. Here, we assume that the loss function ℓ(z; θ) is twice differentiable. For a step t in which z_j is not used, we then obtain

  θ^[t+1]_{-j} − θ^[t+1] = (θ^[t]_{-j} − θ^[t]) − (η_t/|S_t|) Σ_{i ∈ S_t} (∇_θ ℓ(z_i; θ^[t]_{-j}) − ∇_θ ℓ(z_i; θ^[t])) ≈ (I − η_t H^[t]) (θ^[t]_{-j} − θ^[t]),

where we used the approximation ∇_θ ℓ(z_i; θ^[t]_{-j}) − ∇_θ ℓ(z_i; θ^[t]) ≈ H^[t] (θ^[t]_{-j} − θ^[t]), H^[t] := (1/|S_t|) Σ_{i ∈ S_t} ∇²_θ ℓ(z_i; θ^[t]) is the Hessian of the loss on the mini-batch S_t, and I denotes the identity matrix.
We construct an estimator for the SGD-influence based on this approximation. For simplicity, here we focus on one-epoch SGD where each instance appears only once. Let Z_t := I − η_t H^[t], and let π(j) be the SGD step where the instance z_j is used. By recursively applying the approximation and recalling that θ^[π(j)+1]_{-j} − θ^[π(j)+1] = (η_{π(j)}/|S_{π(j)}|) g(z_j; θ^[π(j)]), we obtain the following estimator:

  θ^[T]_{-j} − θ^[T] ≈ (η_{π(j)}/|S_{π(j)}|) Z_{T−1} Z_{T−2} ··· Z_{π(j)+1} g(z_j; θ^[π(j)]) =: Δθ_j.    (2)

4.2 Properties of Δθ_j

Here, we evaluate the estimation error of the proposed estimator Δθ_j for both convex and non-convex loss functions.
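Before analyzing the estimation error, it may help to see the counterfactual SGD of Definition 2 and the estimator (2) in action. The following self-contained numpy sketch (a toy setup of my own, not code from the paper) uses the quadratic loss ℓ(z; θ) = ½‖θ − z‖², whose Hessian is the identity; for this loss the first-order approximation behind (2) is exact, so the estimator reproduces the SGD-influence up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, eta = 12, 3, 0.1
Z = rng.normal(size=(N, p))                                # training instances z_i
batches = [list(range(i, i + 3)) for i in range(0, N, 3)]  # one epoch, |S_t| = 3

def grad(z, theta):
    # gradient of l(z; theta) = 0.5 * ||theta - z||^2
    return theta - z

j = 4                          # instance whose influence we want
theta = np.zeros(p)            # ordinary SGD
theta_j = np.zeros(p)          # counterfactual SGD (z_j absent, same batches)
thetas = []                    # stored parameters, as in the training phase
for S in batches:
    thetas.append(theta.copy())
    g = np.mean([grad(Z[i], theta) for i in S], axis=0)
    g_cf = sum(grad(Z[i], theta_j) for i in S if i != j) / len(S)
    theta, theta_j = theta - eta * g, theta_j - eta * g_cf

true_influence = theta_j - theta   # SGD-influence at the final step

# estimator (2): for this loss every Z_t = I - eta * H^[t] = (1 - eta) I,
# so the matrix product collapses to a scalar factor
pi_j = next(t for t, S in enumerate(batches) if j in S)
est = (eta / len(batches[pi_j])) \
    * (1 - eta) ** (len(batches) - 1 - pi_j) \
    * grad(Z[j], thetas[pi_j])
```

Here `est` and `true_influence` agree exactly; for general losses the gap between them is what the error analysis below bounds.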
A notable property of the estimator Δθ_j is that, unlike existing methods, its error can be evaluated even without assuming the convexity of the loss function ℓ(z; θ).
Convex Loss For smooth and strongly convex problems, there exists a uniform bound on the gap between the SGD-influence θ^[T]_{-j} − θ^[T] and the proposed estimator Δθ_j.
Theorem 5. Assume that ℓ(z; θ) is twice differentiable with respect to the parameter θ and that there exist λ, Λ > 0 such that λI ≼ ∇²_θ ℓ(z; θ) ≼ ΛI for all z, θ. If η_s ≤ 1/Λ, then we get

  ‖(θ^[T]_{-j} − θ^[T]) − Δθ_j‖ ≤ √(2 (h_j(λ)² + h_j(Λ)²)),    (3)

where h_j(a) := (η_{π(j)}/|S_{π(j)}|) ∏_{s=π(j)+1}^{T−1} (1 − η_s a) · ‖g(z_j; θ^[π(j)])‖.
Non-Convex Loss For non-convex loss functions, the aforementioned uniform bound no longer holds. However, we can still evaluate the growth of the estimation error. For simplicity, we consider a constant learning rate η = O(1/√T) that depends only on the number of total SGD steps T. It should be noted that SGD with this learning rate is theoretically justified to converge to a stationary point [Ghadimi and Lan, 2013]. The next theorem indicates that Δθ_j can approximate the SGD-influence well if the Hessian ∇²_θ ℓ(z; θ) is Lipschitz continuous.
Theorem 6. Assume that ℓ(z; θ) is twice differentiable and that ∇²_θ ℓ(z; θ) is L-Lipschitz continuous with respect to θ. Moreover, assume that ‖∇_θ ℓ(z; θ)‖ ≤ G and ∇²_θ ℓ(z; θ) ≼ ΛI for all z, θ. Consider SGD with a learning rate η = O(1/√T).
Then,

  ‖(θ^[T]_{-j} − θ^[T]) − Δθ_j‖ ≤ (2T G²L / Λ) · exp(O(Λ√T)).    (4)

5 Proposed Method for LIE

We now derive our proposed method for LIE. First, we extend the estimator Δθ_j to multi-epoch SGD. Let π_1(j), π_2(j), ..., π_K(j) be the steps where the instance z_j is used in K-epoch SGD. We estimate the effect of the step π_k(j) based on (2) as (η_{π_k(j)}/|S_{π_k(j)}|) Z_{T−1} Z_{T−2} ··· Z_{π_k(j)+1} g(z_j; θ^[π_k(j)]). We then add all the effects and derive the estimator Δθ_j = Σ_{k=1}^K (η_{π_k(j)}/|S_{π_k(j)}|) Z_{T−1} Z_{T−2} ··· Z_{π_k(j)+1} g(z_j; θ^[π_k(j)]).
Let u^[t] := Z_{t+1} Z_{t+2} ··· Z_{T−1} u. LIE based on the estimator Δθ_j is then obtained as

  ⟨u, Δθ_j⟩ = Σ_{k=1}^K ⟨u^[π_k(j)], (η_{π_k(j)}/|S_{π_k(j)}|) g(z_j; θ^[π_k(j)])⟩.

It should be noted that u^[t] can be computed recursively as u^[t] ← Z_{t+1} u^[t+1] = u^[t+1] − η_{t+1} H^[t+1] u^[t+1] by retracing the SGD. The proposed method is based on this recursive computation.
The proposed method consists of two phases, the training phase and the inference phase, as shown in Algorithms 1 and 2. In the training phase in Algorithm 1, while running SGD, we store the tuple of the instance indices S_t, the learning rate η_t, and the parameter θ^[t].⁴ In the inference phase in Algorithm 2, we retrace the stored information and compute u^[t] in each step.

Algorithm 1 LIE for SGD: Training Phase
  Initialize the parameter θ^[1]
  Initialize the sequence as null: A ← ∅
  for t = 1, 2, ..., T − 1 do
    A[t] ← (S_t, η_t, θ^[t])  // store information
    θ^[t+1] ← θ^[t] − (η_t/|S_t|) Σ_{i ∈ S_t} g(z_i; θ^[t])
  end for

Algorithm 2 LIE for SGD: Inference Phase
  Require: u ∈ R^p
  Initialize the influence: L̂^[T]_{-j}(u) ← 0, ∀j
  for t = T − 1, T − 2, ..., 1 do
    (S_t, η_t, θ^[t]) ← A[t]  // load information
    L̂^[T]_{-j}(u) += ⟨u, (η_t/|S_t|) g(z_j; θ^[t])⟩, ∀j ∈ S_t  // update the linear influence of z_j
    u ← u − η_t H^[t] u  // update u
  end for

Note that, in Algorithm 2, we need to compute H^[t] u^[t]. A naive implementation requires O(p²) memory to store the matrix H^[t], which can be prohibitive for very large models. We can avoid this difficulty by directly computing H^[t] u^[t] without the explicit computation of H^[t].
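As a concrete illustration of such a matrix-free product (an example of my own using the closed-form Hessian of the logistic loss, rather than the automatic differentiation employed in the paper), the product Hu can be assembled as Xᵀ(w ⊙ (Xu))/n in O(np) time and O(p) extra memory:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 64, 10
X = rng.normal(size=(n, p))
theta = rng.normal(size=p)
u = rng.normal(size=p)

s = 1.0 / (1.0 + np.exp(-X @ theta))  # sigmoid outputs
w = s * (1.0 - s)                     # per-instance curvature weights

# matrix-free Hessian-vector product for the mean logistic loss:
# H = X^T diag(w) X / n, hence H u = X^T (w * (X u)) / n
Hu_fast = X.T @ (w * (X @ u)) / n

# explicit Hessian for comparison (O(p^2) memory)
H = (X.T * w) @ X / n
```

Both routes give the same vector; only the memory and time footprints differ.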
Because H^[t] u^[t] = (1/|S_t|) Σ_{i ∈ S_t} ∇_θ ⟨u^[t], ∇_θ ℓ(z_i; θ^[t])⟩, we only need to compute the derivative of ⟨u^[t], ∇_θ ℓ(z_i; θ^[t])⟩, which does not require the explicit computation of H^[t]. For example, in TensorFlow, this can be implemented in a few lines.⁵ The time complexity of the inference phase is O(TMτ), where M is the largest batch size in SGD and τ is the complexity of computing the parameter gradient.

⁴ For Momentum-SGD, we can avoid storing the parameter θ^[t] [Maclaurin et al., 2015].
⁵ grads = [tf.gradients(loss[i], theta) for i in St]; Hu = tf.reduce_mean([tf.gradients(tf.tensordot(u, g, axes), theta) for g in grads], axis)

6 Related Studies

Influence Estimation Traditional studies on influence estimation considered the change in the solution θ̂ of the problem (1) if an instance z_j was absent. For this purpose, they considered the counterfactual problem θ̂_{-j} = argmin_θ Σ_{n=1; n≠j}^N ℓ(z_n; θ). The goal of traditional influence estimation is to obtain an estimate of the difference θ̂_{-j} − θ̂ without retraining the models. Pioneering studies by Beckman and Trussell [1974], Cook [1977], and Pregibon [1981] have shown that the influence θ̂_{-j} − θ̂ can be computed analytically for linear and generalized linear models. Koh and Liang [2017] considered a further generalization of those previous studies.
They introduced the following approximation for strongly convex loss functions ℓ(z; θ):

  θ̂_{-j} − θ̂ ≈ (1/N) Ĥ^{-1} ∇_θ ℓ(z_j; θ̂),    (5)

where Ĥ = (1/N) Σ_{z ∈ D} ∇²_θ ℓ(z; θ̂) is the Hessian of the loss at the optimal model. We note that Zhang et al. [2018] and Khanna et al. [2019] further extended this approach: Zhang et al. [2018] used it to fix the labels of training instances, and Khanna et al. [2019] proposed to find the influential instances using Bayesian quadrature, which includes (5) as a special case.
Our study differs from these traditional approaches in two ways. First, the proposed SGD-influence does not assume the optimality of the obtained models. We instead consider the models obtained in each step of SGD, which are not necessarily optimal. Second, the proposed method does not require the loss function ℓ(z; θ) to be convex; it is valid even for non-convex losses.
Estimation of Data Importance Some recent works [Ren et al., 2018; Ghorbani and Zou, 2019] focused on estimating the importance of each training instance. Ren et al. [2018] proposed weighting each training instance so that the validation loss is minimized. Ghorbani and Zou [2019] introduced axioms that a data importance measure should satisfy, and derived the Shapley value as an ideal importance. These studies demonstrated the effectiveness of their importance measures only empirically. The advantage of our study over these prior studies lies in the theory of the estimation error, which clarifies the circumstances under which the estimated importances are accurate.
Learning from Noisy Labels There are many studies on training models from noisy labels [Aslam and Decatur, 1996; Brodley and Friedl, 1999; Natarajan et al., 2013; Zhang et al., 2018]. The difference from our study is that these studies assume that label noise is the only issue.
However, as Figures 13 and 14 show, the model performance depends not only on label noise but also on atypical inputs. For example, in Figure 13, we can find several atypical instances that even humans cannot label confidently. Such atypical instances should be removed from the training set rather than having their labels fixed, because no correct label can be assigned to them.
Outlier Detection A typical approach to data cleansing is outlier detection. Outlier detection is used to remove abnormal instances from the training set before training the model, to ensure that the model is not affected by the abnormal instances. For tabular data, there are several popular methods such as One-class SVM [Schölkopf et al., 2001], Local Outlier Factor [Breunig et al., 2000], and Isolation Forest [Liu et al., 2008]. For complex data such as images, autoencoders [Aggarwal, 2016; Zhou and Paffenroth, 2017] and generative adversarial networks [Schlegl et al., 2017] can also be used. It should be noted that although these methods can find abnormal instances, those instances are not necessarily influential on the resulting models, as we will show in the experiments.

7 Experiments

Here, we evaluate two aspects of the proposed method: its performance on LIE and on data cleansing. We used Python 3 and PyTorch 1.0 for the experiments.⁶ The experiments were conducted on 64-bit Ubuntu 16.04 with a six-core Intel Xeon E5-1650 3.6 GHz CPU, 128 GB RAM, and four GeForce GTX 1080 Ti GPUs.

⁶ The codes are available at https://github.com/sato9hara/sgd-influence

7.1 Evaluation of LIE

We first evaluate the effectiveness of the proposed method in the estimation of linear influence.
For this purpose, we artificially created small datasets to ensure that the true linear influence is computable. The detailed setup can be found in Appendix C.1.
Setup We used three datasets: Adult [Dua and Karra Taniskidou, 2017], 20Newsgroups⁷, and MNIST [LeCun et al., 1998]. These are common benchmarks in tabular data analysis, natural language processing, and image recognition, respectively. We adopted these three datasets to demonstrate the validity of the proposed method across different data domains. For 20Newsgroups and MNIST, we selected the two document categories ibm.pc.hardware and mac.hardware, and the images of ones and sevens, respectively, so that the problems become binary classification.
To observe the validity of the proposed method beyond convexity, we adopted two models: linear logistic regression and deep neural networks. For the deep neural networks, we used a network with two fully connected layers of eight units each and ReLU activations. We used the sigmoid function at the output layer and adopted the cross entropy as the loss function. It should be noted that the loss function for linear logistic regression is convex, while that for deep neural networks is non-convex.
In the experiments, we randomly subsampled 200 instances for the training set D and the validation set D′. We then estimated the linear influence on the validation loss using Algorithm 2. Here, we set the query vector u as u = (1/|D′|) Σ_{z′ ∈ D′} ∇_θ ℓ(z′; θ^[T]). The estimation of linear influence thus amounts to estimating the change in the validation loss, ⟨u, θ^[T]_{-j} − θ^[T]⟩ ≈ (1/|D′|) Σ_{z′ ∈ D′} (ℓ(z′; θ^[T]_{-j}) − ℓ(z′; θ^[T])).
Evaluation We ran the counterfactual SGD for all z_j ∈ D and computed the true linear influence. For evaluation, we compared the estimated influences with this true influence using Kendall's tau and the Jaccard index.
With Kendall's tau, a typical metric for ordinal association, we measure the correlation between the estimated and true influences. Kendall's tau takes values between minus one and plus one, where one indicates that the orderings of the estimated and true influences are identical. With the Jaccard index, we measure the identification accuracy of the influential instances. For data cleansing, users are interested in instances with large positive or negative influences. We therefore selected the ten instances with the largest positive and the ten with the largest negative true influences, and constructed a set of 20 important instances. We compared this set with the estimated one using the Jaccard index, which varies between zero and one, where one indicates that the sets are identical.
Results We adopted the method proposed by Koh and Liang [2017] in (5) as the baseline, abbreviated as K&L. For deep neural networks, the Hessian matrix is not positive definite, which makes the estimator (5) invalid. To alleviate the effect of negative eigenvalues, we added a positive constant 1.0 to the diagonal, as suggested by Koh and Liang [2017].
Figure 1 shows a clear advantage of the proposed method: it estimated the true linear influences with high precision. The estimated influences concentrate on the diagonal lines, indicating that they accurately approximate the true influences. In contrast, the influences estimated by K&L were less accurate. We observed that the estimator (5) sometimes becomes numerically unstable owing to the presence of small eigenvalues in the Hessian matrix.
For a quantitative comparison, we repeated the experiment 100 times while randomly changing the instance subsampling. Table 1 lists the average Kendall's tau and Jaccard index.
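For concreteness, both evaluation metrics can be implemented in a few lines (a sketch of my own; the function names are hypothetical and this is not the paper's evaluation code):

```python
import numpy as np

def kendall_tau(a, b):
    # naive O(n^2) Kendall's tau: normalized concordant-minus-discordant pair count
    n = len(a)
    s = sum(np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
            for i in range(n) for j in range(i + 1, n))
    return 2.0 * s / (n * (n - 1))

def important_set(influence, k=10):
    # indices of the k most positive and k most negative influences
    order = np.argsort(influence)
    return set(order[:k]) | set(order[-k:])

def jaccard(true_inf, est_inf, k=10):
    a = important_set(np.asarray(true_inf), k)
    b = important_set(np.asarray(est_inf), k)
    return len(a & b) / len(a | b)
```

On identical inputs both functions return 1; `scipy.stats.kendalltau` provides a tie-corrected version of the former.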
The results again show that the proposed method can accurately estimate the true linear influences.

7.2 Evaluation on Data Cleansing

We now show that the proposed method is effective for data cleansing. Specifically, on MNIST [LeCun et al., 1998] and CIFAR10 [Krizhevsky and Hinton, 2009], we demonstrate that we can effectively improve the models by removing influential instances suggested by the proposed method. The detailed setup and full results can be found in Appendices C.2 and C.4.
Setup We used MNIST and CIFAR10. From the original training set, we held out 10,000 randomly selected instances as the validation set and used the remaining instances as the training set. As models,

⁷ http://qwone.com/~jason/20Newsgroups/

[Figure 1: six scatter plots of estimated vs. true linear influence, with the reference line y = x; panels (a) LogReg: Adult, (b) LogReg: 20Newsgroups, (c) LogReg: MNIST, (d) DNN: Adult, (e) DNN: 20Newsgroups, (f) DNN: MNIST.]
Figure 1: Estimated linear influences for linear logistic regression (LogReg) and deep neural networks (DNN) for all the 200 training instances.
K&L denotes the method of Koh and Liang [2017].

Table 1: Average Kendall's tau and Jaccard index (± std.).

                      Kendall's tau                          Jaccard index
              LogReg                DNN                LogReg                DNN
          Proposed   K&L       Proposed   K&L      Proposed   K&L       Proposed   K&L
  Adult   .93 (.02)  .85 (.07)  .75 (.10)  .54 (.12)  .80 (.10)  .60 (.17)  .59 (.16)  .32 (.11)
  20News  .94 (.05)  .82 (.15)  .45 (.12)  .37 (.12)  .79 (.15)  .52 (.19)  .25 (.08)  .11 (.07)
  MNIST   .95 (.02)  .70 (.15)  .45 (.12)  .27 (.16)  .83 (.10)  .41 (.16)  .37 (.15)  .27 (.12)

we used convolutional neural networks. In SGD, we set the number of epochs K = 20, the batch size |S_t| = 64, and the learning rate η_t = 0.05.
As baselines for data cleansing, in addition to K&L, we adopted two outlier detection methods, Autoencoder [Aggarwal, 2016] and Isolation Forest [Liu et al., 2008]. We also adopted random data removal as a baseline. For the proposed method, we additionally introduced an approximate version in this experiment. In Algorithm 2, the proposed method retraces all steps of the SGD; in the approximate version, we retrace only one epoch, which requires less computation than the original algorithm. Moreover, it is also storage friendly, because we need to store intermediate information only for the last epoch of SGD.
We proceeded with the experiment as follows. First, we trained the model with SGD on the training set. We then computed the influence of each training instance using the proposed method as well as the baseline methods. Here, we used the same query vector u as in the previous experiment. Finally, we removed the top-m influential instances from the training set and retrained the model.
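End to end, this remove-and-retrain protocol can be sketched as follows (a toy numpy example of my own, with least squares in place of the CNNs and exact leave-one-out influences in place of the estimated ones):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, m = 80, 3, 5
beta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, d))
y = X @ beta + 0.1 * rng.normal(size=N)
y[:m] += 10.0                                  # plant m grossly mislabeled instances
Xv = rng.normal(size=(200, d))
yv = Xv @ beta + 0.1 * rng.normal(size=200)    # validation set

fit = lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0]
val_loss = lambda th: float(np.mean((Xv @ th - yv) ** 2))

base = val_loss(fit(X, y))

# influence of z_j on the validation loss: its change when z_j is removed
# (computed by retraining here; the paper's method estimates it without retraining)
infl = np.array([val_loss(fit(np.delete(X, j, 0), np.delete(y, j))) - base
                 for j in range(N)])

keep = np.argsort(infl)[m:]   # drop the m instances whose removal helps the most
cleaned = val_loss(fit(X[keep], y[keep]))
```

With the planted label noise, `cleaned` should come out below `base`, i.e., the cleansing improves the validation loss.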
For model retraining, we ran normal SGD for 19 epochs and switched to the counterfactual SGD in the last epoch.⁸ If the misclassification rate of the retrained model decreases, we can conclude that the training set was effectively cleansed.
Results We repeated the experiment 30 times while randomly changing the split between the training and validation sets. Figure 2 shows the misclassification rates on the test set after data cleansing with each method.⁹ It is evident from the figures that the misclassification rates decreased after data cleansing with the proposed method and its approximate version. We compared the misclassification rates before and after the data cleansing using a t-test with the significance level set to 0.05.

⁸ We observed that this works well. For the results with full counterfactual SGD, see Appendix C.4.
⁹ See Appendix C.4 for the full results.

[Figure 2: misclassification rate vs. number of instances removed (10⁰ to 10⁴) for (a) MNIST and (b) CIFAR10, comparing No Removal, Random, Autoencoder, Isolation Forest, K&L, Proposed, and Proposed (Approx.).]
Figure 2: Average misclassification rates on the test set after data cleansing. The errorbars are omitted for better visibility.
See Appendix C.4 for the full results.

[Figure 3: examples of found influential instances; panels (a) Proposed (Approx.) and (b) Autoencoder show MNIST images with labels y = 3, 4, 8, 6, and panels (c) Proposed (Approx.) and (d) Autoencoder show CIFAR10 images with labels y = deer, frog, truck, dog.]
Figure 3: Examples of found influential instances and their labels in (a)(b) MNIST and (c)(d) CIFAR10.

We observed that none of the baseline methods except K&L attained statistically significant improvements. By contrast, the proposed method and its approximate version attained statistically significant improvements. For both datasets, the improvements of the proposed method and of its approximate version were statistically significant for the numbers of removed instances between 10 and 1000, and between 10 and 100, respectively.¹⁰ Moreover, both methods outperformed K&L. The results confirm that the proposed method can effectively suggest influential instances for data cleansing. We also note that the proposed method and its approximate version performed comparably well. This observation suggests that, in practice, we only need to retrace one epoch to infer the influential instances, which requires less computation and stores intermediate information only for the last epoch of SGD.
Figure 3 shows examples of found influential instances. An interesting observation is that Autoencoder tended to find images with noisy or vivid backgrounds. Visually, it seems reasonable to select them as outliers. However, as we have seen in Figure 2, removing these outliers did not help to improve the models. In contrast, the proposed method found images with confusing shapes or backgrounds. Although they are not as visually striking as the outliers, Figure 2 confirms that these instances significantly affect the models.
These observations indicate that the proposed method could find influential instances that can be missed even by users with domain knowledge.

8 Conclusion

We considered supporting non-experts in building accurate machine learning models through data cleansing by suggesting influential instances. Specifically, we aimed at establishing an algorithm that can infer the influential instances even for non-convex loss functions such as those of deep neural networks. Our idea is to use the fact that modern machine learning models are trained using SGD. We introduced a refined notion of influence for models trained with SGD, which we named SGD-influence. We then proposed an algorithm that can accurately approximate the SGD-influence without running extra SGD. We also proved that the proposed method provides valid estimates even for non-convex loss functions. The experimental results showed that the proposed method can accurately infer influential instances. Moreover, on MNIST and CIFAR10, we demonstrated that the models can be effectively improved by removing the influential instances suggested by the proposed method.

10See Appendix C.3 for a possible way to determine the number of removals in practice.

Acknowledgments
Satoshi Hara is supported by JSPS KAKENHI Grant Number JP18K18106. Atsushi Nitanda is supported by JSPS KAKENHI Grant Number JP19K20337.

References

Charu C Aggarwal. Outlier Analysis, Second Edition. Springer, 2016.

Javed A Aslam and Scott E Decatur. On the sample complexity of noise-tolerant learning. Information Processing Letters, 57(4):189–195, 1996.

RJ Beckman and HJ Trussell. The distribution of an arbitrary studentized residual and the effects of updating in multiple regression. Journal of the American Statistical Association, 69(345):199–201, 1974.

Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander.
LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, volume 29, pages 93–104. ACM, 2000.

Carla E Brodley and Mark A Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11:131–167, 1999.

R Dennis Cook and Sanford Weisberg. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 22(4):495–508, 1980.

R Dennis Cook. Detection of influential observation in linear regression. Technometrics, 19(1):15–18, 1977.

Dheeru Dua and Efi Karra Taniskidou. UCI machine learning repository, 2017.

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Amirata Ghorbani and James Zou. Data Shapley: Equitable valuation of data for machine learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.

Rajiv Khanna, Been Kim, Joydeep Ghosh, and Oluwasanmi Koyejo. Interpreting black box predictions using Fisher kernels. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 3382–3390, 2019.

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, pages 1885–1894, 2017.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422. IEEE, 2008.

Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning, pages 2113–2122, 2015.

Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems, pages 1196–1204, 2013.

Andrew Ng. Machine Learning Yearning, 2017.

Daryl Pregibon. Logistic regression diagnostics. The Annals of Statistics, 9(4):705–724, 1981.

Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pages 146–157. Springer, 2017.

Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.

Xuezhou Zhang, Xiaojin Zhu, and Stephen Wright. Training set debugging using trusted items. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 4482–4489, 2018.

Chong Zhou and Randy C Paffenroth. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 665–674.
ACM, 2017.