{"title": "Detecting Overfitting via Adversarial Examples", "book": "Advances in Neural Information Processing Systems", "page_first": 7858, "page_last": 7868, "abstract": "The repeated community-wide reuse of test sets in popular benchmark problems raises doubts about the credibility of  reported test-error rates. Verifying whether a learned model is overfitted to a test set is challenging as independent test sets drawn from the same data distribution are usually unavailable, while other test sets may introduce a distribution shift. We propose a new hypothesis test that uses only the original test data to detect overfitting. It utilizes a new unbiased error estimate that is based on adversarial examples generated from the test data and importance weighting. Overfitting is detected if this error estimate is sufficiently different from the original test error rate. We develop a specialized variant of our test for multiclass image classification, and apply it to testing overfitting of recent models to the popular ImageNet benchmark. Our method correctly indicates overfitting of the trained model to the training set, but is not able to detect any overfitting to the test set, in line with other recent work on this topic.", "full_text": "Detecting Over\ufb01tting via Adversarial Examples\n\nRoman Werpachowski\n\nAndr\u00e1s Gy\u00f6rgy\nDeepMind, London, UK\n\nCsaba Szepesv\u00e1ri\n\n{romanw,agyorgy,szepi}@google.com\n\nAbstract\n\nThe frequent reuse of test sets in popular benchmark problems raises doubts about\nthe credibility of reported test-error rates. Verifying whether a learned model is\nover\ufb01tted to a test set is challenging as independent test sets drawn from the same\ndata distribution are usually unavailable, while other test sets may introduce a\ndistribution shift. We propose a new hypothesis test that uses only the original test\ndata to detect over\ufb01tting. 
It utilizes a new unbiased error estimate that is based on adversarial examples generated from the test data and importance weighting. Overfitting is detected if this error estimate is sufficiently different from the original test error rate. We develop a specialized variant of our test for multiclass image classification, and apply it to testing overfitting of recent models to the popular ImageNet benchmark. Our method correctly indicates overfitting of the trained model to the training set, but is not able to detect any overfitting to the test set, in line with other recent work on this topic.

1 Introduction

Deep neural networks achieve impressive performance on many important machine learning benchmarks, such as image classification [18, 19, 28, 27, 16], automated translation [2, 31] or speech recognition [9, 15]. However, the benchmark datasets are used a multitude of times by researchers worldwide. Since state-of-the-art methods are selected and published based on their performance on the corresponding test set, it is typical to see results that continuously improve over time; see, e.g., the discussion of Recht et al. 
[25] and Figure 1 for the performance improvement of classifiers published for the popular CIFAR-10 image classification benchmark [18].

This process may naturally lead to models overfitted to the test set, rendering the test error rate (the average error measured on the test set) an unreliable indicator of the actual performance. Detecting whether a model is overfitted to the test set is challenging, since independent test sets drawn from the same data distribution are generally not available, while alternative test sets often introduce a distribution shift.

Figure 1: Accuracy of image classifiers on the CIFAR-10 test set, by year of publication (data from [25]).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

To estimate the performance of a model on unseen data, one may use generalization bounds to get upper bounds on the expected error rate. Generalization bounds are also applicable when the model and the data are dependent (e.g., for cross-validation or for error estimates based on the training data or the reused test data), but they usually lead to loose error bounds. Therefore, although much tighter bounds are available if the test data and the model are independent, comparing confidence intervals constructed around the training and test error rates leads to an underpowered test for detecting the dependence of a model on the test set. Recently, several methods have been proposed that allow the reuse of the test set while keeping the validity of test error rates [10]. 
However, these are intrusive: they require the user to follow a strict protocol of interacting with the test set and are thus not applicable in the more common situation when enforcing such a protocol is impossible.

In this paper we take a new approach to the challenge of detecting overfitting of a model to the test set, and devise a non-intrusive statistical test that does not restrict the training procedure and is based on the original test data. To this end, we introduce a new error estimator that is less sensitive to overfitting to the data; our test rejects the independence of the model and the test data if the new error estimate and the original test error rate are too different. The core novel idea is that the new estimator is based on adversarial examples [14], that is, on data points¹ that are not sampled from the data distribution, but instead are cleverly crafted based on existing data points so that the model errs on them. Several authors showed that the best models learned for the above-mentioned benchmark problems are highly sensitive to adversarial attacks [14, 23, 30, 6, 7, 24]: for instance, one can often create adversarial versions of images properly classified by a state-of-the-art model such that the model will misclassify them, yet the adversarial perturbations are (almost) undetectable for a human observer; see, e.g., Figure 2, where the adversarial image is obtained from the original one by a carefully selected translation.

The adversarial (error) estimator proposed in this work uses adversarial examples (generated from the test set) together with importance weighting to take into account the change in the data distribution (covariate shift) due to the adversarial transformation. 
The estimator is unbiased and has a smaller variance than the standard test error rate if the test set and the model are independent.² More importantly, since it is based on adversarially generated data points, the adversarial estimator is expected to differ significantly from the test error rate if the model is overfitted to the test set, providing a way to detect test-set overfitting. Thus, the test error rate and the adversarial error estimate (calculated based on the same test set) must be close if the test set and the model are independent, and are expected to be different in the opposite case. In particular, if the gap between the two error estimates is large, the independence hypothesis (i.e., that the model and the test set are independent) is dubious and will be rejected. Combining results from multiple training runs, we develop another method to test overfitting of a model architecture and training procedure (for simplicity, throughout the paper we refer to both together as the model architecture). The most challenging aspect of our method is to construct adversarial perturbations for which we can calculate importance weights, while keeping enough degrees of freedom in the way the adversarial perturbations are generated to maximize power, the ability of the test to detect dependence when it is present.

Figure 2: Adversarial example for the ImageNet dataset generated by a (5, −5) translation: the original example (left) is correctly classified by the VGG16 model [27] as "scale, weighing machine," the adversarially generated example (right) is classified as "toaster," while the image class is the same for any human observer.

To understand the behavior of our tests better, we first use them on a synthetic binary classification problem, where the tests are able to successfully identify the cases where overfitting is present. Then we apply our independence tests to state-of-the-art classification methods for the popular image classification benchmark, ImageNet [8]. As a sanity check, in all cases examined, our test rejects (at confidence levels close to 1) the independence of the individual models from their respective training sets. 
Applying our method to VGG16 [27] and Resnet50 [16] models/architectures, their independence from the ImageNet test set cannot be rejected at any reasonable confidence level. This is in agreement with recent findings of [26], and provides additional evidence that, despite the existing danger, it is likely that no overfitting has happened during the development of ImageNet classifiers.

The rest of the paper is organized as follows: In Section 2, we introduce a formal model for error estimation using adversarial examples, including the definition of adversarial example generators. The new overfitting-detection tests are derived in Section 3, and applied to a synthetic problem in Section 4, and to the ImageNet image classification benchmark in Section 5. Due to space limitations, some auxiliary results, including the in-depth analysis of our method on the synthetic problem, are relegated to the appendix.

¹ Throughout the paper, we use the words "example" and "point" interchangeably.
² Note that the adversarial error estimator's goal is to estimate the error rate, not the adversarial error rate (i.e., the error rate on the adversarial examples).

2 Adversarial Risk Estimation

We consider a classification problem with deterministic (noise-free) labels, which is a reasonable assumption for many practical problems, such as image recognition (we leave the extension of our method to noisy labels for future work). Let X ⊂ R^D denote the input space and Y = {0, . . . 
, K − 1} the set of labels. Data is sampled from the distribution P over X, and the class label is determined by the ground truth function f*: X → Y. We denote a random vector drawn from P by X, and its corresponding class label by Y = f*(X). We consider deterministic classifiers f: X → Y. The performance of f is measured by the zero-one loss L(f, x) = I(f(x) ≠ f*(x)),³ and the expected error (also known as the risk or expected risk in the learning theory literature) of the classifier f is defined as R(f) = E[I(f(X) ≠ Y)] = ∫_X L(f, x) dP(x).

Consider a test dataset S = {(X_1, Y_1), . . . , (X_m, Y_m)}, where the X_i are drawn from P independently of each other and Y_i = f*(X_i). In the learning setting, the classifier f usually also depends on some randomly drawn training data, hence is random itself. If f is (statistically) independent of S, then L(f, X_1), . . . , L(f, X_m) are i.i.d., thus the empirical error rate

R̂_S(f) = (1/m) Σ_{i=1}^m L(f, X_i) = (1/m) Σ_{i=1}^m I(f(X_i) ≠ Y_i)

is an unbiased estimate of R(f) for all f; that is, R(f) = E[R̂_S(f) | f]. If f and S are not independent, the performance guarantees on the empirical estimates available in the independent case are significantly weakened; for example, in case of overfitting to S, the empirical error rate is likely to be much smaller than the expected error.

Another well-known way to estimate R(f) is to use importance sampling (IS) [17]: instead of sampling from the distribution P, we sample from another distribution P′ and correct the estimate by appropriate reweighting. 
Assuming P is absolutely continuous with respect to P′ on the set E = {x ∈ X : L(f, x) ≠ 0}, we have R(f) = ∫_X L(f, x) dP(x) = ∫_E L(f, x) h(x) dP′(x), where h = dP/dP′ is the density (Radon-Nikodym derivative) of P with respect to P′ on E (h can be defined to have arbitrary finite values on X \ E). It is well known that the corresponding empirical error estimator

R̂′_{S′}(f) = (1/m) Σ_{i=1}^m L(f, X′_i) h(X′_i) = (1/m) Σ_{i=1}^m I(f(X′_i) ≠ Y′_i) h(X′_i)    (1)

obtained from a sample S′ = {(X′_1, Y′_1), . . . , (X′_m, Y′_m)} drawn independently from P′ is unbiased (i.e., E[R̂′_{S′}(f) | f] = R(f)) if f and S′ are independent.

The variance of R̂′_{S′} is minimized if P′ is the so-called zero-variance IS distribution, which is supported on E with h(x) = R(f)/L(f, x) for all x ∈ E (see, e.g., [4, Section 4.2]). This suggests that an effective sampling distribution P′ should concentrate on points where f makes mistakes, which also facilitates that R̂′_{S′}(f) becomes large if f is overfitted to S and hence R̂_S(f) is small. We achieve this through the application of adversarial examples.

2.1 Generating adversarial examples

In this section we introduce a formal framework for generating adversarial examples. 
Given a classification problem with data distribution P and ground truth f*, an adversarial example generator (AEG) for a classifier f is a (measurable) mapping g: X → X such that

(G1) g preserves the class labels of the samples, that is, f*(x) = f*(g(x)) for P-almost all x;
(G2) g does not change points that are incorrectly classified by f, that is, g(x) = x if f(x) ≠ f*(x), for P-almost all x.

³ For an event B, I(B) denotes its indicator function: I(B) = 1 if B happens and I(B) = 0 otherwise.

Figure 3: Generating adversarial examples. The top row depicts the original dataset S, with blue and orange points representing the two classes. The classifier's prediction is represented by the color of the striped areas (checkmarks and crosses denote whether a point is correctly or incorrectly classified). The arrows show the adversarial transformations via the AEG g, resulting in the new dataset S′; misclassified points are unchanged, while some correctly classified points are moved, but their original class label is unchanged. If the original data distribution is uniform over S, the transformation g is density-preserving, but not measure-preserving: after the transformation the two rightmost correctly classified points in each class have probability 0, while the leftmost misclassified point in each class has probability 3/16; hence, the density h_g for the latter points is 1/3.

Figure 3 illustrates how an AEG works. In the literature, an adversarial example g(x) is usually generated by staying in a small vicinity of the original data point x (with respect to, e.g., the 2- or the max-norm) and assuming that the resulting label of g(x) is the same as that of x (see, e.g., [14, 6]). This foundational assumption (which is in fact a margin condition on the distribution) is captured in condition (G1). 
(G2) formalizes the fact that there is no need to change samples which are already misclassified. Indeed, existing AEGs comply with this condition.

The performance of an AEG is usually measured by how successfully it generates misclassified examples. Accordingly, we call a point g(x) a successful adversarial example if x is correctly classified by f and f(g(x)) ≠ f(x) (i.e., L(f, x) = 0 and L(f, g(x)) = 1).

In the development of our AEGs for image recognition tasks, we will make use of another condition. For simplicity, we formulate this condition for distributions P that have a density ρ with respect to the uniform measure on X, which is assumed to exist (notable cases are when X is finite, X = [0, 1]^D, or X = R^D; in the latter two cases the uniform measure is the Lebesgue measure). The assumption states that the AEG needs to be density-preserving:

(G3) ρ(x) = ρ(g(x)) for P-almost all x.

Note that a density-preserving map may not be measure-preserving (the latter means that for all measurable A ⊂ X, P(A) = P(g(A))).

We expect (G3) to hold when g perturbs its input by a small amount and if ρ is sufficiently smooth. The assumption is reasonable for, e.g., image recognition problems (at least in a relaxed form, ρ(x) ≈ ρ(g(x))), where we expect that very close images will have a similar likelihood as measured by ρ. An AEG employing image translations, which satisfies (G3), will be introduced in Section 5. Both (G1) and (G3) can be relaxed (to a soft margin condition or allowing a slight change in ρ, respectively) at the price of an extra error term in the analysis that follows.

For a fixed AEG g: X → X, let P_g be the distribution of g(X) where X ∼ P (P_g is known as the pushforward measure of P under g). Further, let h_g = dP/dP_g on E = {x : L(f, x) ≠ 0} and arbitrary otherwise. 
It is easy to see that, on E, h_g(x) is well-defined and h_g ≤ 1. For any measurable A ⊂ E,

P_g(A) = P(g(X) ∈ A) ≥ P(g(X) ∈ A, X ∈ E) = P(X ∈ A) = P(A),

where the second-to-last equality holds because g(X) = X for any X ∈ E under condition (G2). Thus, P(A) ≤ P_g(A) for any measurable A ⊂ E, which implies that h_g is well-defined on E and h_g(x) ≤ 1 for all x ∈ E.

One may think that (G3) implies that h_g(x) = 1 for all x ∈ E. However, this does not hold. For example, if P is a uniform distribution, any g: X → supp P satisfies (G3), where supp P ⊂ X denotes the support of the distribution P. This is also illustrated in Figure 3.

2.2 Risk estimation via adversarial examples

Combining the ideas of this section so far, we now introduce unbiased risk estimates based on adversarial examples. Our goal is to estimate the error rate of f through an adversarially generated sample S′ = {(X′_1, Y_1), . . . , (X′_m, Y_m)} obtained through an AEG g, where X′_i = g(X_i) with X_1, . . . , X_m drawn independently from P and Y_i = f*(X_i). Since g satisfies (G1) by definition, the original example X_i and the corresponding adversarial example X′_i have the same label Y_i. Recalling that h_g = dP/dP_g ≤ 1 on E = {x ∈ X : L(f, x) = 1}, one can easily show that the importance-weighted adversarial estimate

R̂_g(f) = (1/m) Σ_{i=1}^m I(f(X′_i) ≠ Y_i) h_g(X′_i)    (2)

obtained from (1) for the adversarial sample S′ has smaller variance than that of the empirical average R̂_S(f), while both are unbiased estimates of R(f). 
Recall that both R̂_g(f) and R̂_S(f) are unbiased estimates of R(f) with expectation E[R̂_g(f)] = E[R̂_S(f)] = R(f), and so

V[R̂_g(f)] = (1/m) (E[L(f, g(X))² h_g(g(X))²] − R(f)²) ≤ (1/m) (E[L(f, g(X)) h_g(g(X))] − R²(f)) = (1/m) (R(f) − R²(f)) = V[R̂_S(f)].

Intuitively, the more successful the AEG is (i.e., the more classification error it induces), the smaller the variance of the estimate R̂_g(f) becomes.

3 Detecting overfitting

In this section we show how the risk estimates introduced in the previous section can be used to test the independence hypothesis that

(H) the sample S and the model f are independent.

If (H) holds, E[R̂_g(f)] = E[R̂_S(f)] = R(f), and so the difference T_{S,g}(f) = R̂_g(f) − R̂_S(f) is expected to be small. On the other hand, if f is overfitted to the dataset S (in which case R̂_S(f) < R(f)), we expect R̂_S(f) and R̂_g(f) to behave differently (the latter being less sensitive to overfitting) since (i) R̂_g(f) depends also on examples previously unseen by the training procedure; (ii) the adversarial transformation g aims to increase the loss, countering the effect of overfitting; (iii) especially in high-dimensional settings, in case of overfitting one may expect that there are misclassified points very close to the decision boundary of f which can be found by a carefully designed AEG. Therefore, intuitively, (H) can be rejected if |T_{S,g}(f)| exceeds some appropriate threshold.

3.1 Test based on confidence intervals

The simplest way to determine the threshold is to construct confidence intervals for these estimators based on concentration inequalities. 
Under (H), standard concentration inequalities, such as the Chernoff or empirical Bernstein bounds [3], can be used to quantify how fast R̂_S(f) and R̂_g(f) concentrate around the expected error R(f). In particular, we use the following empirical Bernstein bound [22]: Let σ̄²_S = (1/m) Σ_{i=1}^m (L(f, X_i) − R̂_S(f))² and σ̄²_g = (1/m) Σ_{i=1}^m (L(f, g(X_i)) h_g(g(X_i)) − R̂_g(f))² denote the empirical variances of L(f, X_i) and L(f, g(X_i)) h_g(g(X_i)), respectively. Then, for any 0 < δ ≤ 1, with probability at least 1 − δ,

|R̂_S(f) − R(f)| ≤ B(m, σ̄²_S, δ, 1),    (3)

where B(m, σ², δ, U) = √(2σ² ln(3/δ)/m) + 3U ln(3/δ)/m; the last parameter of B is the range of the random variables considered, and we used the fact that the range of L(f, x) is 1. Similarly, with probability at least 1 − δ,

|R̂_g(f) − R(f)| ≤ B(m, σ̄²_g, δ, 1).    (4)

It follows trivially from the union bound that if the independence hypothesis (H) holds, the above two confidence intervals [R̂_S(f) − B(m, σ̄²_S, δ, 1), R̂_S(f) + B(m, σ̄²_S, δ, 1)] and [R̂_g(f) − B(m, σ̄²_g, δ, 1), R̂_g(f) + B(m, σ̄²_g, δ, 1)], which both contain R(f) with probability at least 1 − δ, intersect with probability at least 1 − 2δ.

On the other hand, if f and S are not independent, the performance guarantees (3) and (4) may be violated and the confidence intervals may become disjoint. If this is detected, we can reject the independence hypothesis (H) at a confidence level 1 − 2δ or, equivalently, with p-value 2δ. 
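The confidence-interval test can be sketched directly from the bound above (the function and variable names, as well as the numbers in the demo calls, are ours, not from the paper's experiments):

```python
import math

def bernstein_radius(m, var, delta, rng):
    """Empirical Bernstein radius B(m, sigma^2, delta, U): for m samples with
    empirical variance `var` and range `rng`, the deviation from the mean is
    at most this value with probability at least 1 - delta."""
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * var * log_term / m) + 3.0 * rng * log_term / m

def intervals_disjoint(r_s, var_s, r_g, var_g, m, delta):
    """Reject (H) at confidence level 1 - 2*delta if the confidence intervals
    around the test error rate r_s and the adversarial estimate r_g do not
    intersect (0-1 losses, so the range parameter is 1)."""
    b_s = bernstein_radius(m, var_s, delta, 1.0)
    b_g = bernstein_radius(m, var_g, delta, 1.0)
    return abs(r_g - r_s) > b_s + b_g

# Hypothetical numbers: on a large test set a big gap is flagged,
# a small one is not.
print(intervals_disjoint(0.05, 0.05, 0.12, 0.10, 50000, 0.025))   # True
print(intervals_disjoint(0.05, 0.05, 0.055, 0.06, 50000, 0.025))  # False
```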
In other words, we reject (H) if the absolute value of the difference of the estimates, T_{S,g}(f) = R̂_g(f) − R̂_S(f), exceeds the threshold B(m, σ̄²_S, δ, 1) + B(m, σ̄²_g, δ, 1) (note that E[T_{S,g}(f)] = 0 if S and f are independent).

3.2 Pairwise test

A smaller threshold for |T_{S,g}(f)|, and hence a more effective independence test, can be devised if, instead of independently estimating the behavior of R̂_S and R̂_g(f), one utilizes their apparent correlation. Indeed, T_{S,g}(f) = (1/m) Σ_{i=1}^m T_{i,g}(f) where

T_{i,g}(f) = L(f, g(X_i)) h_g(g(X_i)) − L(f, X_i)    (5)

and the two terms in T_{i,g}(f) have the same mean and are typically highly correlated by the construction of g. Thus, we can apply the empirical Bernstein bound [22] to the pairwise differences T_{i,g}(f) to set a tighter threshold in the test: if the independence hypothesis (H) holds (i.e., S and f are independent), then for any 0 < δ < 1, with probability at least 1 − δ,

|T_{S,g}(f)| ≤ B(m, σ̄²_T, δ, U)    (6)

with B(m, σ², δ, U) = √(2σ² ln(3/δ)/m) + 3U ln(3/δ)/m, where σ̄²_T = (1/m) Σ_{i=1}^m (T_{i,g}(f) − T_{S,g}(f))² is the empirical variance of the T_{i,g}(f) terms and U = sup T_{i,g}(f) − inf T_{i,g}(f); we also used the fact that the expectation of each T_{i,g}(f), and hence that of T_{S,g}(f), is zero. 
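A minimal sketch of the pairwise test, assuming 0-1 losses and precomputed weights h_g (the function name is ours); the p-value is obtained by solving |T_{S,g}(f)| = B(m, σ̄²_T, δ, U) for δ:

```python
import math

def pairwise_p_value(losses, adv_losses, h_weights, rng=2.0):
    """Pairwise independence test. Returns (T_{S,g}(f), p-value), where
    T_i = L(f, g(X_i)) * h_g(g(X_i)) - L(f, X_i) and the p-value solves
    |T| = sqrt(2*var*ln(3/delta)/m) + 3*U*ln(3/delta)/m for delta."""
    m = len(losses)
    t = [al * h - l for l, al, h in zip(losses, adv_losses, h_weights)]
    t_mean = sum(t) / m
    var = sum((ti - t_mean) ** 2 for ti in t) / m
    sigma = math.sqrt(var)
    abs_t = abs(t_mean)
    # Closed-form solution of the empirical Bernstein equation for ln(3/delta).
    log_term = (m / (9.0 * rng ** 2)) * (
        var + 3.0 * rng * abs_t - sigma * math.sqrt(var + 6.0 * rng * abs_t))
    return t_mean, min(1.0, 3.0 * math.exp(-log_term))

# Sanity check: if the AEG changes nothing, all pairwise differences are zero
# and the p-value is capped at 1 (no evidence against independence).
t0, p0 = pairwise_p_value([0, 1, 0, 1], [0, 1, 0, 1], [1.0] * 4)
print(t0, p0)  # 0.0 1.0

# Overfitting-like pattern (hypothetical): zero test error, but many
# successful adversarial examples with weight 1/2 yield a tiny p-value.
t1, p1 = pairwise_p_value([0] * 1000, [1] * 300 + [0] * 700, [0.5] * 1000)
print(round(t1, 3))  # 0.15
print(p1 < 1e-6)     # True
```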
Since h_g ≤ 1 if L(f, x) = 1 (as discussed in Section 2.2), it follows that U ≤ 2, but further assumptions (such as g being density-preserving) can result in tighter bounds. This leads to our pairwise dependence detection method:

if |T_{S,g}(f)| > B(m, σ̄²_T, δ, 2), reject (H) at a confidence level 1 − δ (p-value δ).

For a given statistic (|T_{S,g}(f)|, σ̄²_T), the largest confidence level (smallest p-value) at which (H) can be rejected can be calculated by setting the value of the statistic |T_{S,g}(f)| − B(m, σ̄²_T, δ, 2) to zero and solving for δ. This leads to the following formula for the p-value (if the solution is larger than 1, which happens when the bound (6) is loose, δ is capped at 1):

δ = min{1, 3 exp(−(m/(9U²)) (σ̄²_T + 3U|T_{S,g}(f)| − σ̄_T √(σ̄²_T + 6U|T_{S,g}(f)|)))}.    (7)

Note that in order for the test to work well, we not only need the test statistic T_{S,g}(f) to have a small variance in case of independence (this could be achieved if g were the identity), but we also need the estimators R̂_S(f) and R̂_g(f) to behave sufficiently differently if the independence assumption is violated. The latter behavior is encouraged by stronger AEGs, as we will show empirically in Section 5.2 (see Figure 5 in particular).

3.3 Dependence detector for randomized training

The dependence between the model and the test set can arise from (i) selecting the "best" random seed in order to improve the test set performance and/or (ii) tweaking the model architecture (e.g., neural network structure) and hyperparameters (e.g., learning-rate schedule). 
If one has access to a single instance of a trained model, these two sources cannot be disentangled. However, if the model architecture and training procedure are fully specified and computational resources are adequate, it is possible to isolate (i) and (ii) by retraining the model multiple times and calculating the p-value for every training run separately. Assuming N models, let f_j, j = 1, . . . , N denote the j-th trained model and p_j the p-value calculated using the pairwise independence test (6) (i.e., from Eq. (7) in Section 3.2). We can investigate the degree to which (i) occurs by comparing the p_j values with the corresponding test set error rates R̂_S(f_j). To investigate whether (ii) occurs, we can average over the randomness of the training runs.

For every example X_i ∈ S, consider the average test statistic T̄_i = (1/N) Σ_{j=1}^N T_{i,g_j}(f_j), where T_{i,g_j}(f_j) is the statistic (5) calculated for example X_i and model f_j with the AEG g_j selected for model f_j (note that AEGs are model-dependent by construction). If, for each i and j, the random variables T_{i,g_j}(f_j) are independent, then so are the T̄_i (for all i). Hence, we can apply the pairwise dependence detector (6) with T̄_i instead of T_{i,g}(f), using the average T̄_S = (1/m) Σ_{i=1}^m T̄_i with empirical variance σ̄²_{T,N} = (1/m) Σ_{i=1}^m (T̄_i − T̄_S)², giving a single p-value p_N. If the training runs vary enough in their outcomes, different models f_j err on different data points X_i, leading to σ̄²_{T,N} < σ̄²_T, and therefore strengthening the power of the dependence detector. For brevity, we call this independence test an N-model test.

4 Synthetic experiments

First we verify the effectiveness of our method on a simple linear classification problem. 
Due to space limitations, we only convey high-level results here; details are given in Appendix A. We assume that the data is linearly separable with a margin and that the density ρ is known. We consider linear classifiers of the form f(x) = sgn(w⊤x + b) trained with the cross-entropy loss c, and we employ a one-step gradient method (an L2 version of the fast gradient-sign method of [14, 23]) to define our AEG g, which tries to modify a correctly classified point x with label y in the direction of the gradient of the cost function, yielding x′ = x − εyw/‖w‖₂, where ε ≥ 0 is the strength of the attack. To comply with the requirements for an AEG, we define g as follows: g(x) = x′ if L(f, x) = 0 and f*(x) = f*(x′) (corresponding to (G2) and (G1), respectively), while g(x) = x otherwise. Therefore, if x′ is misclassified by f, x and x′ are the only points mapped to x′ by g. This simple form of g and the knowledge of ρ allow us to compute the density h_g, making it easy to compute the adversarial error estimate (2). Figure 4 shows the average p-values produced by our N-model independence test for a dependent (solid lines) and an independent (dashed lines) test set. It can be seen that in the dependent case the test can reject independence with high confidence for a large range of attack strengths ε, while the independence hypothesis is not rejected in the case of true independence. 
More details (including why only a range of ε is suitable for detecting overfitting) are given in Appendix A.

Figure 4: Average p-values produced by the independence test in a separable linear classification problem for the cases when the model is independent of (dashed lines) and, respectively, dependent on (solid lines) the test set.

5 Testing overfitting on ImageNet

In the previous section we showed that the proposed adversarial-example-based dependence test works for a synthetic problem where the densities can be computed exactly. In this section we apply our estimates to a popular image classification benchmark, ImageNet [8]; here the main issue is to find sufficiently strong AEGs that make computing the corresponding densities possible.

To facilitate the computation of the density h_g, we only consider density-preserving AEGs as defined by (G3) (recall that (G3) is different from requiring h_g = 1). Since in (2) and (5), h_g(x) is multiplied by L(f, x), we only need to determine the density h_g for data points that are misclassified by f.

5.1 AEGs based on translations

To satisfy (G3), we implement the AEG using translations of images, which have recently been proposed as a means of generating adversarial examples [1]. Although relatively weak, such attacks fit our needs well: unless the images are procedurally centered, it is reasonable to assume that translating them by a few pixels does not change their likelihood.⁴ We also make the natural assumption that the small translations used do not change the true class of an image. 
Under these assumptions, translations by a few pixels satisfy conditions (G1) and (G3). An image-translating function g is a valid AEG if it leaves all misclassified images in place (to comply with (G2)), and either leaves a correctly classified image unchanged or applies a small translation.

⁴ Note that this assumption limits the applicability of our method, excluding such centered or essentially centered image classification benchmarks as MNIST [20] or CIFAR-10 [18].

The main benefit of using a translational AEG g (with bounded translations) is that its density h_g(x) for an image x can be calculated exactly by considering the set of images x′ that can be mapped to x by g (this is due to our assumption (G3)). We considered multiple ways of constructing translational AEGs. The best version (selected based on initial evaluations on the ImageNet training set), which we call the strongest perturbation, seeks a non-identical neighbor of a correctly classified image x (neighboring images are the ones that are accessible through small translations) that causes the classifier to make an error with the largest confidence.

Formally, we model images as 3D tensors in [0, 1]^{W×H×C} space, where C = 3 for RGB data, and W and H are the width and height of the images, respectively. Let τ_v(x) denote the translation of an image x by v ∈ Z² pixels in the (X, Y) plane (here Z denotes the set of integers). To control the amount of change, we limit the magnitude of translations and allow v ∈ V_ε = {u ∈ Z² : u ≠ (0, 0), ‖u‖_∞ ≤ ε} only, for some fixed positive ε. 
Thus, we consider AEGs of the form g(x) ∈ {τ_v(x) : v ∈ V_ε} ∪ {x} if f(x) = f∗(x) and g(x) = x otherwise (if x is correctly classified, we attempt to translate it to find an adversarial example in {τ_v(x) : v ∈ V_ε} which is misclassified by f, but x is left unchanged if no such point exists). Denoting the density of the pushforward measure P_g by ρ_g, for any misclassified point x,

$$\rho_g(x) = \rho(x) + \sum_{v \in V_\varepsilon} \rho(\tau_{-v}(x))\,\mathbb{I}(g(\tau_{-v}(x)) = x) = \rho(x)\Bigl(1 + \sum_{v \in V_\varepsilon} \mathbb{I}(g(\tau_{-v}(x)) = x)\Bigr),$$

where the second equality follows from (G3). Therefore, the corresponding density is

$$h_g(x) = \frac{1}{1 + n(x)}, \qquad (8)$$

where $n(x) = \sum_{v \in V_\varepsilon} \mathbb{I}(g(\tau_{-v}(x)) = x)$ is the number of neighboring images which are mapped to x by g. Note that given f and g, n(x) can be easily calculated by checking all possible translations of x by −v for v ∈ V_ε. It is easy to extend the above to non-deterministic perturbations, defined as distributions over AEGs, by replacing the indicator with its expectation P(g(τ_{−v}(x)) = x | x, v) with respect to the randomness of g, yielding

$$h_g(x) = \frac{1}{1 + \sum_{v \in V_\varepsilon} \mathbb{P}(g(\tau_{-v}(x)) = x \mid x, v)}. \qquad (9)$$

If g is deterministic, we have hg(x) ≤ 1/2 for any successful adversarial example x. Hence, for such g, the range U of the random variables T_i defined in (5) has a tighter upper bound of 3/2 instead of 2 (as T_i ∈ [−1, 1/2]), leading to a tighter bound in (6) and a stronger pairwise independence test. In the experiments, we use this stronger test. We provide additional details about the translational AEGs used in Appendix B.

5.2 Tests of ImageNet models

We applied our test to check if state-of-the-art classifiers for the ImageNet dataset [8] have been overfitted to the test set.
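As an aside, the neighbor-counting computation behind (8) for a deterministic translational AEG takes only a few lines. The following sketch is illustrative, not the authors' implementation: the array representation, the periodic `translate` helper, and the toy `g` (which shifts every image and is therefore not a valid AEG per (G2)) are all our own assumptions, chosen only to exercise the counting logic.

```python
import numpy as np

def translation_set(eps):
    # V_eps: all integer shifts v with v != (0, 0) and ||v||_inf <= eps
    return [(dx, dy) for dx in range(-eps, eps + 1)
            for dy in range(-eps, eps + 1) if (dx, dy) != (0, 0)]

def density_hg(x, g, translate, eps):
    # Eq. (8): h_g(x) = 1 / (1 + n(x)), where n(x) counts the neighbors
    # tau_{-v}(x), v in V_eps, that the deterministic AEG g maps onto x.
    n = sum(np.array_equal(g(translate(x, -dx, -dy)), x)
            for dx, dy in translation_set(eps))
    return 1.0 / (1 + n)

# Toy usage: periodic shifts (np.roll) stand in for translations, and g
# shifts every image by (1, 0) -- only the neighbor tau_{-(1,0)}(x) is
# mapped back onto x, so n(x) = 1 and h_g(x) = 1/2.
translate = lambda z, dx, dy: np.roll(z, (dy, dx), axis=(0, 1))
g = lambda z: translate(z, 1, 0)
x = np.arange(16.0).reshape(4, 4)
print(density_hg(x, g, translate, eps=1))  # 0.5
```

Note that a toy identity AEG would give n(x) = 0 and hence hg(x) = 1, matching the remark above that hg(x) ≤ 1/2 only holds for successful adversarial examples.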
In particular, we use the VGG16 classifier of [27] and the Resnet50 classifier of [16]. Due to computational considerations, we only analyzed a single trained VGG16 model, while the Resnet50 model was retrained 120 times. The models were trained using the parameters recommended by their respective authors.

The preprocessing procedure of both architectures involves rescaling every image so that the smaller of its width and height is 256, and then cropping it centrally to size 224 × 224. This means that translating the image by v can be trivially implemented by shifting the cropping window by −v without any loss of information for ‖v‖∞ ≤ 16, because we have enough extra pixels outside the original, centrally located cropping window. This implies that we can compute the densities of the translational AEGs for any ‖v‖∞ ≤ ε = ⌊16/3⌋ = 5 (see Appendix B.1 for a detailed explanation). Because the ImageNet data collection procedure did not impose any strict requirements on centering the images [8], it is reasonable to assume (as we do) that small (lossless) translations respect the density-preserving condition (G3).

In our first experiment, we applied our pairwise independence test (6) with the AEGs described in Appendix B (strongest, nearest, and the two random baselines) to all 1,271,167 training examples, as well as to a number of its randomly selected (uniformly without replacement) subsets of different sizes.

Figure 5: p-values for the independence test on the ImageNet training set for different sample sizes and AEG variants (left); original and adversarial risk estimates, $\widehat{R}_S(f)$ and $\widehat{R}_g(f)$, on the ImageNet training set with 97.5% two-sided confidence intervals for the 'strongest attack' AEG (right).
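The lossless translation-via-crop-window-shift described above can be sketched as follows; `crop_with_shift` and the array layout are our own illustrative assumptions, not the authors' preprocessing code.

```python
import numpy as np

def crop_with_shift(img, v, crop=224):
    # Central crop of size `crop`, with the window shifted by -v. This
    # yields exactly the central crop of the image translated by
    # v = (vx, vy), with no information loss as long as the shifted
    # window stays inside the rescaled image.
    h, w = img.shape[:2]
    top = (h - crop) // 2 - v[1]   # window shifted by -vy
    left = (w - crop) // 2 - v[0]  # window shifted by -vx
    if not (0 <= top <= h - crop and 0 <= left <= w - crop):
        raise ValueError("translation is not lossless for this image")
    return img[top:top + crop, left:left + crop]

# A 256x256 rescaled image leaves a 16-pixel margin on each side of the
# central 224x224 window, so any ||v||_inf <= 16 is lossless:
img = np.arange(256 * 256 * 3).reshape(256, 256, 3)
assert np.array_equal(crop_with_shift(img, (0, 0)), img[16:240, 16:240])
assert np.array_equal(crop_with_shift(img, (3, -2)), img[18:242, 13:237])
```

For ‖v‖∞ > 16 the shifted window would leave the rescaled image, which is why the margin of 16 pixels bounds the translations whose densities can be computed exactly.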
Besides this being a sanity check, we also used this experiment to select from different AEGs and to compare the performance of the pairwise independence test (6) to the basic version of the test described in Section 3.1.

The left graph in Figure 5 shows that with the "strongest perturbation", we were able to reject independence of the trained model and the training samples at a confidence level very close to 1 when enough training samples are considered (to be precise, for the whole training set the confidence level is 99.9994%). Note, however, that the much weaker "smallest perturbation" AEG, as well as the random transformations, are not able to detect the presence of overfitting. At the same time, the graph on the right-hand side shows the relative strength of the pairwise independence test compared to the basic version based on independent confidence interval estimates, described in detail in Section 3.1: the 97.5%-confidence intervals of the error estimates $\widehat{R}_S(f)$ and $\widehat{R}_g(f)$ overlap, which does not allow us to reject independence at a confidence level of 95% (note that here S denotes the training set).

On the other hand, when applied to the test set, we obtained a p-value of 0.96, which does not allow us to reject the independence of the trained model and the test set at all. This result could be explained by the test being too weak, as no overfitting to the training set is detected at similar sample sizes (see Figure 5), or simply by the lack of overfitting. Similar results were obtained for Resnet50, where even the N-model test with N = 120 independently trained models resulted in a p-value of 1, which does not allow us to reject independence at any confidence level. The view that there is no overfitting can be backed up in at least two ways: first, "manual" overfitting to the relatively large ImageNet test set is hard.
Second, since training an ImageNet model was just too computationally expensive until quite recently, only a relatively small number of different architectures were developed for this problem, and the evolution of their design was often driven by computational efficiency on the available hardware. On the other hand, it is also possible that increasing N sufficiently might show evidence of overfitting (this is left for future work).

6 Conclusions

We presented a method for detecting overfitting of models to datasets. It relies on an importance-weighted risk estimate computed from a new dataset obtained by generating adversarial examples from the original data points. We applied our method to the popular ImageNet image classification task. For this purpose, we developed a specialized variant of our method for image classification that uses adversarial translations, providing arguments for its correctness. Luckily, and in agreement with other recent work on this topic [25, 26, 13, 21, 32], we found no evidence of overfitting of state-of-the-art classifiers to the ImageNet test set.

The most challenging aspect of our method is to construct adversarial perturbations for which we can calculate the importance weights; finding stronger perturbations than the ones based on translations for image classification is an important question for the future. Another interesting research direction is to consider extensions beyond image classification, for example, by building on recent adversarial attacks for speech-to-text methods [5], machine translation [11] or text classification [12].

Acknowledgements

We thank J.
Uesato for useful discussions and advice about adversarial attack methods and sharing\ntheir implementations [30] with us, as well as M. Rosca and S. Gowal for help with retraining image\nclassi\ufb01cation models. We also thank B. O\u2019Donoghue for useful remarks about the manuscript, and L.\nSchmidt for an in-depth discussion of their results on this topic. Finally, we thank D. Balduzzi, S.\nLegg, K. Kavukcuoglu and J. Martens for encouragement, support, lively discussions and feedback.\n\nReferences\n[1] Aharon Azulay and Yair Weiss. Why do deep convolutional networks generalize so poorly to small image\n\ntransformations? 2018. arXiv:1805.12177.\n\n[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning\nto align and translate. In Proceedings of the International Conference on Learning Representations (ICLR),\n2015.\n\n[3] St\u00e9phane Boucheron, G\u00e1bor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic\n\nTheory of Independence. Oxford University Press, 2013.\n\n[4] James Antonio Bucklew. Introduction to Rare Event Simulation. Springer New York, 2004.\n\n[5] N. Carlini and D. Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE\n\nSecurity and Privacy Workshops (SPW), May 2018.\n\n[6] Nicholas Carlini and David A. Wagner. Adversarial examples are not easily detected: Bypassing ten\ndetection methods. In Proceedings of the 10th ACM Workshop on Arti\ufb01cial Intelligence and Security,\nAISec@CCS 2017, Dallas, TX, USA, November 3, 2017, pages 3\u201314, 2017. URL http://doi.acm.org/\n10.1145/3128572.3140444.\n\n[7] Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In 2017\nIEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017, pages 39\u201357,\n2017. URL https://doi.org/10.1109/SP.2017.49.\n\n[8] J. Deng, W. Dong, R. Socher, L. J. Li, Kai Li, and Li Fei-Fei. 
ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, June 2009. doi: 10.1109/CVPR.2009.5206848.

[9] L. Deng, G. Hinton, and B. Kingsbury. New types of deep neural network learning for speech recognition and related applications: an overview. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8599–8603. IEEE, May 2013.

[10] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015.

[11] Javid Ebrahimi, Daniel Lowd, and Dejing Dou. On adversarial examples for character-level neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 653–663, 2018.

[12] Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. Hotflip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 31–36, 2018.

[13] Vitaly Feldman, Roy Frostig, and Moritz Hardt. The advantages of multiple classes for reducing overfitting from test set reuse. In Proceedings of the 36th International Conference on Machine Learning, pages 1892–1900, 2019.

[14] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[15] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649.
IEEE, 2013.

[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[17] H. Kahn and T. E. Harris. Estimation of particle transmission by random sampling. In Monte Carlo Method, volume 12 of Applied Mathematics Series, pages 27–30. National Bureau of Standards, 1951.

[18] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[20] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/, 2010.

[21] Horia Mania, John Miller, Ludwig Schmidt, Moritz Hardt, and Benjamin Recht. Model similarity mitigates test set overuse. 2019. arXiv:1905.12580.

[22] Volodymyr Mnih, Csaba Szepesvári, and Jean-Yves Audibert. Empirical Bernstein stopping. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 672–679, New York, NY, USA, 2008. ACM.

[23] Nicolas Papernot, Patrick D. McDaniel, and Ian J. Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. 2016. arXiv:1605.07277.

[24] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519. ACM, 2017.

[25] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar.
Do CIFAR-10 classi\ufb01ers\n\ngeneralize to CIFAR-10? 2018. arXiv:1806.00451.\n\n[26] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classi\ufb01ers\n\ngeneralize to ImageNet? 2019. arXiv:1902.10811.\n\n[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In\n\nProceedings of the International Conference on Learning Representations (ICLR), 2015.\n\n[28] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru\nErhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In 2015 IEEE\nConference on Computer Vision and Pattern Recognition (CVPR), 2015. arXiv:1409.4842.\n\n[29] T. Tieleman and G. Hinton. Lecture 6.5\u2014RmsProp: Divide the gradient by a running average of its recent\n\nmagnitude. COURSERA: Neural Networks for Machine Learning, 2012.\n\n[30] Jonathan Uesato, Brendan O\u2019Donoghue, A\u00e4ron van den Oord, and Pushmeet Kohli. Adversarial risk and\nthe dangers of evaluating against weak attacks. In Proceedings of the 35th International Conference on\nMachine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5025\u20135034, 2018.\narXiv:1802.05666.\n\n[31] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim\nKrikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing\nLiu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George\nKurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals,\nGreg Corrado, Macduff Hughes, and Jeffrey Dean. Google\u2019s neural machine translation system: Bridging\nthe gap between human and machine translation. 2016. arXiv:1609.08144.\n\n[32] Chhavi Yadav and L\u00e9on Bottou. Cold case: The lost MNIST digits. May 2019. 
arXiv:1905.10498.