{"title": "Selective Classification for Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4878, "page_last": 4887, "abstract": "Selective classification techniques (also known as reject option) have not yet been considered in the context of deep neural networks (DNNs). These techniques can potentially significantly improve DNNs prediction performance by trading-off coverage. In this paper we propose a method to construct a selective classifier given a trained neural network. Our method allows a user to set a desired risk level. At test time, the classifier rejects instances as needed, to grant the desired risk (with high probability). Empirical results over CIFAR and ImageNet convincingly demonstrate the viability of our method, which opens up possibilities to operate DNNs in mission-critical applications. For example, using our method an unprecedented 2% error in top-5 ImageNet classification can be guaranteed with probability 99.9%, with almost 60% test coverage.", "full_text": "Selective Classi\ufb01cation for Deep Neural Networks\n\nYonatan Geifman\n\nComputer Science Department\n\nTechnion \u2013 Israel Institute of Technology\n\nyonatan.g@cs.technion.ac.il\n\nRan El-Yaniv\n\nComputer Science Department\n\nTechnion \u2013 Israel Institute of Technology\n\nrani@cs.technion.ac.il\n\nAbstract\n\nSelective classi\ufb01cation techniques (also known as reject option) have not yet been\nconsidered in the context of deep neural networks (DNNs). These techniques\ncan potentially signi\ufb01cantly improve DNNs prediction performance by trading-off\ncoverage. In this paper we propose a method to construct a selective classi\ufb01er\ngiven a trained neural network. Our method allows a user to set a desired risk\nlevel. At test time, the classi\ufb01er rejects instances as needed, to grant the desired\nrisk (with high probability). 
Empirical results over CIFAR and ImageNet convincingly demonstrate the viability of our method, which opens up possibilities to operate DNNs in mission-critical applications. For example, using our method an unprecedented 2% error in top-5 ImageNet classification can be guaranteed with probability 99.9%, with almost 60% test coverage.\n\n1 Introduction\n\nWhile self-awareness remains an elusive, hard-to-define concept, a rudimentary kind of self-awareness, which is much easier to grasp, is the ability to know what you don't know, which can make you smarter. The subfield dealing with such capabilities in machine learning is called selective prediction (also known as prediction with a reject option), and it has been around for 60 years [1, 5]. The main motivation for selective prediction is to reduce the error rate by abstaining from prediction when in doubt, while keeping coverage as high as possible. An ultimate manifestation of selective prediction is a classifier equipped with a "dial" that allows for precise control of the desired true error rate (which should be guaranteed with high probability), while keeping the coverage of the classifier as high as possible.\nMany present and future tasks performed by (deep) predictive models can be dramatically enhanced by high-quality selective prediction. Consider, for example, autonomous driving. Since we cannot rely on the advent of "singularity", where AI is superhuman, we must make do with standard machine learning, which sometimes errs. But what if our deep autonomous driving network were capable of knowing that it doesn't know how to respond in a certain situation, disengaging itself in advance and alerting the human driver (hopefully not asleep at that time) to take over? 
There are plenty of other mission-critical applications that would likewise greatly benefit from effective selective prediction.\nThe literature on the reject option is quite extensive and mainly discusses rejection mechanisms for various hypothesis classes and learning algorithms, such as SVM, boosting, and nearest-neighbors [8, 13, 3]. The reject option has rarely been discussed in the context of neural networks (NNs), and so far has not been considered for deep NNs (DNNs). Existing NN works consider a cost-based rejection model [2, 4], whereby the costs of misclassification and abstaining must be specified, and a rejection mechanism is optimized for these costs. The proposed mechanism for classification is based on applying a carefully selected threshold on the maximal neuronal response of the softmax layer. We call this mechanism softmax response (SR). The cost model can be very useful when we can quantify the involved costs, but in many applications of interest meaningful costs are hard to reason about. (Imagine trying to set up appropriate rejection/misclassification costs for disengaging an autopilot driving system.) Here we consider the alternative risk-coverage view for selective classification discussed in [5].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nEnsemble techniques have been considered for selective (and confidence-rated) prediction, where rejection mechanisms are typically based on the ensemble statistics [18, 7]. However, such techniques are presently hard to realize in the context of DNNs, for which it could be very costly to train sufficiently many ensemble members. Recently, Gal and Ghahramani [9] proposed an ensemble-like method for measuring uncertainty in DNNs, which bypasses the need to train several ensemble members. 
Their method works via sampling multiple dropout applications of the forward pass to\nperturb the network prediction randomly. While this Monte-Carlo dropout (MC-dropout) technique\nwas not mentioned in the context of selective prediction, it can be directly applied as a viable selective\nprediction method using a threshold, as we discuss here.\nIn this paper we consider classi\ufb01cation tasks, and our goal is to learn a selective classi\ufb01er (f, g),\nwhere f is a standard classi\ufb01er and g is a rejection function. The selective classi\ufb01er has to allow\nfull guaranteed control over the true risk. The ideal method should be able to classify samples in\nproduction with any desired level of risk with the optimal coverage rate. It is reasonable to assume\nthat this optimal performance can only be obtained if the pair (f, g) is trained together. As a \ufb01rst step,\nhowever, we consider a simpler setting where a (deep) neural classi\ufb01er f is already given, and our\ngoal is to learn a rejection function g that will guarantee with high probability a desired error rate.\nTo this end, we consider the above two known techniques for rejection (SR and MC-dropout), and\ndevise a learning method that chooses an appropriate threshold that ensures the desired risk. For a\ngiven classi\ufb01er f, con\ufb01dence level \u03b4, and desired risk r\u2217, our method outputs a selective classi\ufb01er\n(f, g) whose test error will be no larger than r\u2217 with probability of at least 1 \u2212 \u03b4.\nUsing the well-known VGG-16 architecture, we apply our method on CIFAR-10, CIFAR-100 and\nImageNet (on ImageNet we also apply the RESNET-50 architecture). We show that both SR and\ndropout lead to extremely effective selective classi\ufb01cation. On both the CIFAR datasets, these two\nmechanisms achieve nearly identical results. However, on ImageNet, the simpler SR mechanism\nis signi\ufb01cantly superior. 
More importantly, we show that almost any desirable risk level can be guaranteed with a surprisingly high coverage. For example, an unprecedented 2% error in top-5 ImageNet classification can be guaranteed with probability 99.9%, with almost 60% test coverage.\n\n2 Problem Setting\n\nWe consider a standard multi-class classification problem. Let X be some feature space (e.g., raw image data) and Y a finite label set, Y = {1, 2, 3, . . . , k}, representing k classes. Let P(X, Y) be a distribution over X × Y. A classifier f is a function f : X → Y, and the true risk of f w.r.t. P is R(f|P) ≜ E_{P(X,Y)}[ℓ(f(x), y)], where ℓ : Y × Y → R+ is a given loss function, for example the 0/1 error. Given a labeled set Sm = {(x_i, y_i)}_{i=1}^{m} ⊆ (X × Y) sampled i.i.d. from P(X, Y), the empirical risk of the classifier f is r̂(f|Sm) ≜ (1/m) Σ_{i=1}^{m} ℓ(f(x_i), y_i).\nA selective classifier [5] is a pair (f, g), where f is a classifier, and g : X → {0, 1} is a selection function, which serves as a binary qualifier for f as follows:\n\n(f, g)(x) ≜ f(x) if g(x) = 1, and "don't know" if g(x) = 0.\n\nThus, the selective classifier abstains from prediction at a point x iff g(x) = 0. The performance of a selective classifier is quantified using coverage and risk. Fixing P, coverage, defined to be φ(f, g) ≜ E_P[g(x)], is the probability mass of the non-rejected region in X. The selective risk of (f, g) is\n\nR(f, g) ≜ E_P[ℓ(f(x), y) g(x)] / φ(f, g).   (1)\n\nClearly, the risk of a selective classifier can be traded off for coverage. The entire performance profile of such a classifier can be specified by its risk-coverage curve, defined to be risk as a function of coverage [5].\nConsider the following problem. 
We are given a classifier f, a training sample Sm, a confidence parameter δ > 0, and a desired risk target r∗ > 0. Our goal is to use Sm to create a selection function g such that the selective risk of (f, g) satisfies\n\nPr_{Sm}{R(f, g) > r∗} < δ,   (2)\n\nwhere the probability is over training samples, Sm, sampled i.i.d. from the unknown underlying distribution P. Among all classifiers satisfying (2), the best ones are those that maximize the coverage. For a fixed f, and a given class G (which will be discussed below), in this paper our goal is to select g ∈ G such that the selective risk R(f, g) satisfies (2) while the coverage φ(f, g) is maximized.\n\n3 Selection with Guaranteed Risk Control\n\nIn this section, we present a general technique for constructing a selection function with guaranteed performance, based on a given classifier f and a confidence-rate function κf : X → R+ for f. We assume nothing about κf other than that it induces a ranking: if κf(x1) ≥ κf(x2), for x1, x2 ∈ X, then κf indicates that the confidence in the prediction f(x2) is not higher than the confidence in the prediction f(x1). In this section we are not concerned with the question of what is a good κf (which is discussed in Section 4); our goal is to generate a selection function g with guaranteed performance for a given κf.\nFor the remainder of this paper, the loss function ℓ is taken to be the standard 0/1 loss function (unless explicitly mentioned otherwise). Let Sm = {(x_i, y_i)}_{i=1}^{m} ⊆ (X × Y)^m be a training set, assumed to be sampled i.i.d. from an unknown distribution P(X, Y). Given also are a confidence parameter δ > 0, and a desired risk target r∗ > 0. 
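To make the coverage and selective-risk definitions above concrete, here is a minimal NumPy sketch (our illustration, not the authors' code; the helper name selective_stats is hypothetical) that evaluates the empirical selective risk and empirical coverage of a thresholded selective classifier on a labeled sample:

```python
import numpy as np

def selective_stats(losses, kappa, theta):
    """Empirical coverage and selective risk of (f, g_theta) on a labeled sample.

    losses -- 0/1 loss of f on each of the m samples
    kappa  -- confidence rate kappa_f(x_i) for each sample
    theta  -- rejection threshold: g(x) = 1 iff kappa_f(x) >= theta
    """
    losses = np.asarray(losses, dtype=float)
    g = (np.asarray(kappa, dtype=float) >= theta).astype(float)
    coverage = g.mean()                  # empirical coverage (fraction accepted)
    if coverage == 0.0:
        return 0.0, 0.0                  # vacuous classifier: everything rejected
    risk = (losses * g).sum() / g.sum()  # empirical analogue of eq. (1)
    return risk, coverage
```

Sweeping θ over the observed confidence values traces the empirical risk-coverage curve of the induced selective classifier.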
Based on Sm, our goal is to learn a selection function g such that the selective risk of the classifier (f, g) satisfies (2).\nFor θ > 0, we define the selection function gθ : X → {0, 1} as\n\ngθ(x) = gθ(x|κf) ≜ 1 if κf(x) ≥ θ, and 0 otherwise.   (3)\n\nFor any selective classifier (f, g), we define its empirical selective risk with respect to the labeled sample Sm,\n\nr̂(f, g|Sm) ≜ [(1/m) Σ_{i=1}^{m} ℓ(f(x_i), y_i) g(x_i)] / φ̂(f, g|Sm),\n\nwhere φ̂ is the empirical coverage, φ̂(f, g|Sm) ≜ (1/m) Σ_{i=1}^{m} g(x_i). For any selection function g, denote by g(Sm) the g-projection of Sm, g(Sm) ≜ {(x, y) ∈ Sm : g(x) = 1}.\nThe selection with guaranteed risk (SGR) learning algorithm appears in Algorithm 1. The algorithm receives as input a classifier f, a confidence-rate function κf, a confidence parameter δ > 0, a target risk r∗,¹ and a training set Sm. The algorithm performs a binary search to find the optimal bound guaranteeing the required risk with sufficient confidence. The SGR algorithm outputs a selective classifier (f, g) and a risk bound b∗. In the rest of this section we analyze the SGR algorithm. We make use of the following lemma, which gives the tightest possible numerical generalization bound for a single classifier, based on a test over a labeled sample.\n\nLemma 3.1 (Gascuel and Caraux, 1992, [10]) Let P be any distribution and consider a classifier f whose true error w.r.t. P is R(f|P). Let 0 < δ < 1 be given and let r̂(f|Sm) be the empirical error of f w.r.t. the labeled set Sm, sampled i.i.d. from P. 
Let B∗(r̂, δ, Sm) be the solution b of the following equation:\n\nΣ_{j=0}^{m·r̂(f|Sm)} (m choose j) b^j (1 − b)^{m−j} = δ.   (4)\n\nThen, Pr_{Sm}{R(f|P) > B∗(r̂, δ, Sm)} < δ.\n\nWe emphasize that the numerical bound of Lemma 3.1 is the tightest possible in this setting. As discussed in [10], the analytic bounds derived using, e.g., the Hoeffding inequality (or other concentration inequalities) approximate this numerical bound and incur some slack.\n\n¹ Whenever the triplet Sm, δ and r∗ is infeasible, the algorithm will return a vacuous solution with zero coverage.\n\nAlgorithm 1 Selection with Guaranteed Risk (SGR)\n1: SGR(f, κf, δ, r∗, Sm)\n2: Sort Sm according to κf(x_i), x_i ∈ Sm (and now assume w.l.o.g. that indices reflect this ordering).\n3: zmin = 1; zmax = m\n4: for i = 1 to k ≜ ⌈log2 m⌉ do\n5:   z = ⌈(zmin + zmax)/2⌉\n6:   θ = κf(x_z)\n7:   g_i = g_θ {see (3)}\n8:   r̂_i = r̂(f, g_i|Sm)\n9:   b∗_i = B∗(r̂_i, δ/⌈log2 m⌉, g_i(Sm)) {see Lemma 3.1}\n10:   if b∗_i < r∗ then\n11:     zmax = z\n12:   else\n13:     zmin = z\n14:   end if\n15: end for\n16: Output: (f, g_k) and the bound b∗_k.\n\nFor any selection function g, let P_g(X, Y) be the projection of P over g; that is, P_g(X, Y) ≜ P(X, Y | g(X) = 1). The following theorem is a uniform convergence result for the SGR procedure.\n\nTheorem 3.2 (SGR) Let Sm be a given labeled set, sampled i.i.d. from P, and consider an application of the SGR procedure. For k ≜ ⌈log2 m⌉, let (f, g_i) and b∗_i, i = 1, . . . , k, be the selective classifier and bound computed by SGR in its ith iteration. Then,\n\nPr_{Sm}{∃i : R(f|P_{g_i}) > B∗(r̂_i, δ/k, g_i(Sm))} < δ.\n\nProof Sketch: For any i = 1, . . . 
, k, let m_i = |g_i(Sm)| be the random variable giving the number of accepted examples from Sm on the ith iteration of SGR. For any fixed value of 0 ≤ m_i ≤ m, by Lemma 3.1, applied with the projected distribution P_{g_i}(X, Y) and a sample S_{m_i} consisting of m_i examples drawn from the product distribution (P_{g_i})^{m_i},\n\nPr_{S_{m_i} ∼ (P_{g_i})^{m_i}}{R(f|P_{g_i}) > B∗(r̂_i, δ/k, g_i(Sm))} < δ/k.   (5)\n\nThe sampling distribution of m_i labeled examples in SGR is determined by the following process: sample a set Sm of m examples from the product distribution P^m and then use g_i to filter Sm, resulting in a (random) number m_i of examples. Therefore, the left-hand side of (5) equals\n\nPr_{Sm ∼ P^m}{R(f|P_{g_i}) > B∗(r̂_i, δ/k, g_i(Sm)) | g_i(Sm) = m_i}.\n\nClearly,\n\nR(f|P_{g_i}) = E_{P_{g_i}}[ℓ(f(x), y)] = E_P[ℓ(f(x), y) g_i(x)] / φ(f, g_i) = R(f, g_i).\n\nTherefore,\n\nPr_{Sm}{R(f, g_i) > B∗(r̂_i, δ/k, g_i(Sm))} = Σ_{n=0}^{m} Pr_{Sm}{R(f, g_i) > B∗(r̂_i, δ/k, g_i(Sm)) | g_i(Sm) = n} · Pr{g_i(Sm) = n} ≤ (δ/k) Σ_{n=0}^{m} Pr{g_i(Sm) = n} = δ/k.\n\nAn application of the union bound completes the proof. □\n\n4 Confidence-Rate Functions for Neural Networks\n\nConsider a classifier f, assumed to be trained for some unknown distribution P. In this section we consider two confidence-rate functions, κf, based on previous work [9, 2]. We note that an ideal confidence-rate function κf(x) for f should reflect true loss monotonicity. Given (x1, y1) ∼ P and (x2, y2) ∼ P, we would like the following to hold: κf(x1) ≤ κf(x2) if and only if ℓ(f(x1), y1) ≥ ℓ(f(x2), y2). Obviously, one cannot expect to have an ideal κf. 
Given a confidence-rate function κf, a useful way to analyze its effectiveness is to draw the risk-coverage curve of its induced rejection function, gθ(x|κf), as defined in (3). This risk-coverage curve shows the relationship between θ and R(f, gθ). For example, see Figure 2(a), where two (nearly identical) risk-coverage curves are plotted. While the confidence-rate functions we consider are not ideal, they will be shown empirically to be extremely effective.²\nThe first confidence-rate function we consider has been around in the NN folklore for years, and is explicitly mentioned by [2, 4] in the context of the reject option. This function works as follows: given any neural network classifier f(x) where the last layer is a softmax, we denote by f(x|j) the soft response output for the jth class. The confidence-rate function is defined as κf ≜ max_{j∈Y} f(x|j). We call this function softmax response (SR).\nSoftmax responses are often treated as probabilities (responses are positive and sum to 1), but some authors criticize this approach [9]. Noting that, for our purposes, the ideal confidence-rate function should only provide a coherent ranking rather than absolute probability values, softmax responses are potentially good candidates for relative confidence rates.\nWe are not familiar with a rigorous explanation for SR, but it can be intuitively motivated by observing neuron activations. For example, Figure 1 depicts average response values of every neuron in the second-to-last layer for true positives and false positives for the class '8' in the MNIST dataset (and qualitatively similar behavior occurs in all MNIST classes). 
The x-axis corresponds to neuron indices in that layer (1–128), and the y-axis shows the average responses, where green squares are averages of true positives, boldface squares highlight strong responses, and red circles correspond to the average response of false positives. It is evident that the true-positive activation response in the active neurons is much higher than the false-positive one, which is expected to be reflected in the final softmax layer response. Moreover, it can be seen that the large activation values are spread over many neurons, indicating that the confidence signal arises due to numerous patterns detected by neurons in this layer. Qualitatively similar behavior can be observed in deeper layers.\n\nFigure 1: Average response values of neuron activations for class "8" on the MNIST dataset; green squares, true positives; red circles, false positives.\n\nThe MC-dropout technique we consider was recently proposed to quantify uncertainty in neural networks [9]. To estimate uncertainty for a given instance x, we run a number of feed-forward iterations over x, each applied with dropout in the last fully connected layer. Uncertainty is taken as the variance in the responses of the neuron corresponding to the most probable class. We consider minus uncertainty as the MC-dropout confidence rate.\n\n² While Theorem 3.2 always holds, we note that if κf is severely skewed (far from ideal), the bound of the resulting selective classifier can be far from the target risk.\n\n5 Empirical Results\n\nIn Section 4 we introduced the SR and MC-dropout confidence-rate functions, defined for a given model f. We trained VGG models [17] for CIFAR-10, CIFAR-100 and ImageNet. For each of these models f, we considered both the SR and MC-dropout confidence-rate functions, κf, and the induced rejection function, gθ(x|κf). 
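Both confidence-rate functions reduce to a few lines once the network's softmax outputs are in hand. The following NumPy sketch is ours (function names are illustrative, and the MC-dropout variant assumes the stacked outputs of repeated stochastic forward passes are already available):

```python
import numpy as np

def softmax_response(probs):
    """SR confidence rate: the maximal softmax response, max_j f(x|j).
    probs: softmax outputs of shape (n_samples, n_classes)."""
    return np.asarray(probs).max(axis=1)

def mc_dropout_confidence(mc_probs):
    """MC-dropout confidence rate: minus the variance, across stochastic
    forward passes, of the response of the most probable class.
    mc_probs: shape (n_passes, n_samples, n_classes), softmax outputs of
    repeated forward passes with dropout kept active."""
    mc_probs = np.asarray(mc_probs)
    mean = mc_probs.mean(axis=0)              # average response per class
    top = mean.argmax(axis=1)                 # most probable class per sample
    cols = np.arange(mc_probs.shape[1])
    var = mc_probs[:, cols, top].var(axis=0)  # variance of that class's response
    return -var                               # higher value = more confident
```

Either array of confidence rates can then be thresholded by gθ(x|κf) exactly as in (3).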
In Figure 2 we present the risk-coverage curves obtained for each of the three datasets. These curves were obtained by computing a validation risk and coverage for many θ values. It is evident that the risk-coverage profile for SR and MC-dropout is nearly identical for both the CIFAR datasets. For the ImageNet set we plot the curves corresponding to the top-1 (dashed curves) and top-5 tasks (solid curves). On this dataset, we see that SR is significantly better than MC-dropout on both tasks. For example, in the top-1 task at 60% coverage, the SR rejection has 10% error while MC-dropout rejection incurs more than 20% error. But most importantly, these risk-coverage curves show that selective classification can potentially be used to dramatically reduce the error in the three datasets. Due to the relative advantage of SR, in the rest of our experiments we focus only on the SR rating.\n\nFigure 2: Risk-coverage curves for (a) CIFAR-10, (b) CIFAR-100 and (c) ImageNet (top-1 task: dashed curves; top-5 task: solid curves); SR method in blue and MC-dropout in red.\n\nWe now report on experiments with our SGR routine, and apply it on each of the datasets to construct high-probability risk-controlled selective classifiers for the three datasets.\n\nTable 1: Risk control results for CIFAR-10 for δ = 0.001\n\nDesired risk (r∗) | Train risk | Train coverage | Test risk | Test coverage | Risk bound (b∗)\n0.01 | 0.0079 | 0.7822 | 0.0092 | 0.7856 | 0.0099\n0.02 | 0.0160 | 0.8482 | 0.0149 | 0.8466 | 0.0199\n0.03 | 0.0260 | 0.8988 | 0.0261 | 0.8966 | 0.0298\n0.04 | 0.0362 | 0.9348 | 0.0380 | 0.9318 | 0.0399\n0.05 | 0.0454 | 0.9610 | 0.0486 | 0.9596 | 0.0491\n0.06 | 0.0526 | 0.9778 | 0.0572 | 0.9784 | 0.0600\n\n5.1 Selective Guaranteed Risk for CIFAR-10\n\nWe now consider CIFAR-10; see [14] for details. 
We used the VGG-16 architecture [17] and adapted it to the CIFAR-10 dataset by adding massive dropout, exactly as described in [15]. We used data augmentation containing horizontal flips, vertical and horizontal shifts, and rotations, and trained using SGD with momentum of 0.9, initial learning rate of 0.1, and weight decay of 0.0005. We multiplicatively dropped the learning rate by 0.5 every 25 epochs, and trained for 250 epochs. With this setting we reached a validation accuracy of 93.54%, and used the resulting network f10 as the basis for our selective classifier.\nWe applied the SGR algorithm on f10 with the SR confidence-rating function, where the training set for SGR, Sm, was taken as half of the standard CIFAR-10 validation set, which was randomly split into two equal parts. The other half, which was not consumed by SGR for training, was reserved for testing the resulting bounds. Thus, the training and test sets each contained approximately 5000 samples. We applied the SGR routine with several desired risk values, r∗, and obtained, for each such r∗, a corresponding selective classifier and risk bound b∗. All our applications of the SGR routine (for this dataset and the rest) were with a particularly small confidence level δ = 0.001.³ We then applied these selective classifiers on the reserved test set, and computed, for each selective classifier, test risk and test coverage. The results are summarized in Table 1, where we also include train risk and train coverage that were computed, for each selective classifier, over the training set.\nObserving the results in Table 1, we see that the risk bound, b∗, is always very close to the target risk, r∗. Moreover, the test risk is always bounded above by the bound b∗, as required. 
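For readers who want to reproduce the mechanics, here is a compact sketch (ours, not the authors' released code) of the Lemma 3.1 bound and the SGR binary search. b_star solves equation (4) by bisection; as a minor robustness departure from Algorithm 1, sgr returns the last threshold whose bound passed the test rather than the final iterate:

```python
import math

def binom_cdf(k, m, b):
    """P[Bin(m, b) <= k], summed term by term via the ratio recurrence."""
    total = term = (1.0 - b) ** m          # j = 0 term: C(m,0) b^0 (1-b)^m
    for j in range(1, k + 1):
        term *= (m - j + 1) / j * b / (1.0 - b)
        total += term
    return total

def b_star(r_hat, delta, m, iters=60):
    """Numerical bound B*(r_hat, delta, S_m) of Lemma 3.1: the b solving
    sum_{j=0}^{m*r_hat} C(m,j) b^j (1-b)^(m-j) = delta.
    The left-hand side is decreasing in b, so bisection applies."""
    k = int(round(m * r_hat))              # number of errors on the sample
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if binom_cdf(k, m, mid) > delta:
            lo = mid                       # tail mass still above delta
        else:
            hi = mid
    return hi

def sgr(losses, kappa, delta, r_star):
    """Sketch of Algorithm 1 (SGR): binary-search a confidence threshold
    whose risk bound b* is below r_star, maximizing coverage."""
    order = sorted(range(len(losses)), key=lambda i: kappa[i])
    l = [losses[i] for i in order]         # 0/1 losses sorted by confidence
    ks = [kappa[i] for i in order]
    m = len(l)
    k_iters = math.ceil(math.log2(m))
    zmin, zmax = 0, m - 1
    theta, bound = ks[-1], 1.0             # vacuous fallback: reject ~everything
    for _ in range(k_iters):
        z = math.ceil((zmin + zmax) / 2)
        accepted = l[z:]                   # g accepts samples with kappa >= ks[z]
        r_hat = sum(accepted) / len(accepted)
        b = b_star(r_hat, delta / k_iters, len(accepted))
        if b < r_star:                     # bound met: try to grow coverage
            theta, bound = ks[z], b
            zmax = z
        else:                              # bound violated: raise the threshold
            zmin = z
    return theta, bound
```

On a toy sample where low-confidence points carry the errors, the returned bound sits just below the requested r∗, mirroring the behavior reported in the tables.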
We compared this result to a basic baseline in which the threshold is defined to be the value that maximizes coverage while keeping the train error smaller than r∗. For this simple baseline we found that in over 50% of the cases (1000 random train/test splits), the bound r∗ was violated over the test set, with a mean violation of 18% relative to the requested r∗. Finally, we see that it is possible to guarantee with this method an amazingly small 1% error while covering more than 78% of the domain.\n\n5.2 Selective Guaranteed Risk for CIFAR-100\n\nUsing the same VGG architecture (now adapted to 100 classes) we trained a model for CIFAR-100 while applying the same data augmentation routine as in the CIFAR-10 experiment. Following precisely the same experimental design as in the CIFAR-10 case, we obtained the results of Table 2.\n\nTable 2: Risk control results for CIFAR-100 for δ = 0.001\n\nDesired risk (r∗) | Train risk | Train coverage | Test risk | Test coverage | Risk bound (b∗)\n0.02 | 0.0119 | 0.2010 | 0.0187 | 0.2134 | 0.0197\n0.05 | 0.0425 | 0.4286 | 0.0413 | 0.4450 | 0.0499\n0.10 | 0.0927 | 0.5736 | 0.0938 | 0.5952 | 0.0998\n0.15 | 0.1363 | 0.6546 | 0.1327 | 0.6752 | 0.1498\n0.20 | 0.1872 | 0.7650 | 0.1810 | 0.7778 | 0.1999\n0.25 | 0.2380 | 0.8716 | 0.2395 | 0.8826 | 0.2499\n\nHere again, SGR generated tight bounds, very close to the desired target risk, and the bounds were never violated by the true risk. Also, we see again that it is possible to dramatically reduce the risk with only a moderate compromise of the coverage. While the architecture we used is not state-of-the-art, with a coverage of 67% we easily surpassed the best known result for CIFAR-100, which currently stands at 18.85% error using the wide residual network architecture [19]. 
It is very likely that by using the wide residual network architecture ourselves we could obtain significantly better results.\n\n5.3 Selective Guaranteed Risk for ImageNet\n\nWe used an already trained ImageNet VGG-16 model based on ILSVRC2014 [16]. We repeated the same experimental design, but now the sizes of the training and test sets were approximately 25,000. The SGR results for both the top-1 and top-5 classification tasks are summarized in Tables 3 and 4, respectively. We also implemented the RESNET-50 architecture [12] in order to see if qualitatively similar results can be obtained with a different architecture. The RESNET-50 results for the ImageNet top-1 and top-5 classification tasks are summarized in Tables 5 and 6, respectively.\n\nTable 3: SGR results for the ImageNet dataset using VGG-16 top-1 for δ = 0.001\n\nDesired risk (r∗) | Train risk | Train coverage | Test risk | Test coverage | Risk bound (b∗)\n0.02 | 0.0161 | 0.2355 | 0.0131 | 0.2322 | 0.0200\n0.05 | 0.0462 | 0.4292 | 0.0446 | 0.4276 | 0.0500\n0.10 | 0.0964 | 0.5968 | 0.0948 | 0.5951 | 0.1000\n0.15 | 0.1466 | 0.7164 | 0.1467 | 0.7138 | 0.1500\n0.20 | 0.1937 | 0.8131 | 0.1949 | 0.8154 | 0.2000\n0.25 | 0.2441 | 0.9117 | 0.2445 | 0.9120 | 0.2500\n\n³ With this small δ, and the small number of reported experiments (6-7 lines in each table), we did not perform a Bonferroni correction (which can be easily added).\n\nTable 4: SGR results for the ImageNet dataset using VGG-16 top-5 for δ = 0.001\n\nDesired risk (r∗) | Train risk | Train coverage | Test risk | Test coverage | Risk bound (b∗)\n0.01 | 0.0080 | 0.3391 | 0.0078 | 0.3341 | 0.0100\n0.02 | 0.0181 | 0.5360 | 0.0179 | 0.5351 | 0.0200\n0.03 | 0.0281 | 0.6768 | 0.0290 | 0.6735 | 0.0300\n0.04 | 0.0381 | 0.7610 | 0.0379 | 0.7586 | 0.0400\n0.05 | 0.0481 | 0.8263 | 0.0496 | 0.8262 | 0.0500\n0.06 | 0.0563 | 0.8654 | 0.0577 | 0.8668 | 0.0600\n0.07 | 0.0663 | 0.9093 | 0.0694 | 0.9114 | 0.0700\n\nTable 5: SGR results for the ImageNet dataset using RESNET-50 top-1 for δ 
= 0.001\n\nDesired risk (r∗) | Train risk | Train coverage | Test risk | Test coverage | Risk bound (b∗)\n0.02 | 0.0161 | 0.2613 | 0.0164 | 0.2585 | 0.0199\n0.05 | 0.0462 | 0.4906 | 0.0474 | 0.4878 | 0.0500\n0.10 | 0.0965 | 0.6544 | 0.0988 | 0.6502 | 0.1000\n0.15 | 0.1466 | 0.7711 | 0.1475 | 0.7676 | 0.1500\n0.20 | 0.1937 | 0.8688 | 0.1955 | 0.8677 | 0.2000\n0.25 | 0.2441 | 0.9634 | 0.2451 | 0.9614 | 0.2500\n\nThese results show that even for the challenging ImageNet dataset, with both the VGG and RESNET architectures, our selective classifiers are extremely effective, and with an appropriate coverage compromise, our classifier easily surpasses the best known results for ImageNet. Not surprisingly, RESNET, which is known to achieve better results than VGG on this set, preserves its advantage over VGG through all r∗ values.\n\n6 Concluding Remarks\n\nWe presented an algorithm for learning a selective classifier whose risk can be fully controlled and guaranteed with high confidence. Our empirical study validated this algorithm on challenging image classification datasets, and showed that guaranteed risk control is achievable. Our methods can be immediately used by deep learning practitioners, helping them cope with mission-critical tasks.\nWe believe that our work is only the first significant step in this direction, and many research questions are left open. The starting point in our approach is a trained neural classifier f (supposedly trained to optimize risk under full coverage). While the rejection mechanisms we considered were extremely effective, it might be possible to identify superior mechanisms for a given classifier f. We believe, however, that the most challenging open question would be to simultaneously train both the classifier f and the selection function g to optimize coverage for a given risk level. 
Selective classification is intimately related to active learning in the context of linear classifiers [6, 11]. It would be very interesting to explore this potential relationship in the context of (deep) neural classification. In this paper we only studied selective classification under the 0/1 loss. It would be of great importance to extend our techniques to other loss functions and specifically to regression, and to fully control false-positive and false-negative rates.\n\nTable 6: SGR results for the ImageNet dataset using RESNET-50 top-5 for δ = 0.001\n\nDesired risk (r∗) | Train risk | Train coverage | Test risk | Test coverage | Risk bound (b∗)\n0.01 | 0.0080 | 0.3796 | 0.0085 | 0.3807 | 0.0099\n0.02 | 0.0181 | 0.5938 | 0.0189 | 0.5935 | 0.0200\n0.03 | 0.0281 | 0.7122 | 0.0273 | 0.7096 | 0.0300\n0.04 | 0.0381 | 0.8180 | 0.0358 | 0.8158 | 0.0400\n0.05 | 0.0481 | 0.8856 | 0.0464 | 0.8846 | 0.0500\n0.06 | 0.0581 | 0.9256 | 0.0552 | 0.9231 | 0.0600\n0.07 | 0.0663 | 0.9508 | 0.0629 | 0.9484 | 0.0700\n\nThis work has many applications. In general, any classification task where a controlled risk is critical would benefit from using our methods. An obvious example is that of medical applications, where utmost precision is required and rejections should be handled by human experts. In such applications the existence of performance guarantees, as we propose here, is essential. Financial investment applications are also obvious, where there are a great many opportunities from which one should cherry-pick the most certain ones. A more futuristic application is that of robotic sales representatives, where it could be extremely harmful if the bot tried to answer questions it does not fully understand.\n\nAcknowledgments\n\nThis research was supported by The Israel Science Foundation (grant No. 1890/14).\n\nReferences\n[1] Chao K Chow. 
An optimum character recognition system using decision functions.\n\nTransactions on Electronic Computers, (4):247\u2013254, 1957.\n\nIRE\n\n[2] Luigi Pietro Cordella, Claudio De Stefano, Francesco Tortorella, and Mario Vento. A method\nfor improving classi\ufb01cation reliability of multilayer perceptrons. IEEE Transactions on Neural\nNetworks, 6(5):1140\u20131147, 1995.\n\n[3] Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Boosting with abstention. In Advances in\n\nNeural Information Processing Systems, pages 1660\u20131668, 2016.\n\n[4] Claudio De Stefano, Carlo Sansone, and Mario Vento. To reject or not to reject: that is the\nquestion-an answer in case of neural classi\ufb01ers. IEEE Transactions on Systems, Man, and\nCybernetics, Part C (Applications and Reviews), 30(1):84\u201394, 2000.\n\n[5] R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classi\ufb01cation. Journal of\n\nMachine Learning Research, 11:1605\u20131641, 2010.\n\n[6] Ran El-Yaniv and Yair Wiener. Active learning via perfect selective classi\ufb01cation. Journal of\n\nMachine Learning Research (JMLR), 13(Feb):255\u2013279, 2012.\n\n[7] Yoav Freund, Yishay Mansour, and Robert E Schapire. Generalization bounds for averaged\n\nclassi\ufb01ers. Annals of Statistics, pages 1698\u20131722, 2004.\n\n[8] Giorgio Fumera and Fabio Roli. Support vector machines with embedded reject option. In\n\nPattern recognition with support vector machines, pages 68\u201382. Springer, 2002.\n\n[9] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: representing model\nuncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine\nLearning, pages 1050\u20131059, 2016.\n\n[10] O. Gascuel and G. Caraux. Distribution-free performance bounds with the resubstitution error\n\nestimate. Pattern Recognition Letters, 13:757\u2013764, 1992.\n\n[11] R. Gelbhart and R. El-Yaniv. The Relationship Between Agnostic Selective Classi\ufb01cation and\n\nActive. 
ArXiv e-prints, January 2017.\n\n[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im-\nage recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, pages 770\u2013778, 2016.\n\n[13] Martin E Hellman. The nearest neighbor classi\ufb01cation rule with a reject option. IEEE Transac-\n\ntions on Systems Science and Cybernetics, 6(3):179\u2013185, 1970.\n\n[14] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images.\n\n2009.\n\n9\n\n\f[15] Shuying Liu and Weihong Deng. Very deep convolutional neural network based image classi\ufb01-\ncation using small training sample size. In Pattern Recognition (ACPR), 2015 3rd IAPR Asian\nConference on, pages 730\u2013734. IEEE, 2015.\n\n[16] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng\nHuang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei.\nImageNet large scale visual recognition challenge. International Journal of Computer Vision\n(IJCV), 115(3):211\u2013252, 2015.\n\n[17] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale\n\nimage recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[18] Kush R Varshney. A risk bound for ensemble classi\ufb01cation with a reject option. In Statistical\n\nSignal Processing Workshop (SSP), 2011 IEEE, pages 769\u2013772. IEEE, 2011.\n\n[19] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks.\n\narXiv:1605.07146, 2016.\n\narXiv preprint\n\n10\n\n\f", "award": [], "sourceid": 2519, "authors": [{"given_name": "Yonatan", "family_name": "Geifman", "institution": "Technion"}, {"given_name": "Ran", "family_name": "El-Yaniv", "institution": "Technion"}]}