{"title": "(Not) Bounding the True Error", "book": "Advances in Neural Information Processing Systems", "page_first": 809, "page_last": 816, "abstract": null, "full_text": "(Not) Bounding the True Error\n\nJohn Langford\n\nDepartment of Computer Science\n\nCarnegie-Mellon University\n\nPittsburgh, PA 15213\n\njcl+@cs.cmu.edu\n\nRich Caruana\n\nDepartment of Computer Science\n\nCornell University\nIthaca, NY 14853\n\ncaruana@cs.cornell.edu\n\nAbstract\n\nWe present a new approach to bounding the true error rate of a continuous\nvalued classi\ufb01er based upon PAC-Bayes bounds. The method \ufb01rst con-\nstructs a distribution over classi\ufb01ers by determining how sensitive each\nparameter in the model is to noise. The true error rate of the stochastic\nclassi\ufb01er found with the sensitivity analysis can then be tightly bounded\nusing a PAC-Bayes bound. In this paper we demonstrate the method on\narti\ufb01cial neural networks with results of a \u0002\u0001\u0004\u0003 order of magnitude im-\nprovement vs. the best deterministic neural net bounds.\n\n1 Introduction\nIn machine learning it is important to know the true error rate a classi\ufb01er will achieve on\nfuture test cases. Estimating this error rate can be suprisingly dif\ufb01cult. For example, all\nknown bounds on the true error rate of arti\ufb01cial neural networks tend to be extremely loose\nand often result in the meaningless bound of \u201calways err\u201d (error rate = 1.0).\n\nIn this paper, we do not bound the true error rate of a neural network. Instead, we bound\nthe true error rate of a distribution over neural networks which we create by analysing one\nneural network. (Hence, the title.) This approach proves to be much more fruitful than\ntrying to bound the true error rate of an individual network. The best current approaches\n\nthe true error rate. 
We produce nontrivial bounds on the true error rate of a stochastic neural\n\n[1][2] often require \u0005\u0007\u0006\b\u0006\t\u0006 , \u0005\u0007\u0006\b\u0006\t\u0006\t\u0006 , or more examples before producing a nontrivial bound on\nnetwork with less than \u0005\n\u0006\t\u0006 examples. A stochastic neural network is a neural network\nwhere each weight \u000b\r\f is perturbed by a gaussian with variance \u000e\u0010\u000f\n\nevery time it is evaluated.\nOur approach uses the PAC-Bayes bound [5]. The approach can be thought of as a\nredivision of the work between the experimenter and the theoretician: we make the experi-\nmenter work harder so that the theoretician\u2019s true error bound becomes much tighter. This\n\u201cextra work\u201d on the part of the experimenter is signi\ufb01cant, but tractable, and the resulting\nbounds are much tighter.\n\nAn alternative viewpoint is that the classi\ufb01cation problem is \ufb01nding a hypothesis with\na low upper bound on the future error rate. We present a post-processing phase for neural\nnetworks which results in a classi\ufb01er with a much lower upper bound on the future error\nrate. The post-processing can be used with any arti\ufb01cial neural net trained with any opti-\nmization method; it does not require the learning procedure be modi\ufb01ed, re-run, or even\nthat the threshold function be differentiable. In fact, this post-processing step can easily be\nadapted to other learning algorithms.\n\nDavid MacKay [4] has done signi\ufb01cant work to make approximate Bayesian learning\ntractable with a neural network. Our work here is complimentary rather than competitive.\nWe exhibit a technique which will likely give nontrivial true error rate bounds for Bayesian\n\n\f\n\fneural networks regardless of approximation or prior modeling errors. 
Verification of this statement is work in progress.

The post-processing step finds a "large" distribution over classifiers which has a small average empirical error rate. Given the average empirical error rate, it is straightforward to apply the PAC-Bayes bound in order to find a bound on the average true error rate. We find this large distribution over classifiers by performing a simple noise sensitivity analysis on the learned model. The noise model allows us to generate a distribution of classifiers with a known, small, average empirical error rate. In this paper we refer to the distribution of neural nets that results from this noise analysis as a stochastic neural net model.

Why do we expect the PAC-Bayes bound to be a significant improvement over standard covering number and VC bound approaches? There exist learning problems for which the difference between the lower bound and the PAC-Bayes upper bound is tight up to $O(\ln m)$, where $m$ is the number of training examples. This is superior to the guarantees which can be made for typical covering number bounds, where the gap is, at best, known up to an (asymptotic) constant. The guarantee that PAC-Bayes bounds are sometimes quite tight encourages us to apply them here.

The next sections will:

1. Describe the bounds we will compare.
2. Describe our algorithm for constructing a distribution over neural networks.
3. Present experimental results.

2 Theoretical setup

We will work in the standard supervised batch learning setting. This setting starts with the assumption that all examples are drawn from some fixed (unknown) distribution, $D$, over (input, output) pairs, $(x, y)$. The output $y$ is drawn from the space $\{-1, +1\}$ and the input space is arbitrary. The goal of machine learning is to use a sample set $S$ of $m$ pairs to find a classifier, $h$, which maps the input space to the output space and has a small true error, $e(h) = \Pr_D(h(x) \neq y)$. Since the distribution $D$ is unknown, the true error rate is not observable. However, we can observe the empirical error rate, $\hat{e}(h) = \Pr_S(h(x) \neq y) = \frac{1}{m}\sum_{i=1}^{m} I(h(x_i) \neq y_i)$.

Now that the basic quantities of interest are defined, we will first present a modern neural network bound, then specialize the PAC-Bayes bound to a stochastic neural network. A stochastic neural network is simply a neural network where each weight in the neural network is drawn from some distribution whenever it is used. We will describe our technique for constructing the distribution of the stochastic neural network.

2.1 Neural Network bound

We will compare a specialization of the best current neural network true error rate bound [2] with our approach. The neural network bound is described in terms of the following parameters:

1. A margin, $0 < \gamma \leq 1$.
2. An arbitrary function $\phi$ (unrelated to the neural network sigmoid function), defined by $\phi(x) = 1$ if $x \leq 0$, $\phi(x) = 0$ if $x \geq \gamma$, and linear in between.
3. $A_l$, an upper bound on the sum of the magnitudes of the weights in the $l$-th layer of the neural network.
4. $L_l$, a Lipschitz constant which holds for the $l$-th layer of the neural network. A Lipschitz constant is a bound on the magnitude of the derivative.
5. $d$, the size of the input space.

With these parameters defined, we get the following bound.

Theorem 2.1 (2 layer feed-forward Neural Network true error bound) For all 2 layer feed-forward networks $h$ satisfying the constraints above, with probability $1 - \delta$ over the draw of $m$ training examples,

$$ e(h) \leq \hat{E}_S\,\phi(y h(x)) + \frac{c\, A_1 A_2 L_1 L_2 \sqrt{\ln d}}{\gamma \sqrt{m}} + \sqrt{\frac{\ln(1/\delta)}{2m}} $$

where $c$ is a universal constant.

Proof: Given in [2].

The theorem is actually only given up to a universal constant. "1" might be the right choice for $c$, but this is just an educated guess. The neural network true error bound above is (perhaps) the tightest known bound for general feed-forward neural networks and so it is the natural bound to compare with.

This 2 layer feed-forward bound is not easily applied in a tight manner because we can't calculate a priori what our weight bound $A_l$ should be. This can be patched up using the principle of structural risk minimization. In particular, we can state the bound for $A_l = c' 2^{k_l}$, where $k_l$ is some non-negative integer and $c'$ is a constant. If the $k$-th bound holds with probability $\delta_k = \delta/2^{k+1}$, then all bounds will hold simultaneously with probability $1 - \delta$, since $\sum_k \delta/2^{k+1} \leq \delta$. Applying this approach to the values of both $A_1$ and $A_2$, we get the following theorem:

Theorem 2.2 (2 layer feed-forward Neural Network true error bound) For all 2 layer feed-forward networks $h$ and all non-negative integers $k_1, k_2$ with $A_l = c' 2^{k_l}$, with probability $1 - \delta$ over the draw of $m$ training examples,

$$ e(h) \leq \hat{E}_S\,\phi(y h(x)) + \frac{c\, A_1 A_2 L_1 L_2 \sqrt{\ln d}}{\gamma \sqrt{m}} + \sqrt{\frac{\ln(1/\delta) + (k_1 + k_2 + 2)\ln 2}{2m}} $$

Proof: Apply the union bound to all possible values of $k_1$ and $k_2$ as discussed above.

In practice, we will use $c' = 1$ and report the value of the tightest applicable bound for all $k_1, k_2$.

2.2 Stochastic Neural Network bound

Our approach will start with a simple refinement [3] of the original PAC-Bayes bound [5]. We will first specialize this bound to stochastic neural networks and then show that the use of this bound in conjunction with a post-processing algorithm results in a much tighter true error rate upper bound.

Theorem 2.3 (PAC-Bayes bound) For all priors $P$ over classifiers and all $\delta \in (0, 1]$, with probability $1 - \delta$ over the draw of a training set $S$ of $m$ examples, for all posterior distributions $Q$ over classifiers,

$$ \mathrm{KL}\!\left(\hat{e}(Q)\,\|\,e(Q)\right) \leq \frac{\mathrm{KL}(Q\|P) + \ln\frac{m+1}{\delta}}{m} $$

where $\hat{e}(Q) = E_{h \sim Q}\,\hat{e}(h)$, $e(Q) = E_{h \sim Q}\,e(h)$, and $\mathrm{KL}(q\|p) = q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}$ is the KL divergence between two Bernoulli distributions.

Proof: Given in [3].

We need to specialize this theorem for application to a stochastic neural network with a choice of the "prior".
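The PAC-Bayes bound above constrains the true error $e(Q)$ only implicitly, through the Bernoulli KL divergence. Since $\mathrm{KL}(q\|p)$ is increasing in $p$ for $p \geq q$, an explicit upper bound can be read off by bisection. A minimal sketch (the function names are ours, not from the paper):

```python
import math

def kl_bernoulli(q, p):
    """Binary KL divergence KL(q || p) between Bernoulli means."""
    eps = 1e-12
    q = min(max(q, eps), 1.0 - eps)
    p = min(max(p, eps), 1.0 - eps)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def kl_inverse_upper(q_hat, rhs, tol=1e-9):
    """Largest p with KL(q_hat || p) <= rhs, found by bisection.
    This turns the implicit PAC-Bayes statement into an explicit
    upper bound on the true error rate."""
    lo, hi = q_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q_hat, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

# e.g. stochastic empirical error 0.1 and a right-hand side of 0.05:
p_star = kl_inverse_upper(0.1, 0.05)
print(p_star)
```

The same inversion serves every KL-form bound in this section, including the sample convergence bound used for the monte carlo estimate.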
Our "prior" will be zero on all neural net structures other than the one we train and a multidimensional isotropic gaussian on the values of the weights in our neural network. The multidimensional gaussian will have a mean of 0 and a variance of $\sigma^2$ in each dimension. This choice is made for convenience and happens to work.

The optimal value of $\sigma$ is unknown and dependent on the learning problem, so we will wish to parameterize it in an example dependent manner. We can do this using the same trick as for the original neural net bound. Use a sequence of bounds where $\sigma = c\,2^k$ for some constant $c$ and non-negative integer $k$, and for the $k$-th bound set $\delta_k = \delta/2^{k+1}$. Now, the union bound will imply that all bounds hold simultaneously with probability at least $1 - \delta$. We will choose $c = 1$ as a reasonable default value.

Now, assuming that our "posterior" $Q$ is also defined by a multidimensional gaussian with the mean and variance in each dimension defined by $w_i$ and $\sigma_i^2$, we can specialize to the following corollary:

Corollary 2.4 (Stochastic Neural Network bound) Let $n$ be the number of weights in a neural net, $w_i$ be the $i$-th weight and $\sigma_i^2$ be the variance of the $i$-th weight. Then, we have

$$ \mathrm{KL}\!\left(\hat{e}(Q)\,\|\,e(Q)\right) \leq \frac{\sum_{i=1}^{n}\left[\ln\frac{\sigma}{\sigma_i} + \frac{\sigma_i^2 + w_i^2}{2\sigma^2} - \frac{1}{2}\right] + \ln\frac{m+1}{\delta}}{m} \qquad (1) $$

Proof: Analytic calculation of the KL divergence between two multidimensional Gaussians, and the union bound applied for each value of $k$.

One more step is necessary in order to apply this bound. The essential difficulty is evaluating $\hat{e}(Q)$. This quantity is observable although calculating it to high precision is difficult. We will avoid the need for a direct evaluation by a monte carlo evaluation and a bound on the tail of the monte carlo evaluation. Let $\hat{\hat{e}}(Q)$ be the observed rate of failure of $N$ random hypotheses drawn according to $Q$, each applied to a random training example. Then, the following simple bound holds:

Theorem 2.5 (Sample Convergence Bound) For all distributions $Q$ and all sample sets $S$, with probability $1 - \delta$,

$$ \mathrm{KL}\!\left(\hat{\hat{e}}(Q)\,\|\,\hat{e}(Q)\right) \leq \frac{\ln(1/\delta)}{N} $$

where $N$ is the number of evaluations of the stochastic hypothesis.

Proof: This is simply an application of the Chernoff bound for the tail of a Binomial, where a "head" occurs when an error is observed and the bias is $\hat{e}(Q)$.

In order to calculate a bound on the expected true error rate, we will first bound the expected empirical error rate $\hat{e}(Q)$ with confidence $\delta/2$, then bound the expected true error rate $e(Q)$, using our bound on $\hat{e}(Q)$, with confidence $\delta/2$. Since the total probability of failure is only $\delta$, our bound will hold with probability $1 - \delta$. In practice, we will use $N = 1000$ evaluations of the empirical error rate of the stochastic neural network.

2.3 Distribution Construction algorithm

One critical step is missing in the description: How do we calculate the multidimensional gaussian $Q$? The variance of the posterior gaussian needs to be dependent on each weight in order to achieve a tight bound, since we want any "meaningless" weights to not contribute significantly to the overall sample complexity. We use a simple greedy algorithm to find the appropriate variance in each dimension.

1. Train a neural net on the examples.
2. For every weight, $w_i$, search for the variance, $\sigma_i^2$, which reduces the empirical accuracy of the stochastic neural network by some fixed target percentage (we use 1-5%) while holding all other weights fixed.

[Figure 1: two plots of error vs. pattern presentations (log scale on the left, expanded linear vertical scale on the right) showing the SNN bound, NN bound, and the SNN/NN train and test errors on the synthetic problem.]

Figure 1: Plot of measured errors and error bounds for the neural network (NN) and the stochastic neural network (SNN) on the synthetic problem. The training set has 100 cases and the reduction in empirical error is 5%. Note that a true error bound of "100" (visible in the graph on the left) implies that at least 100 times more examples are required in order to make a nonvacuous bound. The graph on the right expands the vertical scale by excluding the poor true error bound that has error above 100. The curves for NN and SNN are qualitatively similar on the train and test sets. As expected, the SNN consistently performs 5% worse than the NN on the train set (easier to see in the graph on the right).
Surprisingly, the SNN performs worse than the NN by less than 5% on the test sets. Both NN and SNN exhibit overfitting after about 6000-12000 pattern presentations (600-1200 epochs). The shape of the SNN bound roughly mimics the shape of the empirically measured true error (this is more visible in the graph on the right) and thus might be useful for indicating where the net begins overfitting.

3. The stochastic neural network defined by $\{(w_i, \sigma_i^2)\}$ will generally have a too-large empirical error. Therefore, we calculate a global multiplier $\rho$ such that the stochastic neural network defined by $\{(w_i, \rho\,\sigma_i^2)\}$ decreases the empirical accuracy by only the same amount (absolute error rate) used in Step 2.

4. Then, we evaluate the empirical error rate of the resulting stochastic neural net by repeatedly drawing samples from the stochastic neural network. In the work reported here we use 100-1000 samples.

3 Experimental Results

How well can we bound the true error rate of a stochastic neural network? The answer is much better than we can bound the true error rate of a neural network.

We use two datasets to empirically evaluate the quality of the new bound. The first is a synthetic dataset which has 25 input dimensions and one output dimension. Most of these dimensions are useless: simply random numbers drawn from a fixed Gaussian. One of the 25 input dimensions is dependent on the label. First, the label is drawn uniformly from $\{-1, +1\}$; then the special dimension is drawn from a label-dependent Gaussian. Note that this learning problem can not be solved perfectly because some examples will be drawn from the tails where the gaussians overlap.
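The synthetic data generation just described can be sketched as follows (the exact means and variances are not recoverable from the text, so the unit-variance Gaussians below are assumptions):

```python
import numpy as np

# Sketch of the synthetic problem: 25 input dimensions, all but one pure
# noise, and one dimension that depends on the label. Unit-variance
# Gaussians are an assumption, not the paper's stated parameters.
rng = np.random.default_rng(1)

def make_synthetic(m, n_dims=25, informative=0):
    y = rng.choice([-1, 1], size=m)           # label drawn uniformly
    X = rng.normal(0.0, 1.0, (m, n_dims))     # useless dimensions: pure noise
    # the informative dimension comes from a label-dependent Gaussian whose
    # tails overlap, so perfect classification is impossible
    X[:, informative] = rng.normal(y.astype(float), 1.0, m)
    return X, y

X, y = make_synthetic(100)
print(X.shape, y.shape)
```

With overlapping class-conditional Gaussians, even the best possible classifier retains some irreducible error, which is what makes the bound-prediction problem interesting.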
The "ideal" neural net to use in solving this synthetic problem is a single node perceptron. We will instead use a 2 layer neural net with 2 hidden nodes using the sigmoid transfer function. This overly complex neural net will result in the potential for significant overfitting, which makes the bound prediction problem interesting. It is also somewhat more "realistic" if the neural net structure does not exactly match the learning problem.

The second dataset is the ADULT problem from the UCI Machine Learning Repository. We use a 2 layer neural net with 2 hidden units for this problem as well because preliminary experiments showed that nets this small can overfit the ADULT dataset if the training sample is small.

To keep things challenging, we use just 100-200 examples in our experiments. As we will see in Figures 1 and 2, constructing a nonvacuous bound for a continuous hypothesis space with only 100-200 examples is quite difficult. The conventional bounds are hopelessly loose.

[Figure 2: two plots of error vs. pattern presentations (log scale on the left, expanded linear vertical scale on the right) showing the SNN bound, NN bound, and the SNN/NN train and test errors on the ADULT problem.]

Figure 2: Plot of measured errors and error bounds for the neural network (NN) and the stochastic neural network (SNN) on the UCI ADULT dataset. These graphs show the results obtained using a 1% reduction in empirical error instead of the 5% reduction used in Figure 1. The training sample size for this problem is 200 cases. NN and SNN exhibit overfitting after approximately 12000 pattern presentations (600 epochs). As in Figure 1, a true error bound of "100" implies that at least 100 times more examples are required in order to make a nonvacuous bound. The graph on the right expands the vertical scale by excluding the poor true error bound.

Figure 1 shows the results for the synthetic problem. For this problem we use 100 training cases and a 5% reduction in empirical error. The results for the ADULT problem are presented in Figure 2. For this problem we use 200 training cases and a 1% reduction in empirical error. Experiments performed on these problems using somewhat smaller and larger training samples yielded similar results. The choice of reduction in empirical error is somewhat arbitrary. We see qualitatively similar results if we switch to a 1% reduction for the synthetic problem and a 5% reduction for the ADULT problem.

There are several things worth noting about the results in the two figures.

1. The SNN upper bounds are 2-3 orders of magnitude lower than the NN upper bounds. While not as tight as might be desired, the SNN upper bounds are orders of magnitude better and are not vacuous.

2. The SNNs perform somewhat better than expected. In particular, on the synthetic problem the SNN true error rate is noticeably less than 5% worse than the true error rate of the NN (true error rates are estimated using large test sets). This is surprising considering that we fixed the difference in empirical error rates at 5% for the synthetic problem. Similarly, on the ADULT problem we observe that the difference in true error rates between the SNN and NN typically is only about 0.5%, about half of the target difference of 1%. This is good because it suggests that we do not lose as much accuracy as might be expected when creating the SNN.

3. On both test problems, the shape of the SNN bound is somewhat similar to the shape of the true error rate. In particular, the local minima in the SNN bound occur roughly where the local minima in the true error rates occur. The SNN bound may weakly predict the overfitting points of the SNN and NN nets.

The comparison between the neural network bound and the stochastic neural network bound is not quite "fair" due to the form of the bound. In particular, the stochastic neural network bound can never return a value greater than "always err". This implies that when the bound is near the value of "1", it is difficult to judge how rapidly extra examples will improve the stochastic neural network bound. We can judge the sample complexity of the stochastic bound by plotting the value of the numerator in equation 1. Figure 3 plots the complexity versus the number of pattern presentations in training.

[Figure 3: plot of the complexity term (numerator of equation 1) vs. pattern presentations.]

Figure 3: We plot the "complexity" of the stochastic network model (numerator of equation 1) vs. training epoch. Note that the complexity increases with more training as expected and stays below 100, implying nonvacuous bounds on a training set of size 100.

In this figure, we observe the expected result: the "complexity" (numerator of equation 1) increases with more training and is significantly less than the number of examples (100).

The stochastic bound is a radical improvement on the neural network bound, but it is not yet a perfectly tight bound. Given that we do not have a perfectly tight bound, one important consideration arises: does the minimum of the stochastic bound predict the minimum of the true error rate (as estimated on a large holdout dataset)?
In particular, can we use the stochastic bound to determine when we should cease training? The stochastic bound depends upon (1) the complexity, which increases with training time, and (2) the training error, which decreases with training time. This dependence results in a minimum which occurs at approximately 12000 pattern presentations for both of our test problems. The point of minimal true error (for the stochastic and deterministic neural networks) occurs at approximately 6000 pattern presentations for the synthetic problem, and at about 18000 pattern presentations for the ADULT problem, indicating that the stochastic bound weakly predicts the point of minimum error. The neural network bound has no such minimum.

Is the choice of 1-5% increased empirical error optimal? In general, the "optimal" choice of the extra error rate depends upon the learning problem. Since the stochastic neural network bound (corollary 2.4) holds for all multidimensional gaussian distributions, we are free to optimize the choice of distribution in any way we desire. Figure 4 shows the resulting bound for different choices of posterior $Q$. The bound has a minimum at a small value of extra error, indicating that our initial choices of 1% and 5% are in the right ballpark, and 5% may be unnecessarily large. Larger differences in empirical error rate such as 5% are easier to obtain reliably with fewer samples from the stochastic neural net, but we have not had difficulty using as few as 100 samples from the SNN with as small as a 1% increase in empirical error. Also note that the complexity always decreases with increasing entropy in the distribution of our stochastic neural net.
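This trade-off between extra empirical error and complexity is what the greedy construction of Section 2.3 navigates. A toy sketch of the per-weight variance search, using a linear classifier in place of the neural net (the doubling search, sample counts, input scaling, and function names are all illustrative assumptions, not the authors' code):

```python
import numpy as np

# Toy sketch of Step 2 of Section 2.3: grow each weight's noise scale
# until the stochastic classifier's empirical error rises by a fixed
# target, holding the other weights noise-free.
rng = np.random.default_rng(2)

def stochastic_error(w, sigmas, X, y, n_samples=100):
    """Monte-carlo estimate of the stochastic classifier's empirical
    error: fresh Gaussian weight noise is drawn for every evaluation."""
    total = 0.0
    for _ in range(n_samples):
        w_noisy = w + rng.normal(0.0, 1.0, w.shape) * sigmas
        total += np.mean(((X @ w_noisy > 0).astype(int)) != y)
    return total / n_samples

def find_sigmas(w, X, y, target_increase=0.05):
    """Per weight, double the noise scale until the empirical error
    exceeds the deterministic error plus target_increase."""
    base = stochastic_error(w, np.zeros_like(w), X, y, n_samples=1)
    sigmas = np.zeros_like(w)
    for i in range(len(w)):
        s, trial = 1e-3, np.zeros_like(w)
        while s < 1e3:                      # cap the search for safety
            trial[i] = s
            if stochastic_error(w, trial, X, y) > base + target_increase:
                break
            s *= 2.0
        sigmas[i] = s
        trial[i] = 0.0
    return sigmas

X = rng.normal(0.0, 1.0, (50, 5))
X[:, 1:] *= 0.01                       # four nearly dead inputs
w = np.array([2.0, 0.0, 0.0, 0.0, 0.0])
y = (X @ w > 0).astype(int)            # labels set by the first input only
sig = find_sigmas(w, X, y)
print(sig)
```

In this toy, the weights attached to the nearly dead inputs tolerate far more noise than the one informative weight, which is exactly how "meaningless" weights are kept from inflating the complexity term of equation 1.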
The existence of a minimum in Figure 4 is the "right" behaviour: the increased empirical error rate is significant in the calculation of the true error bound.

[Figure 4: plot of the SNN true error bound and the complexity term vs. extra training error (0 to 0.1).]

Figure 4: Plot of the stochastic neural net (SNN) bound for "posterior" distributions chosen according to the extra empirical error they introduce.

4 Conclusion

We have applied a PAC-Bayes bound for the true error rate of a stochastic neural network. The stochastic neural network bound results in a radically tighter (2-3 orders of magnitude) bound on the true error rate of a classifier while increasing the empirical and true error rates only a small amount.

Although the stochastic neural net bound is not completely tight, it is not vacuous with just 100-200 examples, and the minimum of the bound weakly predicts the point where overtraining occurs.

The results with two datasets (one synthetic and one from UCI) are extremely promising: the bounds are orders of magnitude better. Our next step will be to test the method on more datasets using a greater variety of net architectures to ensure that the bounds remain tight. In addition, there remain many opportunities for improving the application of the bound. For example, it is possible that shifting the weights when finding a maximum acceptable variance will result in a tighter bound. Also, we have not taken into account symmetries within the network which would allow for a tighter bound calculation.

References

[1] Peter Bartlett, \"The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network\", IEEE Transactions on Information Theory, Vol. 44, No. 2, March 1998.

[2] V. Koltchinskii and D. Panchenko, \"Empirical Margin Distributions and Bounding the Generalization Error of Combined Classifiers\", preprint, http://citeseer.nj.nec.com/386416.html

[3] John Langford and Matthias Seeger, \"Bounds for Averaging Classifiers\", CMU tech report, 2001.

[4] David MacKay, \"Probable Networks and Plausible Predictions - A Review of Practical Bayesian Methods for Supervised Neural Networks\", ??

[5] David McAllester, \"Some PAC-Bayes bounds\", COLT 1999.
", "award": [], "sourceid": 1968, "authors": [{"given_name": "John", "family_name": "Langford", "institution": null}, {"given_name": "Rich", "family_name": "Caruana", "institution": null}]}