{"title": "Supervised autoencoders: Improving generalization performance with unsupervised regularizers", "book": "Advances in Neural Information Processing Systems", "page_first": 107, "page_last": 117, "abstract": "Generalization performance is a central goal in machine learning, particularly when learning representations with large neural networks. A common strategy to improve generalization has been through the use of regularizers, typically as a norm constraining the parameters. Regularizing hidden layers in a neural network architecture, however, is not straightforward. There have been a few effective layer-wise suggestions, but without theoretical guarantees for improved performance. In this work, we theoretically and empirically analyze one such model, called a supervised auto-encoder: a neural network that predicts both inputs (reconstruction error) and targets jointly. We provide a novel generalization result for linear auto-encoders, proving uniform stability based on the inclusion of the reconstruction error---particularly as an improvement on simplistic regularization such as norms or even on more advanced regularizations such as the use of auxiliary tasks. 
Empirically, we then demonstrate that, across an array of architectures with different numbers of hidden units and activation functions, the supervised auto-encoder compared to the corresponding standard neural network never harms performance and can significantly improve generalization.", "full_text": "Supervised autoencoders: Improving generalization performance with unsupervised regularizers

Lei Le
Department of Computer Science
Indiana University
Bloomington, IN
leile@iu.edu

Andrew Patterson and Martha White
Department of Computing Science
University of Alberta
Edmonton, AB T6G 2E8, Canada
{ap3, whitem}@ualberta.ca

Abstract

Generalization performance is a central goal in machine learning, with explicit generalization strategies needed when training over-parametrized models, like large neural networks. There is growing interest in using multiple, potentially auxiliary tasks, as one strategy towards this goal. In this work, we theoretically and empirically analyze one such model, called a supervised auto-encoder: a neural network that jointly predicts targets and inputs (reconstruction). We provide a novel generalization result for linear auto-encoders, proving uniform stability based on the inclusion of the reconstruction error—particularly as an improvement on simplistic regularization such as norms. We then demonstrate empirically that, across an array of architectures with different numbers of hidden units and activation functions, the supervised auto-encoder compared to the corresponding standard neural network never harms performance and can improve generalization.

1 Introduction

Generalization is a central concept in machine learning: learning functions from a finite set of data that can perform well on new data.
Generalization bounds have been characterized for many functions, including linear functions [1], those with low dimensionality [2, 3] and functions from reproducing kernel Hilbert spaces [4]. Many of these bounds are obtained through some form of regularization, typically ℓ2 regularization [5, 6], or by restricting the complexity of the function class, such as by constraining the number of parameters [1].

Understanding generalization performance is particularly critical for powerful function classes, such as neural networks. Neural networks have well-known overfitting issues, with common strategies to reduce overfitting including drop-out [7–9], early stopping [10] and data augmentation [11, 12], including adversarial training [13] and label smoothing [14]. Many layer-wise regularization strategies have also been suggested for neural networks, such as layer-wise training [15, 16], pre-training with layer-wise additions of either unsupervised or supervised learning [15] and the use of auxiliary variables for hidden layers [17].

An alternative direction that has begun to be explored is to instead consider regularization through the addition of tasks. Multi-task learning [18] has been shown to improve generalization performance, from early work showing that learning tasks jointly reduces the required number of samples [19, 20] to later work particularly focused on trace-norm regularization on the weights of a linear, single hidden-layer neural network for a set of tasks [21–23]. Some theoretical work has also been done for auxiliary tasks [24], with the focus of showing that the addition of auxiliary tasks can improve the representation and so generalization. In parallel, a variety of experiments have demonstrated the utility of adding layer-wise unsupervised errors as auxiliary tasks [15, 16, 25–27].
Auxiliary tasks have also been explored through the use of hints for neural networks [28, 18].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this work, we investigate an auxiliary-task model for which we can make generalization guarantees, called a supervised auto-encoder (SAE). An SAE is a neural network that predicts both inputs and outputs, and has previously been shown empirically to provide significant improvements when used in a semi-supervised setting [16] and in deep neural networks [29]. We provide a novel uniform stability result, showing that the linear SAE—which consists of the addition of reconstruction error to a linear neural network—provides uniform stability and so a bound on generalization error. We show that the stability coefficient decays similarly to the stability coefficient under ℓ2 regularization [5], providing effective generalization performance but avoiding the negative bias from shrinking coefficients. The reconstruction error may incur some bias, but it is related to the prediction task and so is more likely to prefer a more robust model amongst a set of similarly effective models for prediction. This bound, to the best of our knowledge, is (a) one of the first bounds demonstrating that supervised dimensionality reduction architectures can provide improved generalization performance and (b) a much tighter bound than is possible from applying generalization results from multi-task learning [21–23] and learning with auxiliary tasks [24].
Finally, we demonstrate empirically that adding the reconstruction error never harms performance compared to the corresponding neural network model, and in some cases can significantly improve classification accuracy.

2 Supervised autoencoders and representation learning

We consider a supervised learning setting, where the goal is to learn a function from a vector of inputs x ∈ R^d to predict a vector of targets y ∈ R^m. The function is trained on a finite batch of i.i.d. data, (x1, y1), ..., (xt, yt), with the aim to predict well on new samples generated from the same distribution. To do well in prediction, a common goal is representation learning, where the inputs xi are first transformed into a new representation, for which it is straightforward to learn a simple predictor—such as a linear predictor.

Auto-encoders (AE) are one strategy to extract a representation. An AE is a neural network where the outputs are set to x, the inputs. By learning to reconstruct the input, the AE extracts underlying or abstract attributes that facilitate accurate prediction of the inputs. Linear auto-encoders with a single hidden layer are equivalent to principal component analysis [30][31, Theorem 12.1], which finds (orthogonal) explanatory factors for the data. More generally, nonlinear auto-encoders have indeed been found to extract key attributes, including high-level features [32] and Gabor-filter features [33].

A supervised auto-encoder (SAE) is an auto-encoder with the addition of a supervised loss on the representation layer. For a single hidden layer, this simply means that a supervised loss is added to the output layer, as in Figure 1. For a deeper auto-encoder, the innermost (smallest)1 layer would have a supervised loss added to it—the layer that would usually be handed off to the supervised learner after training the AE. More formally, consider a linear SAE, with a single hidden layer of size k.
The weights for the first layer are F ∈ R^{k×d}. The weights for the output layer consist of Wp ∈ R^{m×k} to predict y and Wr ∈ R^{d×k} to reconstruct x. Let Lp be the supervised (primary) loss and Lr the loss for the reconstruction error. For example, in regression, both losses might be the squared error, resulting in the objective

(1/t) Σ_{i=1}^t [Lp(WpFxi, yi) + Lr(WrFxi, xi)] = (1/(2t)) Σ_{i=1}^t [ ‖WpFxi − yi‖₂² + ‖WrFxi − xi‖₂² ].   (1)

The addition of a supervised loss to the auto-encoder should better direct representation learning towards representations that are effective for the desired tasks. Conversely, solely training a representation according to the supervised tasks, like learning hidden layers in a neural network, is likely an under-constrained problem, and will find solutions that fit the data well but that do not find underlying patterns in the data and do not generalize well. In this way, the combination of the two losses has the promise to both extract underlying structure and provide accurate prediction performance. There have been several empirical papers that have demonstrated the capabilities of semi-supervised autoencoders [16, 27, 34]. Those results focus on the semi-supervised component, where the use of auto-encoders enables the representation to be trained with more unlabeled data. In this paper, however, we would like to determine if, even in the purely supervised setting, the addition of reconstruction error can have a benefit for generalization.

1 The size of the learned representations for deep, nonlinear AEs does not have to be small, but it is common to learn such a lower-dimensional representation.
For linear SAEs, the hidden layer size k < d, as otherwise trivial solutions like the replication of the input are able to minimize the reconstruction error.

(a) (Linear) Supervised Autoencoder  (b) Deep Supervised Autoencoder

Figure 1: Two examples of supervised autoencoders, and where the supervised component—the targets y—is included. We provide generalization performance results for linear SAEs, represented by (a) assuming a linear activation to produce the hidden layer, with arbitrary convex losses on the output layer, such as the cross-entropy for classification. We investigate more general architectures in the experiments, including single-hidden-layer SAEs, represented by (a) with nonlinear activations to produce the hidden layer, and deep SAEs, depicted in (b).

3 Uniform stability and generalization bounds for SAE

In this section, we show that including the reconstruction error theoretically improves generalization performance. We show that linear supervised auto-encoders are uniformly stable, which means that there is a small difference between models learned on any two subsamples of the data that differ in only one instance. Uniformly stable algorithms are known to have good generalization performance [5]. Before showing this result, we discuss a few alternatives, to justify why we pursue uniform stability.

There are at least two alternative strategies that could be considered to theoretically analyze these models: using a multi-task analysis and characterizing the Rademacher complexity of the supervised auto-encoder function class. The reconstruction error can in fact be considered as multiple tasks, where the multiple tasks help regularize or constrain the solution [35]. Previous results for multi-task learning [21–23] demonstrate improved generalization error bounds when learning multiple tasks jointly. Unfortunately, these bounds show performance is improved on average across tasks.
For our setting, we only care about the primary task, with the reconstruction error simply included as an auxiliary task to regularize the solution. An average improvement might actually mean that performance on the primary task degrades with the inclusion of these other tasks. Earlier multi-task work did consider improvement for each task [36], but assumed different randomly generated features for each task and that all tasks are binary classification problems, which does not match this setting.

Another strategy is to characterize the Rademacher complexity of supervised auto-encoders. There has been some work characterizing the Rademacher complexity of unsupervised dimensionality reduction techniques [3, Theorem 3.1]. To the best of our knowledge, however, there does not yet appear to be an analysis of the complexity of supervised dimensionality reduction techniques. There is some work on supervised dimension reduction [2, 3]; however, this analysis assumes a dimensionality reduction step followed by a supervised learning step, rather than a joint training procedure.

For these reasons, we pursue a third direction, where we treat the reconstruction error as a regularizer that promotes stability. Uniform stability has mainly been obtained using norm-based regularization strategies, such as ℓ2. More recently, Liu et al. [24] showed that auxiliary tasks—acting as regularizers—could also provide uniform stability. Because reconstruction error can be considered to be an auxiliary task, our analysis resembles this auxiliary-task analysis. However, there are key differences, as the result by Liu et al. [24] would be uninteresting if simply applied directly to our setting. In particular, the uniform stability bound would not decay with the number of samples. The bound decays proportionally to the number of samples for the primary task, but in the numerator contains the maximum number of samples for an auxiliary task.
For us, this maximum number is exactly the same as the number of samples for the primary task, and so the two would cancel, making the bound independent of the number of samples.

Primary Result

We now show that the parameter shared by the primary task and the reconstruction error—the forward model F—does not change significantly with the change of one sample. This shows that linear SAEs have uniform stability, which then immediately provides a generalization bound from [5, Theorem 12]. The proofs are provided in the appendix, due to space constraints.

Let Lp correspond to the primary (supervised) part of the loss, with weights Wp, and Lr correspond to the auxiliary task that acts as a regularizer (the reconstruction error), with weights Wr. The full loss can be written

L(F) = (1/t) Σ_{i=1}^t [ Lp(WpFxi, yp,i) + Lr(WrFxi, yr,i) ].   (2)

For our specific setting, yr = x. We use more general notation, however, both to clarify the difference between the inputs and outputs, and for future extensions of this theory to other (auxiliary) targets yr. The loss where the m-th sample (xm, ym) is replaced by a random new instance (x′m, y′m) is

Lm(F) = (1/t) [ Lp(WpFx′m, y′p,m) + Lr(WrFx′m, y′r,m) + Σ_{i=1, i≠m}^t ( Lp(WpFxi, yp,i) + Lr(WrFxi, yr,i) ) ].

If we let F, Fm correspond to the optimal forward models for these two losses respectively, then the algorithm is said to be β-uniformly stable if the difference in loss value between these two models, for any point (x, y), is bounded by β with high probability:

|Lp(WpFmx, yp) − Lp(WpFx, yp)| ≤ β.

To obtain uniform stability, we will need to make several assumptions.
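To make the stability notion itself concrete, here is a small self-contained numerical illustration (our own sketch, using closed-form 1-d ridge regression rather than an SAE, and toy data of our choosing): replacing one training sample perturbs the learned model, and the induced change in loss at a fixed test point shrinks as the sample size t grows.

```python
# Illustration of beta-uniform stability (our sketch: ridge regression in 1-d,
# not the SAE bound itself). Replacing one training sample changes the learned
# weight, and the resulting change in loss at a fixed test point decays with t.

def ridge_fit(xs, ys, lam):
    """Closed-form minimizer of (1/t) * sum (w*x - y)^2 + lam * w^2."""
    t = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam * t)

def loss_gap(t, lam=0.1):
    xs = [i / t for i in range(1, t + 1)]      # deterministic inputs in (0, 1]
    ys = [2.0 * x for x in xs]                 # noiseless targets y = 2x
    w = ridge_fit(xs, ys, lam)
    ys_m = ys[:-1] + [ys[-1] + 10.0]           # replace one sample with an outlier
    w_m = ridge_fit(xs, ys_m, lam)
    x_test, y_test = 1.0, 2.0                  # fixed evaluation point
    return abs((w_m * x_test - y_test) ** 2 - (w * x_test - y_test) ** 2)

print(loss_gap(50), loss_gap(5000))  # the gap shrinks roughly as 1/t
```

This only illustrates the definition of stability; the content of Theorem 1 below is that the reconstruction error, rather than the explicit ℓ2 penalty used here, can induce the same O(1/t) behavior.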
The first common assumption is to assume bounded spaces, for both the data and the learned variables.

Assumption 1. The features satisfy ‖x‖2 ≤ Bx and the primary targets satisfy ‖yp‖2 ≤ By. The parameter spaces are bounded,

W = {(Wp, Wr) ∈ R^{m×k} × R^{d×k} : ‖Wp‖F ≤ BWp, ‖Wr‖F ≤ BWr}
F = {F ∈ R^{k×d} : ‖F‖F ≤ BF}

for some positive constants Bx, By, BF, BWp, BWr, where ‖·‖F denotes the Frobenius norm, namely the square root of the sum of the squares of all elements.

For SAE, yr = x, and so ‖x‖2 ≤ Bx implies that ‖yr‖2 ≤ Bx.

Second, we need to ensure that the reconstruction error is both strongly convex and Lipschitz. The next two assumptions are satisfied, for example, by the ℓ2 loss, Lr(ŷ, y) = ‖ŷ − y‖₂².

Assumption 2. The reconstruction loss Lr(·, y) is σr-admissible, i.e., for possible predictions ŷ, ŷ′,

|Lr(ŷ, y) − Lr(ŷ′, y)| ≤ σr ‖ŷ − ŷ′‖2.

Assumption 3. Lr(·, y) is c-strongly-convex: ⟨ŷ − ŷ′, ∇Lr(ŷ, y) − ∇Lr(ŷ′, y)⟩ ≥ c ‖ŷ − ŷ′‖₂².

The growth of the primary loss also needs to be bounded; however, we can use a less stringent requirement than admissibility.

Assumption 4. For some σp > 0, for any F, Fm ∈ F,

|Lp(WpFmx, yp) − Lp(WpFx, yp)| ≤ σp ‖Wr(Fm − F)x‖2.

This requirement should be less stringent because we expect generally that, for two forward models F, Fm, ‖Wp(F − Fm)x‖2 ≤ ‖Wr(F − Fm)x‖2. The matrix Wp ∈ R^{m×k} projects the vector d = (F − Fm)x into a lower-dimensional space, whereas Wr ∈ R^{d×k} projects d into a higher-dimensional space. Because the nullspace of Wp is likely larger, it is more likely that Wp will send a non-zero d to zero. In fact, if Wr is full rank—which occurs if k is less than or equal to the intrinsic rank of the data—then we can guarantee this assumption for some σp as long as Lp is σ-admissible, where likely σp can be smaller than σ. In Corollary 1, we specify the value of σp under a full-rank Wr and σ-admissible Lp.

Finally, we assume that there is a representative set of feature vectors in the sampled data, both in terms of feature vectors (Assumption 5) and loss values (Assumption 6).

Assumption 5. There exists a subset

B = {b1, b2, ..., bn} ⊂ {x1, x2, ..., xt}

such that, with high probability, any sampled feature vector x can be reconstructed by B with a small error: x = Σ_{i=1}^n αi bi + η, where αi ∈ R, Σ_{i=1}^n αi² ≤ r, and ‖η‖2 ≤ ε/t.

Assumption 5 is similar to [24, Assumption 1], except that in our setting the features are the same for all the tasks and the upper bound on ‖η‖ decreases as 1/t. This is a reasonable assumption, since more samples in the training set make it more likely that any x observed with non-negligible probability can be reconstructed. In many cases, η = 0 is a mild assumption, as once d independent vectors bi are observed, η = 0.

This representative set of points also needs to be representative in terms of the reconstruction error. In particular, we need the average reconstruction error of the representative points to be upper bounded by some constant factor of the average reconstruction error under the training set.

Assumption 6. For any two datasets S, Sm, where Sm has the m-th sample replaced with a random new instance, let F, Fm be the corresponding optimal forward models. Let N contain only the reconstruction errors, without the sample that is replaced,

N(F) = (1/t) Σ_{i=1, i≠m}^t Lr(WrFxi, yr,i)   (3)

and let Nb be the reconstruction error for the representative points,

Nb(F) = (1/n) Σ_{i=1}^n Lr(WrFbi, yr,bi)   (4)

where yr,bi is the reconstruction target for representative point bi. Then, there exists a > 0 such that for any small α > 0,

[Nb(F) − Nb((1 − α)F + αFm)] + [Nb(Fm) − Nb((1 − α)Fm + αF)]
≤ a [N(F) − N((1 − α)F + αFm)] + a [N(Fm) − N((1 − α)Fm + αF)].

The above assumption does not require that the difference between N and Nb be small at the two points F and Fm; rather, it only requires that the increase or decrease in error at the two points Fm and F is similar for N and Nb. Both the right-hand side and the left-hand side in the assumption are nonnegative, because of the convexity of N and Nb. Even if N is higher at F than at Fm, and Nb is the opposite, the above bound can hold, because it simply requires that the difference of Nb between Fm and F be bounded above by the difference of N between F and Fm, up to some constant factor a. This assumption is key, because we will need to use Nb to ensure that the bound decays with t, where Nb depends only on the number of representative points, unlike N.

We can now provide the key result: SAE has uniform stability with respect to the shared parameters F.

Theorem 1.
Under Assumptions 1-6, for a randomly sampled x, y, with high probability

|Lp(WpFmx, y) − Lp(WpFx, y)| ≤ (a(σr + σp)nσp)/(ct) · ( r + √( r² + (4εcBWr BF r)/(a(σr + σp)n) ) ) + (2εσp BWr BF)/t.   (5)

Remark: We similarly get an O(1/t) upper bound on instability as Bousquet and Elisseeff [5], but without requiring the ℓ2 regularizer. The ℓ2 regularizer indiscriminately reduces the magnitude of the weights; the reconstruction error, on the other hand, regularizes, but potentially without strongly biasing the solution. It can select amongst a set of possible forward models that predict the targets almost equally well, but that also satisfy the reconstruction error. A hidden representation that is useful for reconstructing the inputs is likely to also be effective for predicting the targets—which are a function of the inputs.

Corollary 1. In Assumption 4, if Wp ∈ R^{m×k}, Wr ∈ R^{d×k}, d ≥ k ≥ m, Wr is full rank and Lp is σ-admissible, then for Wr⁻¹ the inverse of the matrix formed by the first k rows of Wr, σp = σ‖Wp‖F ‖Wr⁻¹‖F.

Finally, we provide a few specific bounds, for particular Lr and Lp, to show how this more general bound can be used (shown explicitly in Appendix B). For example, for a least-squares reconstruction loss Lr, c = 2 and σr = 2BWr BF Bx + 2Bx.

4 Experiments with SAE: Utility of reconstruction error

We now empirically test the utility of incorporating the reconstruction error into NNs, as a method of regularization to improve generalization performance. Our goal is to investigate the impact of the reconstruction error, and so we use the same architecture for SAE and NN, where the only difference is the use of reconstruction error.
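As a concrete sketch of this setup (our own illustration, not the authors' code), the linear SAE objective of Equation (1) differs from the plain network's objective only by the reconstruction term. All weights and data below are toy values chosen for the example:

```python
# Minimal sketch of the linear SAE objective in Equation (1), pure Python.
# F maps inputs (d=2) to the hidden code (k=1); Wp predicts the target,
# Wr reconstructs the input. Weights and data are illustrative toy values.

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

F  = [[0.5, 0.5]]          # k x d encoder
Wp = [[1.0]]               # m x k primary (prediction) head
Wr = [[1.0], [1.0]]        # d x k reconstruction head

data = [([1.0, 1.0], [1.0]), ([2.0, 0.0], [1.0])]   # (x_i, y_i) pairs

def sae_loss(F, Wp, Wr, data):
    t = len(data)
    total = 0.0
    for x, y in data:
        h = matvec(F, x)                       # hidden code h = F x
        total += sq_dist(matvec(Wp, h), y)     # primary:        ||Wp F x - y||^2
        total += sq_dist(matvec(Wr, h), x)     # reconstruction: ||Wr F x - x||^2
    return total / (2 * t)

def nn_loss(F, Wp, data):                      # same network, no reconstruction
    t = len(data)
    return sum(sq_dist(matvec(Wp, matvec(F, x)), y) for x, y in data) / (2 * t)

print(sae_loss(F, Wp, Wr, data), nn_loss(F, Wp, data))
```

Training the SAE means minimizing `sae_loss` over (F, Wp, Wr) instead of `nn_loss` over (F, Wp); at test time only the Wp head is used, so the architectures being compared in this section are otherwise identical.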
We test several different architectures, namely single-hidden-layer SAEs with different activations, adding non-linearity with kernels before using a linear SAE, and a deep SAE with a bottleneck, namely a hidden layer with smaller size than that of the previous layer.

Experimental setup and Datasets. We used 10-fold cross-validation to choose the best meta-parameters for each algorithm on each dataset. The meta-parameters providing the highest classification accuracy averaged across folds are chosen. Using the meta-parameters chosen by cross-validation, we report the average accuracy and standard error across 20 runs, each with a different randomly sampled training-testing split. A new training-testing split is generated by shuffling all data points together and selecting the first samples to be the training set, and the remaining to be the testing set.

SUSY is a high-energy particle physics dataset [37]. The goal is to classify between a process where supersymmetric particles are produced, and a background process where no detectable particles are produced. SUSY was generated to discover hidden representations of raw sensor features for classification [37], and has 8 features and 5 million data points.

Deterding is a vowel dataset [38] containing 11 steady-state vowels of British English spoken by 15 speakers. Every speaker pronounced each of the eleven vowel sounds six times, giving 990 labeled data points. The goal is to classify the vowel sound for each spoken vowel, where each speech signal is converted into a 10-dimensional feature vector using log area ratios based on linear prediction coefficients. We normalized each feature between 0 and 1 through Min-Max scaling.

CIFAR-10 is an image dataset [39] with 10 classes and 60000 32x32 color images. The classes include objects like horses, deer, trucks and airplanes.
For each of the training-test splits, we used a random subset of 50,000 images for training and 10,000 images for testing. We preprocessed the data by averaging together the three colour channels, creating gray-scale images to speed up computation.

MNIST is a dataset [40] of 70000 examples of 28x28 images of handwritten digits from 0 to 9.

We would like to note that for these two benchmark datasets—CIFAR and MNIST—impressive performance has been achieved, such as with a highly complex, deep neural network model for CIFAR [41]. Here, however, we use these datasets to investigate a variety of models, rather than to match the performance of the current state-of-the-art. We do not use the provided single training-testing split, but rather treat these large datasets as an opportunity to generate many (different) training-test splits for a thorough empirical investigation.

Overall results. Figure 2 shows the performance of SAE versus NN. On the Deterding, SUSY and MNIST datasets, we compare them in three different architectures. First, we compare the linear SAE with a linear NN, where there is no activation function from the input to the hidden layer. Second, we nonlinearly transform the data with radial basis functions—a Gaussian kernel—and then use a linear SAE and a linear NN. The kernel expansion enables nonlinear functions to be learned, while the learning step can still benefit from the optimality results provided for the linear SAE. Third, we use nonlinear activation functions, sigmoid and ReLu, from the input to the hidden layer. Though this is outside the scope of the theoretical characterization, it is a relatively small departure and important for understanding the benefits of the reconstruction error for at least simple nonlinear networks.
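The kernel expansion used for the second architecture can be sketched as follows (our illustration; the centers, bandwidth and dimensions are toy choices, not the settings used in the experiments). Each input is mapped to a vector of Gaussian similarities against a set of centers, and the resulting features are handed to a linear SAE or NN:

```python
# Sketch of the Gaussian-kernel expansion behind SAE-Kernel / NN-Kernel (our
# illustration): map each input to RBF similarities against a set of centers,
# then train a *linear* SAE/NN on the expanded features. Centers and gamma
# here are toy values, not the paper's actual settings.
import math

def rbf_features(x, centers, gamma=1.0):
    """phi_j(x) = exp(-gamma * ||x - c_j||^2), one feature per center."""
    return [math.exp(-gamma * sum((xi - ci) ** 2 for xi, ci in zip(x, c)))
            for c in centers]

centers = [[0.0, 0.0], [1.0, 1.0]]   # e.g. sampled training points
x = [1.0, 1.0]

phi = rbf_features(x, centers)
print(phi)  # similarity 1.0 to the matching center, smaller to the other
```

Because the nonlinearity lives entirely in this fixed feature map, the downstream learner remains linear, which is why the linear-SAE theory still applies to this architecture.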
We investigate only networks with single hidden layers as a first step, and to better match the networks characterized in the theoretical guarantees.

Overall, we find that SAE improves performance across settings, in some cases by several percent. Getting even an additional 1% in classification accuracy with just the addition of reconstruction error to relatively simple models is a notable result. We summarize these results in Figure 2 and Table 1. SAE and NN with the same architecture have similar sample variances, so we use a t-test for statistical significance. For all pairs but one, the average accuracy of SAE is statistically significantly higher than that of NN, with significance level 0.0005, though in some cases the differences are quite small, particularly on SUSY and MNIST. In other cases, particularly with kernel representations on Deterding, SAE significantly outperformed NN, with a jump of 18% in classification accuracy. Because we attempted to standardize the models, differing only in SAE using the reconstruction error, these results indicate that the reconstruction error has a clear positive impact on generalization performance.

(a) Deterding dataset  (b) SUSY dataset  (c) MNIST dataset

Figure 2: Test accuracy of a three-layer neural network (NN) and our supervised auto-encoder model (SAE), on three datasets. We focus on the impact of using reconstruction error, and compare SAE and NN with a variety of nonlinear structures, including sigmoid (SAE-Sigmoid and NN-Sigmoid), ReLu (SAE-ReLu and NN-ReLu) and a Gaussian kernel (SAE-Kernel and NN-Kernel). Though not shown in the figure, we also tried initializing NN with pre-trained autoencoders; the performance is similar to NN, and thus outperformed by SAE as well. Overall, SAE consistently outperforms NNs, though in some cases the advantage is small.
Details are shown in Table 1.

              Deterding                     SUSY                          MNIST
              Training       Test           Training      Test           Training      Test
SAE           63.34 ± 0.17   54.98 ± 0.18   76.50 ± 0.03  76.48 ± 0.01   93.70 ± 0.30  92.20 ± 0.40
NN            61.05 ± 0.14   52.50 ± 0.17   76.42 ± 0.02  76.41 ± 0.02   92.50 ± 0.22  91.20 ± 0.20
SAE-Sigmoid   99.38 ± 0.03   90.67 ± 0.12   77.80 ± 0.01  77.79 ± 0.02   96.35 ± 0.05  94.50 ± 0.10
NN-Sigmoid    97.62 ± 0.05   87.00 ± 0.14   76.90 ± 0.01  76.90 ± 0.03   96.20 ± 0.04  92.50 ± 0.10
SAE-ReLu      90.22 ± 0.41   85.47 ± 0.52   72.04 ± 0.33  71.99 ± 0.58   98.25 ± 0.08  98.00 ± 0.10
NN-ReLu       78.76 ± 0.08   72.29 ± 0.67   75.03 ± 0.11  65.27 ± 0.17   98.10 ± 0.09  97.30 ± 0.10
SAE-Kernel    93.15 ± 0.11   92.52 ± 0.10   77.31 ± 0.12  77.27 ± 0.06   97.40 ± 0.18  96.70 ± 0.20
NN-Kernel     82.37 ± 0.41   74.85 ± 0.20   77.42 ± 0.06  77.38 ± 0.06   96.20 ± 0.20  95.50 ± 0.20

Table 1: The percentage accuracy (average accuracy ± standard error) for the results presented in Figure 2. SAE outperforms NNs in terms of average test accuracy across settings. The only exception is the Gaussian kernel on SUSY, where the advantage of NN-Kernel is extremely small. We report train accuracies for further insight and completeness. Note that though there is some amount of overfitting occurring, the models were given the opportunity to select a variety of regularization parameters for ℓ2 regularization as well as dropout using cross-validation.

In the next few sections, we highlight certain properties of interest, in addition to these more general performance results.
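The significance test behind Table 1 can be sketched as follows (our illustration, with made-up per-split accuracies standing in for the 20 real runs; we show the Welch form of the two-sample t statistic, which behaves almost identically to the pooled-variance test when, as reported above, the variances are similar):

```python
# Sketch of the significance test used for Table 1 (our illustration): a
# two-sample t statistic on per-split test accuracies. The accuracy lists
# below are hypothetical stand-ins, not the paper's actual per-split values.
import math

def t_statistic(a, b):
    """Welch's two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

sae_acc = [92.4, 92.6, 92.3, 92.7, 92.5]   # hypothetical per-split accuracies
nn_acc  = [74.9, 74.7, 75.0, 74.8, 74.6]

t = t_statistic(sae_acc, nn_acc)
print(t > 4.0)  # far beyond any 0.0005 critical value at these sample sizes
```

Comparing the resulting statistic against the critical value for the 0.0005 significance level (from a t distribution with the appropriate degrees of freedom) yields the significance claims reported above.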
We highlight robustness to overfitting as model complexity is increased, for both nonlinear activations and kernel transformations. For these experiments, we choose CIFAR, since it is a more complex prediction problem with a large amount of data. We then report preliminary conclusions on the strategy of over-parametrizing and regularizing, rather than using bottleneck layers. Finally, we demonstrate the structure extracted by SAE, to gain some insight into the representation.

Robustness to overfitting. We investigate the impact of increasing the hidden dimension on CIFAR, with sigmoid and ReLu activation functions from the input to the hidden layer. The results are summarized in Figures 3a and 3b, where the hidden dimension is increased from 20 to as large as 10 thousand. Both results indicate that SAE can better take advantage of increasing model complexity, where (a) the NN clearly overfit and obtained poor accuracy with a sigmoid transfer and (b) SAE gained a 2% accuracy improvement over NNs when both used a ReLu transfer.

Results with kernels. The overall conclusion is that SAE can benefit much more from the model complexity given by kernel representations than NNs. In Table 1, the most striking difference between SAE and NNs with kernels occurs for the Deterding dataset. SAE outperforms NN by an entire 18%, going from 75% test accuracy to 92% test accuracy. For SUSY, SAE and NNs were essentially tied; but for that dataset, all the nonlinear architectures performed very similarly, suggesting little improvement could be gained.

(a) Sigmoid activation  (b) ReLu activation  (c) Kernel Representation

Figure 3: Test accuracy of SAE and NN with a variety of nonlinear architectures on CIFAR, with increasing model complexity.
For the sigmoid and ReLu, the hidden dimension is increased; for kernels, the number of centers is increased. (a) For the sigmoid activation, the NN suffers noticeably from overfitting as the hidden dimension increases, whereas SAE is robust to the increase in model complexity. (b) For the ReLu activation, under low model complexity, SAE performed more poorly than the NN. However, given a larger hidden dimension—about half as large as the input dimension—it reaches the same level of performance and is then better able to take advantage of the increased model complexity. The difference of about 2% accuracy improvement for such a simple addition—the reconstruction error—is a striking result. (c) The result here is similar to ReLu. Note that the size of the hidden dimension corresponds to 10% of the number of centers.

On CIFAR, we also investigated the impact of increasing the number of kernel centers, which correspondingly increases model parameters and model complexity. We fixed the hidden dimension to 10% of the number of centers, to see if the SAE could still learn an appropriate model even with an aggressive bottleneck, namely a hidden layer that is small relative to the input, making it hard to reduce the reconstruction error. This helps to verify the hypothesis that the reconstruction error does not incur much bias as a regularizer, and tests a more practical setting where an aggressive bottleneck can significantly speed up computation and convergence. For the NN, because the number of targets is 10, once the hidden dimension k ≥ 10, the bottleneck should have little to no impact on performance, which is what we observe. The result is summarized in Figure 3c, which shows that SAE initially suffers when model complexity is low, but then surpasses the NN with increasing model complexity.
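The kernel construction above can be sketched as follows. This is a hypothetical helper, assuming Euclidean distances, a shared bandwidth, and centers sampled uniformly from the data; the paper's exact kernel and center-selection procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_kernel_features(X, centers, sigma=1.0):
    """Map each input to its Gaussian-kernel similarity to every center."""
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

n, d, n_centers = 100, 16, 50
X = rng.normal(size=(n, d))
# Centers drawn from the data itself (an assumption for this sketch).
centers = X[rng.choice(n, size=n_centers, replace=False)]
Phi = gaussian_kernel_features(X, centers)   # shape (100, 50)

# Aggressive bottleneck: hidden dimension fixed to 10% of the number of
# centers, but no smaller than the number of targets (10 classes here).
hidden_dim = max(10, n_centers // 10)        # -> 10
```

Increasing `n_centers` grows the input representation, and hence the model complexity, while `hidden_dim` stays a small fraction of it, which is the bottleneck regime examined in Figure 3c.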
In general, we anticipate the effects with kernels and SAE to be more pronounced with more sophisticated selection of kernels and centers.

Demonstration of SAE with a Deep Architecture. We investigate the effects of adding the reconstruction loss to deep convolutional models on CIFAR. We use a network with two convolutional layers of sizes {32, 64} and 4 dense layers of sizes {2048, 512, 128, 32} with ReLu activation. Unlike our previous experiments, we do not use grey-scale CIFAR, but instead use all three color channels for the deep networks to make maximal use of the convolutional layers.

As shown in Figure 4, SAE outperforms NN consistently in both train and test accuracies, suggesting that SAE is able to find a different and better solution than NN in the optimization on the training data and to generalize well on the test data. We show the performance of SAE with decreasing weight on the predictive loss, which increases the effect of the reconstruction error. Interestingly, a weight of 0.01 performed the best, but performance began to degrade with lower values. At the extreme, a weight of 0, which corresponds to a pure auto-encoder, performed significantly worse, so the combination of both losses is necessary. We discuss other variants we tried in the caption for Figure 4, but the conclusions remain consistent: SAE improves generalization performance over NNs.

5 Conclusion

In this paper, we systematically investigated supervised auto-encoders (SAEs), as an approach to using unsupervised auxiliary tasks to improve generalization performance. We showed theoretically that the addition of reconstruction error improves generalization performance, for linear SAEs.
We showed empirically, across four different datasets, with a variety of architectures, that SAE never harms performance but in some cases can significantly improve performance, particularly when using kernels and under ReLu activations, for both shallow and deep architectures.

(a) Train accuracy  (b) Test accuracy

Figure 4: Train and test accuracy of SAE and NN with a deep architecture. The numbers 0.01, 0.1 and 1.0 denote the weights on the prediction error, with a constant weight of 1.0 on the reconstruction error. We also compared to auto-encoders, with a two-stage training strategy where the auto-encoder is trained first, with the representation then used for the supervised learner, but this performed poorly (about 0.4 test accuracy). We additionally investigated both dropout and ℓ2 regularization. We find that dropout increases the variance of independent runs, and improves each algorithm by approximately three percentage points over its reported test set accuracy. Using ℓ2 regularization did not improve performance. Under both dropout and ℓ2, the advantage of SAE over NN in both train and test accuracies remained consistent, and so these graphs are representative of those additional settings. Finally, we additionally compared to the ResNet-18 architecture [42]. For a fair comparison, we do not use the image augmentation originally used in training ResNet-18. We find that ResNet-18, with nearly double the total learnable parameters, achieved a test set accuracy only two percentage points higher than our SAE with reconstruction loss.

References

[1] Sham M Kakade, Karthik Sridharan, and Ambuj Tewari.
On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization. In Advances in Neural Information Processing Systems, 2008.

[2] Mehryar Mohri, Afshin Rostamizadeh, and Dmitry Storcheus. Generalization Bounds for Supervised Dimensionality Reduction. In NIPS Workshop on Feature Extraction: Modern Questions and Challenges, 2015.

[3] Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Adaptive metric dimensionality reduction. Theoretical Computer Science, 2016.

[4] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 2002.

[5] Olivier Bousquet and André Elisseeff. Stability and Generalization. Journal of Machine Learning Research, 2002.

[6] Tong Zhang. Covering Number Bounds of Certain Regularized Linear Function Classes. Journal of Machine Learning Research, 2002.

[7] Stefan Wager, Sida Wang, and Percy S Liang. Dropout Training as Adaptive Regularization. In Advances in Neural Information Processing Systems, 2013.

[8] N Srivastava, G Hinton, A Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 2014.

[9] Hyeonwoo Noh, Tackgeun You, Jonghwan Mun, and Bohyung Han. Regularizing Deep Neural Networks by Noise: Its Interpretation and Optimization. In Advances in Neural Information Processing Systems, 2017.

[10] N Morgan and H Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In Advances in Neural Information Processing Systems, 1990.

[11] Larry Yaeger, Richard Lyon, and Brandyn Webb. Effective Training of a Neural Network Character Classifier for Word Recognition.
In Advances in Neural Information Processing Systems, 1997.

[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, 2012.

[13] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[14] Bin-Bin Gao, Chao Xing, Chen-Wei Xie, Jianxin Wu, and Xin Geng. Deep Label Distribution Learning With Label Ambiguity. IEEE Transactions on Image Processing, 2017.

[15] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, 2007.

[16] Marc'Aurelio Ranzato and Martin Szummer. Semi-supervised learning of compact document representations with deep networks. In International Conference on Machine Learning, 2008.

[17] Miguel Á Carreira-Perpiñán and Weiran Wang. Distributed optimization of deeply nested systems. In International Conference on Artificial Intelligence and Statistics, 2014.

[18] Rich Caruana. Multitask Learning. Machine Learning, 1997.

[19] Jonathan Baxter. Learning internal representations. In Annual Conference on Learning Theory, 1995.

[20] Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 2000.

[21] Andreas Maurer. Bounds for Linear Multi-Task Learning. Journal of Machine Learning Research, 2006.

[22] Andreas Maurer and Massimiliano Pontil. Excess risk bounds for multitask learning with trace norm regularization.
In Annual Conference on Learning Theory, 2013.

[23] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The Benefit of Multitask Representation Learning. arXiv:1505.06279, 2015.

[24] Tongliang Liu, Dacheng Tao, Mingli Song, and Stephen J Maybank. Algorithm-Dependent Generalization Bounds for Multi-Task Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[25] Jason Weston, Frédéric Ratle, and Ronan Collobert. Deep learning via semi-supervised embedding. In International Conference on Machine Learning, 2008.

[26] Alexander G Ororbia II, C Lee Giles, and David Reitter. Learning a Deep Hybrid Model for Semi-Supervised Text Classification. In EMNLP, 2015.

[27] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised Learning with Ladder Networks. In Advances in Neural Information Processing Systems, 2015.

[28] Yaser S Abu-Mostafa. Learning from hints in neural networks. Journal of Complexity, 1990.

[29] Yuting Zhang, Kibok Lee, and Honglak Lee. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In International Conference on Machine Learning, pages 612–621, 2016.

[30] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 1989.

[31] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

[32] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research, 2010.

[33] Marc'Aurelio Ranzato, Christopher S Poultney, Sumit Chopra, and Yann LeCun. Efficient Learning of Sparse Representations with an Energy-Based Model.
In Advances in Neural Information Processing Systems, 2006.

[34] Anupriya Gogna and Angshul Majumdar. Semi Supervised Autoencoder. In Neural Information Processing, 2016.

[35] R Caruana and V R De Sa. Promoting poor features to supervisors: Some inputs work better as outputs. In Advances in Neural Information Processing Systems, 1997.

[36] S Ben-David and R Schuller. Exploiting task relatedness for multiple task learning. Lecture Notes in Computer Science, 2003.

[37] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for Exotic Particles in High-Energy Physics with Deep Learning. Nature Communications, 2014.

[38] David Henry Deterding. Speaker normalisation for automatic speech recognition. PhD thesis, University of Cambridge, 1990.

[39] A Krizhevsky and G Hinton. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.

[40] Y LeCun, L Bottou, Y Bengio, and P Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

[41] Benjamin Graham. Fractional Max-Pooling. arXiv:1411.4000v2 [cs.LG], 2014.

[42] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.