{"title": "General Stochastic Networks for Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 2015, "page_last": 2023, "abstract": "We extend generative stochastic networks to supervised learning of representations. In particular, we introduce a hybrid training objective considering a generative and discriminative cost function governed by a trade-off parameter lambda. We use a new variant of network training involving noise injection, i.e. walkback training, to jointly optimize multiple network layers. Neither additional regularization constraints, such as l1, l2 norms or dropout variants, nor pooling- or convolutional layers were added. Nevertheless, we are able to obtain state-of-the-art performance on the MNIST dataset, without using permutation invariant digits and outperform baseline models on sub-variants of the MNIST and rectangles dataset significantly.", "full_text": "General Stochastic Networks for Classi\ufb01cation\n\nMatthias Z\u00a8ohrer and Franz Pernkopf\n\nSignal Processing and Speech Communication Laboratory\n\nmatthias.zoehrer@tugraz.at, pernkopf@tugraz.at\n\nGraz University of Technology\n\nAbstract\n\nWe extend generative stochastic networks to supervised learning of representa-\ntions. In particular, we introduce a hybrid training objective considering a genera-\ntive and discriminative cost function governed by a trade-off parameter \u03bb. We use\na new variant of network training involving noise injection, i.e. walkback train-\ning, to jointly optimize multiple network layers. Neither additional regularization\nconstraints, such as (cid:96)1, (cid:96)2 norms or dropout variants, nor pooling- or convolu-\ntional layers were added. 
Nevertheless, we are able to obtain competitive performance on the MNIST dataset, without using permutation invariant digits, and significantly outperform baseline models on sub-variants of the MNIST and rectangles datasets.\n\n1 Introduction\n\nSince 2006 there has been a boost in machine learning due to improvements in the field of unsupervised learning of representations. Most accomplishments originate from variants of restricted Boltzmann machines (RBMs) [1], auto-encoders (AE) [2, 3] and sparse coding [4, 5, 6]. Deep models in representation learning also obtain impressive results in supervised learning problems, such as speech recognition, e.g. [7, 8, 9], and computer vision tasks [10].\nIf no a-priori knowledge is modeled in the architecture, cf. convolutional or pooling layers [11], generatively pre-trained networks are among the best when applied to supervised learning tasks [12]. Usually, a generative representation is obtained through a greedy layer-wise training procedure called contrastive divergence (CD) [1]. In this case, a network layer learns the representation from the layer below by treating the latter as static input. Despite the impressive results achieved with CD, we identify two (minor) drawbacks when it is used for supervised learning: Firstly, after obtaining a representation by pre-training a network, a new discriminative model is initialized with the trained weights, splitting the training into two separate models. This seems to be neither biologically plausible nor optimal for optimization, as carefully designed early stopping criteria have to be implemented to prevent over- or under-fitting. Secondly, generative and discriminative objectives might influence each other beneficially when combined during training. CD does not take this into account.\nIn this work, we introduce a new training procedure for supervised learning of representations. 
In particular, we define a hybrid training objective for discriminative-generative stochastic networks (dGSN), dividing the cost function into a generative and a discriminative part, controlled by a trade-off parameter \u03bb. It turns out that by annealing \u03bb when solving this unconstrained non-convex multi-objective optimization problem, we do not suffer from the shortcomings described above. We are able to obtain competitive performance on the MNIST [13] dataset, without using permutation invariant digits, and significantly outperform baseline models on sub-variants of the MNIST and rectangles database [14].\nOur approach is related to the discriminative-generative training approach of RBMs [15]. However, a different model and a new variant of network training involving noise injection, i.e. walkback training [16, 17], is used to jointly optimize multiple network layers. Most notably, we did not apply any additional regularization constraints, such as \u21131, \u21132 norms or dropout variants [12], [18], unlocking further potential for possible optimizations. The model can be extended to learn multiple tasks at the same time using jointly trained weights and by introducing multiple objectives. This might also open a new prospect in the field of transfer learning [19] and multi-task learning [20] beyond classification.\nThis paper is organized as follows: Section 2 presents mathematical background material, i.e. the dGSN and a hybrid learning criterion. In Section 3 we empirically study the influence of hyperparameters of dGSNs and present experimental results. Section 4 concludes the paper and provides a perspective on future work.\n\n2 General Stochastic Networks\n\nRecently, a new unsupervised learning algorithm called walkback training for generalized auto-encoders (GAE) was introduced [16]. 
A follow-up study [17] defined a new network model, generative stochastic networks (GSN), extending the idea of walkback training to multiple layers. When applied to image reconstruction, GSNs were able to outperform various baseline systems due to their ability to learn multi-modal representations [17, 21]. In this paper, we extend the work of [17]. First, we provide mathematical background material for generative stochastic networks. Then, we introduce modifications to make the model suitable for supervised learning. In particular, we present a hybrid training objective, dividing the cost into a generative and a discriminative part. This paves the way for any multi-objective learning of GSNs. We also introduce a new terminology, i.e. general stochastic networks, a model class including generative, discriminative and hybrid stochastic network variants.\n\nGeneral Stochastic Networks for Unsupervised Learning\n\nRestricted Boltzmann machines (RBM) [22] and denoising autoencoders (DAE) [3] share the following commonality: the input distribution P(X) is sampled to convergence in a Markov chain. In the case of the DAE, the transition operator first samples the hidden state Ht from a corruption distribution C(H|X), and generates a reconstruction from the parametrized model, i.e. the density P\u03b82(X|H). The resulting DAE Markov chain, shown in Figure 1, is defined as\n\nHt+1 \u223c P\u03b81(H|Xt+0) and Xt+1 \u223c P\u03b82(X|Ht+1),   (1)\n\nFigure 1: DAE Markov chain.\n\nwhere Xt+0 is the input sample X, fed into the chain at time step 0 and Xt+1 is the reconstruction of X at time step 1. In the case of a GSN, an additional dependency between the latent variables Ht over time is introduced to the network graph. 
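The DAE transition operator in (1) can be sketched in a few lines. The following toy numpy example is only an illustration of the sampling scheme, not the paper's implementation: the shapes, the Gaussian corruption, and the sigmoid non-linearity are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameters of a one-hidden-layer DAE (theta1 = encoder, theta2 = decoder).
W = rng.normal(0.0, 0.01, size=(4, 6))  # 4 visible units, 6 hidden units
b_h = np.zeros(6)
b_x = np.zeros(4)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(x, sigma=0.1):
    """One step of the DAE Markov chain: sample H_{t+1} through the
    corruption/encoding path, then reconstruct X_{t+1} from H_{t+1}."""
    x_corr = x + rng.normal(0.0, sigma, size=x.shape)  # corruption C(H|X)
    h = sigmoid(x_corr @ W + b_h)                      # H_{t+1} ~ P_theta1(H|X_t)
    x_rec = sigmoid(h @ W.T + b_x)                     # X_{t+1} ~ P_theta2(X|H_{t+1})
    return x_rec, h

# Run the chain for a few steps starting from an input sample.
x = rng.random(4)
for _ in range(3):
    x, h = dae_step(x)
print(x.shape, h.shape)
```

Iterating `dae_step` realizes the Markov chain of Figure 1; the GSN chain below differs only in that the hidden state is carried over between steps.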
The GSN Markov chain is defined as follows:\n\nHt+1 \u223c P\u03b81(H|Ht+0, Xt+0) and Xt+1 \u223c P\u03b82(X|Ht+1).   (2)\n\nFigure 2 shows the corresponding network graph.\n\nFigure 2: GSN Markov chain.\n\nThis chain can be expressed with deterministic functions of random variables f\u03b8 \u2287 { \u02c6f\u03b8, \u02c7f\u03b8 }. In particular, the density f\u03b8 is used to model Ht+1 = f\u03b8(Xt+0, Zt+0, Ht+0), specified for some independent noise source Zt+0, with the condition that Xt+0 cannot be recovered exactly from Ht+1.\nWe introduce \u02c6f^i_\u03b8 as a backprop-able stochastic non-linearity of the form \u02c6f^i_\u03b8 = \u03b7_out + g(\u03b7_in + \u02c6a^i) with noise processes Zt \u2287 {\u03b7_in, \u03b7_out} for layer i. The variable \u02c6a^i is the activation for unit i, where \u02c6a^i = W^i I^i_t + b^i with a weight matrix W^i and bias b^i, representing the parametric distribution. It is embedded in a non-linear activation function g. The input I^i_t of H^i_t is either the realization x^i_t of the observed sample X^i_t or the hidden realization h^i_t. In general, \u02c6f^i_\u03b8(I^i_t) specifies an upward path in a GSN for a specific layer i. In the case of X^i_{t+1} = \u02c7f^i_\u03b8(Zt+0, Ht+1) we define \u02c7f^i_\u03b8(H^i_t) = \u03b7_out + g(\u03b7_in + \u02c7a^i) as a downward path in the network, i.e. \u02c7a^i = (W^i)^T H^i_t + b^i, using the transpose of the weight matrix W^i and the bias b^i. This formulation allows to directly back-propagate the reconstruction log-likelihood P(X|H) for all parameters \u03b8 \u2287 {W^0, ..., W^d, b^0, ..., b^d}, where d is the number of hidden layers. In Figure 2 the GSN includes a single hidden layer. This can be extended to multiple hidden layers, requiring multiple deterministic functions of random variables f\u03b8 \u2208 { \u02c6f^0_\u03b8, ..., \u02c6f^d_\u03b8, \u02c7f^0_\u03b8, ..., \u02c7f^d_\u03b8 }.\nFigure 3 visualizes the Markov chain for a multi-layer GSN, inspired by the unfolded computational graph of a deep Boltzmann machine Gibbs sampling process.\n\nFigure 3: GSN Markov chain with multiple layers and backprop-able stochastic units.\n\nIn the training case, alternately even and odd layers are updated at the same time. The information is propagated both upwards and downwards for K steps, allowing the network to build higher-order representations. An example of this update process is given in Figure 3. For k = 0, in the even update (marked in red) H^1_{t+1} = \u02c6f^0_\u03b8(X^0_{t+0}); in the odd update (marked in blue) X^0_{t+1} = \u02c7f^0_\u03b8(H^1_{t+1}) and H^2_{t+2} = \u02c6f^1_\u03b8(H^1_{t+1}). For k = 1, H^1_{t+2} = \u02c6f^0_\u03b8(X^0_{t+1}) + \u02c7f^1_\u03b8(H^2_{t+2}) and H^3_{t+3} = \u02c6f^2_\u03b8(H^2_{t+2}) in the even update, and X^0_{t+2} = \u02c7f^0_\u03b8(H^1_{t+2}) and H^2_{t+3} = \u02c6f^1_\u03b8(H^1_{t+2}) + \u02c7f^2_\u03b8(H^3_{t+3}) in the odd update. For k = 2, H^1_{t+3} = \u02c6f^0_\u03b8(X^0_{t+2}) + \u02c7f^1_\u03b8(H^2_{t+3}) and H^3_{t+4} = \u02c6f^2_\u03b8(H^2_{t+3}) in the even update, and X^0_{t+3} = \u02c7f^0_\u03b8(H^1_{t+3}) and H^2_{t+4} = \u02c6f^1_\u03b8(H^1_{t+3}) + \u02c7f^2_\u03b8(H^3_{t+4}) in the odd update.\nThe cost function of a generative GSN can be written as:\n\nC = \u2211_{k=1}^{K} Lt{X^0_{t+k}, Xt+0},   (3)\n\nLt is a specific loss function, such as the mean squared error (MSE), at time step t. In general any arbitrary loss function could be used (as long as it can be interpreted as a log-likelihood) [16]. X^0_{t+k} is the reconstruction of the input X^0_{t+0} at layer 0 after k steps. Optimizing the loss function by summing the costs of multiple corrupted reconstructions is called walkback training [16, 17]. This form of network training leads to a significant performance boost when used for input reconstruction. The network is able to handle multi-modal input representations and is therefore considerably more favorable than standard generative models [16].\n\nGeneral Stochastic Networks for Supervised Learning\n\nIn order to make a GSN suitable for a supervised learning task we introduce the output Y to the network graph. 
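The generative walkback cost in (3) can be sketched as follows, again as a toy numpy illustration rather than the paper's Theano implementation: a one-hidden-layer GSN with tied weights, tanh as the non-linearity g, Gaussian pre- and post-activation noise, and the MSE loss; all shapes and parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy one-hidden-layer GSN parameters (4 visible units, 6 hidden units).
W = rng.normal(0.0, 0.01, size=(4, 6))
b_h, b_x = np.zeros(6), np.zeros(4)

def f_up(x, sigma=0.1):
    """Upward path \\hat{f}_theta: pre- and post-activation noise around tanh."""
    return np.tanh(x @ W + b_h + rng.normal(0, sigma, 6)) + rng.normal(0, sigma, 6)

def f_down(h, sigma=0.1):
    """Downward path \\check{f}_theta with the transposed (tied) weights."""
    return np.tanh(h @ W.T + b_x + rng.normal(0, sigma, 4)) + rng.normal(0, sigma, 4)

def walkback_cost(x0, K=6):
    """Eq. (3): sum of MSE losses between the K corrupted
    reconstructions X^0_{t+k} and the clean input X_{t+0}."""
    x, cost = x0, 0.0
    for _ in range(K):
        h = f_up(x)
        x = f_down(h)                    # reconstruction X^0_{t+k}
        cost += np.mean((x - x0) ** 2)   # L_t{X^0_{t+k}, X_{t+0}}
    return cost

c = walkback_cost(rng.random(4))
```

In an actual dGSN this scalar would be minimized by back-propagating through all K steps, which is possible because the noise enters additively around differentiable units.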
In this case L = log P(X) + log P(Y|X). Although the target Y is not fed into the network, it is introduced as an additional cost term. The layer update process stays the same.\n\nFigure 4: dGSN Markov chain for input Xt+0 and target Yt+0 with backprop-able stochastic units.\n\nWe define the following cost function for a 3-layer dGSN:\n\nC = (\u03bb / K) \u2211_{k=1}^{K} Lt{Xt+k, Xt+0} + ((1 \u2212 \u03bb) / (K \u2212 d + 1)) \u2211_{k=d}^{K} Lt{H^3_{t+k}, Yt+0},   (4)\n\nwhere the first term is the generative part and the second term is the discriminative part. This is a non-convex multi-objective optimization problem, where \u03bb weights the generative and discriminative part of C. The parameter d specifies the number of network layers, i.e. the depth of the network. Scaling the mean loss in (4) is not mandatory, but allows to equally balance both loss terms with \u03bb = 0.5 for input Xt+0 and target Yt+0 scaled to the same range. Again, Figure 4 shows the corresponding network graph for supervised learning, with red and blue edges denoting the even and odd network updates.\nIn general the hybrid objective optimization criterion is not restricted to \u27e8X, Y\u27e9, as additional input and output terms could be introduced to the network. This setup might be useful for transfer learning [19] or multi-task scenarios [20], which is not discussed in this paper.\n\n3 Experimental Results\n\nIn order to evaluate the capabilities of dGSNs for supervised learning, we studied MNIST digits [13], variants of MNIST digits [14] and the rectangles datasets [14]. The first database consists of 60,000 labeled training and 10,000 labeled test images of handwritten digits. The second dataset includes variants of MNIST digits, i.e. { mnist-basic, mnist-rot, mnist-back-rand, mnist-back-image, mnist-rot-back-image }, with additional factors of variation added to the original data. Each variant includes 10,000 labeled training, 2,000 labeled validation, and 50,000 labeled test images. The third dataset involves two subsets, i.e. { rectangles, rectangles-image }. The dataset rectangles consists of 1,000 labeled training, 200 labeled validation, and 50,000 labeled test images. The dataset rectangles-image includes 10,000 labeled training, 2,000 labeled validation and 50,000 labeled test images.\nIn a first experiment we focused on the multi-objective optimization problem defined in (4). Next we evaluated the number of walkback steps in a dGSN necessary for convergence. In a third experiment we analyzed the influence of different Gaussian noise settings during walkback training, improving the generalization capabilities of the network. Finally we summarize classification results for all datasets and compare to baseline systems [14].\n\n3.1 Multi-Objective Optimization in a Hybrid Learning Setup\n\nIn order to solve the non-convex multi-objective optimization problem, variants of stochastic gradient descent (SGD) can be used. We applied a search over fixed \u03bb values on all problems. Furthermore, we show that the use of an annealed \u03bb factor during training works best in practice.\nIn all experiments a three-layer dGSN, i.e. dGSN-3, with 2000 neurons in each layer, randomly initialized with small Gaussian noise, i.e. 0.01 \u00b7 N(0, 1), and an MSE loss function for both inputs and targets was used. 
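The hybrid objective of (4) with MSE terms can be sketched directly from its definition. The function below is a minimal numpy illustration; the argument names, dummy array shapes, and the way reconstructions and top-layer outputs are passed in are assumptions for the sketch, not the paper's interface.

```python
import numpy as np

def hybrid_cost(x_recs, h3_outs, x0, y0, lam, d=3):
    """Eq. (4): lambda-weighted generative MSE over the K reconstructions
    plus (1 - lambda)-weighted discriminative MSE over the top-layer
    outputs H^3_{t+k} for k = d..K, each scaled by its number of terms."""
    K = len(x_recs)
    gen = lam / K * sum(np.mean((xr - x0) ** 2) for xr in x_recs)
    disc = (1.0 - lam) / (K - d + 1) * sum(
        np.mean((h - y0) ** 2) for h in h3_outs[d - 1:]  # steps k = d..K
    )
    return gen + disc

# Dummy data: K = 6 walkback steps, 4 input units, 10 target classes.
K = 6
x_recs = [np.full(4, 0.5)] * K   # reconstructions X^0_{t+k}
h3_outs = [np.ones(10)] * K      # top-layer outputs H^3_{t+k}
c = hybrid_cost(x_recs, h3_outs, np.zeros(4), np.ones(10), lam=0.5)
```

With inputs and targets scaled to the same range, `lam=0.5` balances both terms, matching the remark after (4); `lam=1.0` recovers the purely generative walkback cost.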
Regarding optimization we applied SGD with a learning rate \u03b7 = 0.1, a momentum term of 0.9 and a multiplicative annealing factor \u03b7_{n+1} = \u03b7_n \u00b7 0.99 per epoch n for the learning rate. A rectifier unit [23] was chosen as activation function. Following the ideas of [24], no explicit sampling was applied at the input and output layer. In the test case the zero-one loss was computed by averaging the network\u2019s output over the K walkback steps.\n\nAnalysis of the Hybrid Learning Parameter \u03bb\n\nConcerning the influence of the trade-off parameter \u03bb, we tested fixed \u03bb values in the range \u03bb \u2208 {0.01, 0.1, 0.2, ..., 0.9, 0.99}, where low values emphasize the discriminative part in the objective and vice versa. Walkback training with K = 6 steps using pre- and post-activation Gaussian noise with zero mean and \u03c3 = 0.1 was performed for 500 training epochs. In a more dynamic scenario, \u03bb_{n=1} = 1 was annealed by \u03bb_{n+1} = \u03bb_n \u00b7 \u03c4 to reach \u03bb_{n=500} \u2208 {0.01, 0.1, 0.2, ..., 0.9, 0.99} within 500 epochs, simulating generative pre-training to a certain extent.\n\nFigure 5: Influence of dynamic and static \u03bb on MNIST variants basic (left), rotated (middle) and background (right), where \u22c6 denotes the training-, \u25b3 the validation- and \u25bd the test-set. The dashed line denotes the static setup, the bold line the dynamic setup.\n\nFigure 5 compares the results of both dGSNs, using static and dynamic \u03bb setups on the MNIST variants basic, rotated and background. The use of a dynamic, i.e. annealed, \u03bb with \u03bb_{n=500} = 0.01 achieved the best validation and test error in all experiments. In this case, more attention was given to the generative proportion P(X) of the objective (4) in the early stage of training. After approximately 400 epochs discriminative training, i.e. fine-tuning, dominates. 
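The multiplicative annealing schedule \u03bb_{n+1} = \u03bb_n \u00b7 \u03c4 is easy to reproduce. In the experiments a fixed \u03c4 = 0.99 is used; the sketch below instead derives \u03c4 from a chosen target value, an assumption made here only to show how a desired \u03bb_{n=500} can be reached.

```python
def lambda_schedule(lam_target, n_epochs=500, lam0=1.0):
    """Multiplicative annealing lambda_{n+1} = lambda_n * tau, with tau
    chosen so that lam0 * tau**n_epochs == lam_target (illustrative choice;
    the paper's experiments simply fix tau = 0.99)."""
    tau = (lam_target / lam0) ** (1.0 / n_epochs)
    lam, lams = lam0, []
    for _ in range(n_epochs):
        lam *= tau
        lams.append(lam)
    return lams

lams = lambda_schedule(0.01)
print(round(lams[-1], 6))
```

Early in training \u03bb stays close to 1, so the generative term dominates; it then decays smoothly toward the discriminative regime, which is the "soft pre-training" behavior described above.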
This setup is closely related to DBN training, where the emphasis is on optimizing P(X) at the beginning of the optimization, whereas P(Y|X) is important at the last stages. In case of the dGSN, the annealed \u03bb achieves a smoother transition by shifting the weight in the optimization criterion from P(X) to P(Y|X) within one model.\n\nAnalysis of Walkback Steps K\n\nIn a next experiment we tested the influence of K walkback steps for dGSNs. Figure 6 shows the results for different dGSNs, trained with K \u2208 {6, 7, 8, 9, 10} walkback steps and annealed \u03bb with \u03c4 = 0.99. In all cases the information was propagated at least once upwards and once downwards in the d = 3 layer network, using fixed Gaussian pre- and post-activation noise with \u00b5 = 0 and \u03c3 = 0.1.\n\nFigure 6: Evaluating the number of walkback steps on MNIST variants basic (left), rotated (middle) and background (right), where \u22c6 denotes the training-, \u25b3 the validation- and \u25bd the test-set.\n\nFigure 6 shows that increasing the walkback steps does not improve the generalization capabilities of the used GSNs. The setup K = 2 \u00b7 d is sufficient for convergence and achieves the best validation and test result in all experiments.\n\nAnalysis of Pre- and Post-Activation Noise\n\nInjecting noise during the training process of GSNs serves as a regularizer and improves the generalization capabilities of the model [17]. In this experiment the influence of Gaussian pre- and post-activation noise with \u00b5 = 0 and \u03c3 \u2208 {0.05, 0.1, 0.15, 0.2, 0.25, 0.3}, as well as deactivated noise during training, was tested on a dGSN-3 trained for K = 6 walkback steps. The trade-off factor \u03bb was annealed with \u03c4 = 0.99. Figure 7 summarizes the results of the different dGSNs for the MNIST variants basic, rotated and background. Setting \u03c3 = 0.1 achieved the best overall result on the validation- and test-set for all three experiments. In all other cases the dGSNs either over- or underfitted the data.\n\nFigure 7: Evaluating noise injections during training on MNIST variants basic (left), rotated (middle) and background (right), where \u22c6 denotes the training-, \u25b3 the validation- and \u25bd the test-set.\n\n3.2 MNIST Results\n\nTable 1 presents the average classification error of three runs on all MNIST variation datasets obtained by a dGSN-3, using fixed Gaussian pre- and post-activation noise with \u00b5 = 0, \u03c3 = 0.1 and K = 6 walkback steps. The hybrid learning parameter \u03bb was annealed with \u03c4 = 0.99 and \u03bb_{n=1} = 1. A small grid test was performed in the range of N \u00d7 d with N \u2208 {1000, 2000, 3000} neurons per layer for d \u2208 {1, 2, 3} layers to find the optimal network configuration.\n\nDataset | SVMrbf | SVMpoly | NNet | DBN-1 | SAA-3 | DBN-3 | dGSN-3\nmnist-basic | 3.03\u00b10.15 | 3.69\u00b10.17 | 4.69\u00b10.19 | 3.94\u00b10.17 | 3.46\u00b10.16 | 3.11\u00b10.15 | 2.40\u00b10.04\nmnist-rot* | 11.11\u00b10.28 | 15.42\u00b10.32 | 18.11\u00b10.34 | 10.30\u00b10.27 | 10.30\u00b10.27 | 14.69\u00b10.31 | 8.66\u00b10.08\nmnist-back-rand | 14.58\u00b10.31 | 16.62\u00b10.33 | 20.04\u00b10.35 | 9.80\u00b10.26 | 11.28\u00b10.28 | 6.73\u00b10.22 | 9.38\u00b10.03\nmnist-back-image | 22.61\u00b10.37 | 24.01\u00b10.37 | 27.41\u00b10.39 | 16.15\u00b10.32 | 23.00\u00b10.37 | 16.31\u00b10.32 | 16.04\u00b10.04\nmnist-rot-back-image* | 55.18\u00b10.44 | 56.41\u00b10.43 | 62.16\u00b10.43 | 47.39\u00b10.44 | 51.93\u00b10.44 | 52.21\u00b10.44 | 43.86\u00b10.05\nrectangles | 2.15\u00b10.13 | 2.15\u00b10.13 | 7.16\u00b10.23 | 4.71\u00b10.19 | 2.41\u00b10.13 | 2.60\u00b10.14 | 2.04\u00b10.04\nrectangles-image | 24.04\u00b10.37 | 24.05\u00b10.37 | 33.20\u00b10.41 | 23.69\u00b10.37 | 24.05\u00b10.37 | 22.50\u00b10.37 | 22.10\u00b10.03\n\nTable 1: MNIST variations and rectangles results [14]; for datasets marked by (*), updated results are shown [25].\n\nTable 1 shows that a three-layer dGSN clearly outperforms all other models, except on the MNIST random-background dataset. In particular, when comparing the dGSN-3 to the radial basis function support vector machine (SVMrbf), i.e. the second best model on MNIST basic, the dGSN-3 achieved a relative improvement of 20.79% on the test set. On the MNIST rotated dataset the dGSN-3 was able to beat the second best model, i.e. DBN-1, by 15.92% on the test set. On the MNIST rotated-background dataset there is a relative improvement of 7.25% on the test set between the second best model, i.e. DBN-1, and the dGSN-3. All results are statistically significant. Regarding the number of model parameters, although we cannot directly compare the models in terms of network parameters, it is worth mentioning that a far smaller grid test was used to generate the results for all dGSNs, cf. [14]. When comparing the classification error of the dGSN-3 trained without noise, obtained in the previous experiments (Figure 7), with Table 1, the dGSN-3 achieved a test error of 2.72% on the MNIST variant basic, outperforming all other models on this task. On the MNIST variant rotated, the dGSN-3 also outperformed the DBN-3, obtaining a test error of 11.2%. This indicates that not only the Gaussian regularizer in the walkback training improves the generalization capabilities of the network, but also the hybrid training criterion of the dGSN.\nTable 2 lists the results for the MNIST dataset without additional affine transformations applied to the data, i.e. permutation invariant digits. 
A three-layer dGSN achieved a test error of 0.95%.\n\nNetwork | Result\nRectifier MLP + dropout [12] | 1.05%\nDBM [26] | 0.95%\ndGSN-3 | 0.95%\nMaxout MLP + dropout [27] | 0.94%\nMP-DBM [28] | 0.91%\nDeep Convex Network [29] | 0.83%\nManifold Tangent Classifier [30] | 0.81%\nDBM + dropout [12] | 0.79%\n\nTable 2: MNIST results.\n\nIt might be worth noting that, in addition to the noise process in walkback training, no other regularizers, such as \u21131, \u21132 norms and dropout variants [12], [18], were used in the dGSNs. In general \u2264 800 training epochs with early stopping are necessary for dGSN training.\nAll simulations1 were executed on a GPU with the help of the mathematical expression compiler Theano [31].\n\n4 Conclusions and Future Work\n\nWe have extended GSNs for classification problems. In particular we defined a hybrid multi-objective training criterion for GSNs, dividing the cost function into a generative and a discriminative part. This eliminates the need for generative pre-training. We analyzed the influence of the objective\u2019s trade-off parameter \u03bb empirically, showing that by annealing \u03bb we outperform a static choice of \u03bb. Furthermore, we discussed the effects of noise injections and sampling steps during walkback training. As a conservative starting point we restricted the model to use only rectifier units. Neither additional regularization constraints, such as \u21131, \u21132 norms or dropout variants [12], [18], nor pooling [11, 32] or convolutional layers [11] were added. Nevertheless, the GSN was able to outperform various baseline systems, in particular a deep belief network (DBN), a multi-layer perceptron (MLP), a support vector machine (SVM) and a stacked auto-associator (SAA), on variants of the MNIST dataset. Furthermore, we also achieved state-of-the-art performance on the original MNIST dataset without permutation invariant digits. 
The model not only converges faster in terms of training iterations, but also shows better generalization behavior in most cases. Our approach opens a wide field of new applications for GSNs. In future research we will explore adaptive noise injection methods for GSNs and non-convex multi-objective optimization strategies.\n\nReferences\n[1] G. E. Hinton, S. Osindero, and Y. Teh, \u201cA fast learning algorithm for deep belief nets,\u201d Neural computation, vol. 18, no. 7, pp. 1527\u20131554, 2006.\n\n[2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, \u201cGreedy layer-wise training of deep networks,\u201d in Advances in Neural Information Processing Systems (NIPS), 2007, pp. 153\u2013160.\n\n[3] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol, \u201cExtracting and composing robust features with denoising autoencoders,\u201d in International Conference on Machine Learning (ICML), 2008, pp. 1096\u20131103.\n\n[4] H. Lee, A. Battle, R. Raina, and A. Y. Ng, \u201cEfficient sparse coding algorithms,\u201d in Advances in Neural Information Processing Systems (NIPS), 2007, pp. 801\u2013808.\n\n[5] J. Ngiam, Z. Chen, S. A. Bhaskar, P. W. Koh, and A. Y. Ng, \u201cSparse filtering,\u201d in Advances in Neural Information Processing Systems (NIPS), 2011, pp. 1125\u20131133.\n\n[6] M. Ranzato, M. Poultney, S. Chopra, and Y. LeCun, \u201cEfficient learning of sparse representations with an energy-based model,\u201d in Advances in Neural Information Processing Systems (NIPS), 2006, pp. 1137\u20131144.\n\n[7] G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, \u201cPhone recognition with the mean-covariance restricted Boltzmann machine,\u201d in Advances in Neural Information Processing Systems (NIPS), 2010, pp. 469\u2013477.\n\n[8] L. Deng, M. L. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. E. Hinton, \u201cBinary coding of speech spectrograms using a deep auto-encoder,\u201d in Interspeech, 2010, pp. 
1692\u20131695.\n\n[9] F. Seide, G. Li, and D. Yu, \u201cConversational speech transcription using context-dependent deep neural networks,\u201d in Interspeech, 2011, pp. 437\u2013440.\n\n[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, \u201cImagenet classification with deep convolutional neural networks,\u201d in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097\u20131105.\n\n[11] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, \u201cGradient-based learning applied to document recognition,\u201d Proceedings of the IEEE, vol. 86, no. 11, 1998.\n\n[12] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, \u201cImproving neural networks by preventing co-adaptation of feature detectors,\u201d CoRR, vol. abs/1207.0580, 2012.\n\n1The code will be made publicly available for reproducing the results.\n\n[13] Y. Lecun and C. Cortes, \u201cThe MNIST database of handwritten digits,\u201d 2014. [Online]. Available: http://yann.lecun.com/exdb/mnist/\n\n[14] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, \u201cAn empirical evaluation of deep architectures on problems with many factors of variation,\u201d in International Conference on Machine Learning (ICML), 2007, pp. 473\u2013480.\n\n[15] H. Larochelle, M. Mandel, R. Pascanu, and Y. Bengio, \u201cLearning algorithms for the classification restricted Boltzmann machine,\u201d Journal of Machine Learning Research (JMLR), vol. 13, pp. 643\u2013669, 2012.\n\n[16] Y. Bengio, L. Yao, G. Alain, and P. Vincent, \u201cGeneralized denoising auto-encoders as generative models,\u201d in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 899\u2013907.\n\n[17] Y. Bengio, E. Thibodeau-Laufer, and J. Yosinski, \u201cDeep generative stochastic networks trainable by backprop,\u201d CoRR, vol. abs/1306.1091, 2013.\n\n[18] L. Wan and M. 
Zeiler, \u201cRegularization of neural networks using dropconnect,\u201d in International Conference on Machine Learning (ICML), 2013, pp. 109\u2013111.\n\n[19] G. Mesnil, Y. Dauphin, X. Glorot, S. Rifai, Y. Bengio, I. J. Goodfellow, E. Lavoie, X. Muller, G. Desjardins, D. Warde-Farley, P. Vincent, A. Courville, and J. Bergstra, \u201cUnsupervised and transfer learning challenge: a deep learning approach,\u201d in Unsupervised and Transfer Learning challenge and workshop (JMLR W&CP), 2012, pp. 97\u2013110.\n\n[20] K. Abhishek and D. Hal, \u201cLearning task grouping and overlap in multi-task learning,\u201d in International Conference on Machine Learning (ICML), 2012.\n\n[21] S. Ozair, L. Yao, and Y. Bengio, \u201cMultimodal transitions for generative stochastic networks,\u201d CoRR, vol. abs/1312.5578, 2013.\n\n[22] P. Smolensky, Information processing in dynamical systems: Foundations of harmony theory. MIT Press, 1986, vol. 1, no. 1, pp. 194\u2013281.\n\n[23] X. Glorot, A. Bordes, and Y. Bengio, \u201cDeep sparse rectifier neural networks,\u201d in International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, pp. 315\u2013323.\n\n[24] G. E. Hinton, \u201cA practical guide to training restricted Boltzmann machines,\u201d in Neural Networks: Tricks of the Trade (2nd ed.), ser. Lecture Notes in Computer Science. Springer, 2012, pp. 599\u2013619.\n\n[25] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, \u201cOnline companion for the paper an empirical evaluation of deep architectures on problems with many factors of variation,\u201d 2014. [Online]. Available: http://www.iro.umontreal.ca/\u223clisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007\n\n[26] R. Salakhutdinov and G. E. Hinton, \u201cDeep Boltzmann machines,\u201d in International Conference on Artificial Intelligence and Statistics (AISTATS), 2009, pp. 448\u2013455.\n\n[27] I. J. Goodfellow, D. 
Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, \u201cMaxout networks,\u201d in International Conference on Machine Learning (ICML), 2013, pp. 1319\u20131327.\n\n[28] I. J. Goodfellow, A. C. Courville, and Y. Bengio, \u201cJoint training deep Boltzmann machines for classification,\u201d CoRR, vol. abs/1301.3568, 2013.\n\n[29] D. Yu and L. Deng, \u201cDeep convex net: A scalable architecture for speech pattern classification,\u201d in Interspeech, 2011, pp. 2285\u20132288.\n\n[30] S. Rifai, Y. Dauphin, P. Vincent, Y. Bengio, and X. Muller, \u201cThe manifold tangent classifier,\u201d in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 2294\u20132302.\n\n[31] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, \u201cTheano: a CPU and GPU math expression compiler,\u201d in Python for Scientific Computing Conference (SciPy), 2010.\n\n[32] M. Zeiler and R. Fergus, \u201cStochastic pooling for regularization of deep convolutional neural networks,\u201d CoRR, vol. abs/1301.3557, 2013.", "award": [], "sourceid": 1090, "authors": [{"given_name": "Matthias", "family_name": "Z\u00f6hrer", "institution": "Graz University of Technology"}, {"given_name": "Franz", "family_name": "Pernkopf", "institution": "Signal Processing and Speech Communication Laboratory, Graz, Austria"}]}