{"title": "Semi-supervised Learning with Ladder Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 3546, "page_last": 3554, "abstract": "We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on top of the Ladder network proposed by Valpola (2015) which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification in addition to permutation-invariant MNIST classification with all labels.", "full_text": "Semi-Supervised Learning with Ladder Networks\n\nAntti Rasmus and Harri Valpola\nThe Curious AI Company, Finland\n\nMikko Honkala\nNokia Labs, Finland\n\nMathias Berglund and Tapani Raiko\nAalto University, Finland & The Curious AI Company, Finland\n\nAbstract\n\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on top of the Ladder network proposed by Valpola [1] which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification in addition to permutation-invariant MNIST classification with all labels.\n\n1 Introduction\n\nIn this paper, we introduce an unsupervised learning method that fits well with supervised learning. Combining an auxiliary task to help train a neural network was proposed by Suddarth and Kergosien [2]. 
There are multiple choices for the unsupervised task, for example reconstruction of the inputs at every level of the model [e.g., 3] or classification of each input sample into its own class [4]. Although some methods have been able to simultaneously apply both supervised and unsupervised learning [3, 5], often these unsupervised auxiliary tasks are only applied as pre-training, followed by normal supervised learning [e.g., 6]. In complex tasks there is often much more structure in the inputs than can be represented, and unsupervised learning cannot, by definition, know what will be useful for the task at hand. Consider, for instance, the autoencoder approach applied to natural images: an auxiliary decoder network tries to reconstruct the original input from the internal representation. The autoencoder will try to preserve all the details needed for reconstructing the image at pixel level, even though classification is typically invariant to all kinds of transformations which do not preserve pixel values.\nOur approach follows Valpola [1] who proposed a Ladder network where the auxiliary task is to denoise representations at every level of the model. The model structure is an autoencoder with skip connections from the encoder to decoder, and the learning task is similar to that in denoising autoencoders but applied at every layer, not just the inputs. The skip connections relieve the pressure to represent details at the higher layers of the model because, through the skip connections, the decoder can recover any details discarded by the encoder. Previously the Ladder network has only been demonstrated in unsupervised learning [1, 7] but we now combine it with supervised learning.\nThe key aspects of the approach are as follows:\nCompatibility with supervised methods. The unsupervised part focuses on relevant details found by supervised learning. 
Furthermore, it can be added to existing feedforward neural networks, for example multi-layer perceptrons (MLPs) or convolutional neural networks (CNNs).\nScalability due to local learning. In addition to the supervised learning target at the top layer, the model has local unsupervised learning targets on every layer, making it suitable for very deep neural networks. We demonstrate this with two deep supervised network architectures.\nComputational efficiency. The encoder part of the model corresponds to normal supervised learning. Adding a decoder, as proposed in this paper, approximately triples the computation during training but not necessarily the training time, since the same result can be achieved faster due to better utilization of the available information. Overall, computation per update scales similarly to whichever supervised learning approach is used, with a small multiplicative factor.\nAs explained in Section 2, the skip connections and layer-wise unsupervised targets effectively turn autoencoders into hierarchical latent variable models which are known to be well suited for semi-supervised learning. Indeed, we obtain state-of-the-art results in semi-supervised learning in the MNIST, permutation invariant MNIST and CIFAR-10 classification tasks (Section 4). However, the improvements are not limited to semi-supervised settings: for the permutation invariant MNIST task, we also achieve a new record with the normal full-labeled setting. For a longer version of this paper with more complete descriptions, please see [8].\n\n2 Derivation and justification\n\nLatent variable models are an attractive approach to semi-supervised learning because they can combine supervised and unsupervised learning in a principled way. The only difference is whether the class labels are observed or not. This approach was taken, for instance, by Goodfellow et al. [5] with their multi-prediction deep Boltzmann machine. 
A particularly attractive property of hierarchical latent variable models is that they can, in general, leave the details for the lower levels to represent, allowing higher levels to focus on more invariant, abstract features that turn out to be relevant for the task at hand.\nThe training process of latent variable models can typically be split into inference and learning, that is, finding the posterior probability of the unobserved latent variables and then updating the underlying probability model to better fit the observations. For instance, in the expectation-maximization (EM) algorithm, the E-step corresponds to finding the expectation of the latent variables over the posterior distribution assuming the model fixed, and the M-step then maximizes the underlying probability model assuming the expectation fixed.\nThe main problem with latent variable models is how to make inference and learning efficient. Suppose there are layers l of latent variables z(l). Latent variable models often represent the probability distribution of all the variables explicitly as a product of terms, such as p(z(l) | z(l+1)) in directed graphical models. The inference process and model updates are then derived from Bayes' rule, typically as some kind of approximation. Often the inference is iterative as it is generally impossible to solve the resulting equations in a closed form as a function of the observed variables.\nThere is a close connection between denoising and probabilistic modeling. On the one hand, given a probabilistic model, you can compute the optimal denoising. Say you want to reconstruct a latent z using a prior p(z) and an observation z̃ = z + noise. We first compute the posterior distribution p(z | z̃), and use its center of gravity as the reconstruction ẑ. One can show that this minimizes the expected denoising cost (ẑ − z)². 
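This posterior-mean denoiser has a closed form when the prior p(z) is a Gaussian mixture and the corruption is additive Gaussian noise. A minimal NumPy sketch, with all parameter values (modes at ±2, standard deviations of 0.5) chosen purely for illustration:

```python
import numpy as np

def posterior_mean_denoiser(z_tilde, means, stds, weights, noise_std):
    """Optimal denoiser z_hat = E[z | z_tilde] for a Gaussian-mixture prior
    p(z) = sum_k w_k N(z; m_k, s_k^2) and corruption z_tilde = z + N(0, noise_std^2)."""
    z_tilde = np.atleast_1d(np.asarray(z_tilde, dtype=float))
    m, s, w = (np.asarray(v, dtype=float) for v in (means, stds, weights))
    # Evidence of z_tilde under each component: N(z_tilde; m_k, s_k^2 + noise_std^2)
    var = s ** 2 + noise_std ** 2
    resp = w * np.exp(-(z_tilde[:, None] - m) ** 2 / (2 * var)) / np.sqrt(var)
    resp /= resp.sum(axis=1, keepdims=True)      # posterior component responsibilities
    # Per-component posterior mean from standard Gaussian conditioning
    post_mean = (m * noise_std ** 2 + z_tilde[:, None] * s ** 2) / var
    return (resp * post_mean).sum(axis=1)        # center of gravity of p(z | z_tilde)

# A bimodal prior with modes at -2 and +2, as in Figure 1 (left):
g = lambda zt: posterior_mean_denoiser(zt, means=[-2.0, 2.0], stds=[0.5, 0.5],
                                       weights=[0.5, 0.5], noise_std=0.5)
```

By symmetry g(0) = 0, while corrupted values between the modes are pulled toward the nearer mode, matching the behavior depicted by the green arrows in Figure 1.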
On the other hand, given a denoising function, one can draw samples from the corresponding distribution by creating a Markov chain that alternates between corruption and denoising [9].\nValpola [1] proposed the Ladder network where the inference process itself can be learned by using the principle of denoising, which has been used in supervised learning [10], denoising autoencoders (dAE) [11] and denoising source separation (DSS) [12] for complementary tasks. In dAE, an autoencoder is trained to reconstruct the original observation x from a corrupted version x̃. Learning is based simply on minimizing the norm of the difference of the original x and its reconstruction x̂ from the corrupted x̃, that is, the cost is ‖x̂ − x‖².\nWhile dAEs are normally only trained to denoise the observations, the DSS framework is based on the idea of using denoising functions ẑ = g(z) of latent variables z to train a mapping z = f(x) which models the likelihood of the latent variables as a function of the observations. The cost function is identical to that used in a dAE except that latent variables z replace the observations x, that is, the cost is ‖ẑ − z‖². The only thing to keep in mind is that z needs to be normalized somehow as otherwise the model has a trivial solution at z = ẑ = constant. In a dAE, this cannot happen as the model cannot change the input x.\nFigure 1: Left: A depiction of an optimal denoising function for a bimodal distribution. The input for the function is the corrupted value (x axis) and the target is the clean value (y axis). The denoising function moves values towards higher probabilities as shown by the green arrows. Right: A conceptual illustration of the Ladder network when L = 2. The feedforward path (x → z(1) → z(2) → y) shares the mappings f(l) with the corrupted feedforward path, or encoder (x → z̃(1) → z̃(2) → ỹ). The decoder (z̃(l) → ẑ(l) → x̂) consists of denoising functions g(l) and has cost functions C_d^(l) on each layer trying to minimize the difference between ẑ(l) and z(l). The output ỹ of the encoder can also be trained to match available labels t(n).\nFigure 1 (left) depicts the optimal denoising function ẑ = g(z̃) for a one-dimensional bimodal distribution which could be the distribution of a latent variable inside a larger model. The shape of the denoising function depends on the distribution of z and the properties of the corruption noise. With no noise at all, the optimal denoising function would be the identity function. In general, the denoising function pushes the values towards higher probabilities as shown by the green arrows.\nFigure 1 (right) shows the structure of the Ladder network. Every layer contributes to the cost function a term C_d^(l) = ‖z(l) − ẑ(l)‖² which trains the layers above (both encoder and decoder) to learn the denoising function ẑ(l) = g(l)(z̃(l), ẑ(l+1)) which maps the corrupted z̃(l) onto the denoised estimate ẑ(l). As the estimate ẑ(l) incorporates all prior knowledge about z, the same cost function term also trains the encoder layers below to find cleaner features which better match the prior expectation.\nSince the cost function needs both the clean z(l) and corrupted z̃(l), during training the encoder is run twice: a clean pass for z(l) and a corrupted pass for z̃(l). 
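The two encoder passes and the top-down decoder can be sketched end to end. The toy NumPy version below is illustrative only: it uses fixed random weights, ReLU activations, and a simple averaging stand-in for the learned denoising function g (the paper's g is a parametrized nonlinearity, described in Section 3), but it follows the clean-pass/corrupted-pass structure and the layer-wise cost described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def batchnorm(z, eps=1e-6):
    # Normalize each unit over the batch (no trainable scale/shift here)
    return (z - z.mean(axis=0)) / (z.std(axis=0) + eps)

def ladder_cost(x, W, V, lambdas, noise_std=0.3):
    """Unsupervised Ladder cost sum_l lambda_l * mean((z(l) - z_hat(l))^2)."""
    L, n = len(W), x.shape[0]
    # Corrupted pass: z_tilde[l] are noisy batch-normalized pre-activations
    z_tilde = [x + noise_std * rng.standard_normal(x.shape)]
    h = z_tilde[0]
    for l in range(L):
        zt = batchnorm(h @ W[l]) + noise_std * rng.standard_normal((n, W[l].shape[1]))
        z_tilde.append(zt)
        h = np.maximum(zt, 0.0)                 # ReLU activation
    # Clean pass: same weights, no noise; z_clean[l] are the denoising targets
    z_clean, h = [x], x
    for l in range(L):
        z = batchnorm(h @ W[l])
        z_clean.append(z)
        h = np.maximum(z, 0.0)
    # Decoder: top-down reconstruction with a lateral skip from z_tilde[l]
    z_hat = [None] * (L + 1)
    u = batchnorm(z_tilde[L])
    for l in range(L, -1, -1):
        z_hat[l] = 0.5 * (z_tilde[l] + u)       # stand-in for the learned g(z_tilde, u)
        if l > 0:
            u = batchnorm(z_hat[l] @ V[l - 1])  # vertical mapping to the layer below
    return sum(lam * np.mean((zc - zh) ** 2)
               for lam, zc, zh in zip(lambdas, z_clean, z_hat))

# Tiny example: L = 2 layers with widths 5 -> 4 -> 3
W = [0.3 * rng.standard_normal((5, 4)), 0.3 * rng.standard_normal((4, 3))]
V = [0.3 * rng.standard_normal((4, 5)), 0.3 * rng.standard_normal((3, 4))]
x = rng.standard_normal((100, 5))
cost = ladder_cost(x, W, V, lambdas=[1.0, 0.1, 0.1])
```

In training, this unsupervised cost would be added to the supervised cross-entropy on the corrupted pass and minimized jointly by backpropagation.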
Another feature which differentiates the Ladder network from regular dAEs is that each layer has a skip connection between the encoder and decoder. This feature mimics the inference structure of latent variable models and makes it possible for the higher levels of the network to leave some of the details for lower levels to represent. Rasmus et al. [7] showed that such skip connections allow dAEs to focus on abstract invariant features on the higher levels, making the Ladder network a good fit with supervised learning that can select which information is relevant for the task at hand.\nOne way to picture the Ladder network is to consider it as a collection of nested denoising autoencoders which share parts of the denoising machinery between each other. From the viewpoint of the autoencoder at layer l, the representations on the higher layers can be treated as hidden neurons. In other words, there is no particular reason why ẑ(l+i) produced by the decoder should resemble the corresponding representations z(l+i) produced by the encoder. It is only the cost function C_d^(l+i) that ties these together and forces the inference to proceed in a reverse order in the decoder. 
This sharing helps a deep denoising autoencoder to learn the denoising process as it splits the task into meaningful sub-tasks of denoising intermediate representations.\n\nAlgorithm 1 Calculation of the output y and cost function C of the Ladder network\nRequire: x(n)\n# Corrupted encoder and classifier\nh̃(0) ← z̃(0) ← x(n) + noise\nfor l = 1 to L do\n  z̃(l) ← batchnorm(W(l) h̃(l−1)) + noise\n  h̃(l) ← activation(γ(l) ⊙ (z̃(l) + β(l)))\nend for\nP(ỹ | x) ← h̃(L)\n# Clean encoder (for denoising targets)\nh(0) ← z(0) ← x(n)\nfor l = 1 to L do\n  z_pre(l) ← W(l) h(l−1)\n  μ(l) ← batchmean(z_pre(l))\n  σ(l) ← batchstd(z_pre(l))\n  z(l) ← batchnorm(z_pre(l))\n  h(l) ← activation(γ(l) ⊙ (z(l) + β(l)))\nend for\n# Final classification:\nP(y | x) ← h(L)\n# Decoder and denoising\nfor l = L to 0 do\n  if l = L then\n    u(L) ← batchnorm(h̃(L))\n  else\n    u(l) ← batchnorm(V(l+1) ẑ(l+1))\n  end if\n  ∀i: ẑ_i(l) ← g(z̃_i(l), u_i(l))\n  ∀i: ẑ_i,BN(l) ← (ẑ_i(l) − μ_i(l)) / σ_i(l)\nend for\n# Cost function C for training:\nC ← 0\nif t(n) then\n  C ← −log P(ỹ = t(n) | x(n))\nend if\nC ← C + Σ_{l=0}^{L} λ_l ‖z(l) − ẑ_BN(l)‖²\n\n3 Implementation of the Model\n\nWe implement the Ladder network for fully connected MLP networks and for convolutional networks. We used standard rectifier networks with batch normalization applied to each preactivation. The feedforward pass of the full Ladder network is listed in Algorithm 1.\nIn the decoder, we parametrize the denoising function such that it supports denoising of conditionally independent Gaussian latent variables, conditioned on the activations ẑ(l+1) of the layer above. The denoising function g is therefore coupled into components ẑ_i(l) = g_i(z̃_i(l), u_i(l)) = (z̃_i(l) − μ_i(u_i(l))) υ_i(u_i(l)) + μ_i(u_i(l)), where u(l) = batchnorm(V(l+1) ẑ(l+1)) propagates information from ẑ(l+1). The functions μ_i(u_i(l)) and υ_i(u_i(l)) are modeled as expressive nonlinearities: μ_i(u_i(l)) = a_1,i(l) sigmoid(a_2,i(l) u_i(l) + a_3,i(l)) + a_4,i(l) u_i(l) + a_5,i(l), with the form of the nonlinearity similar for υ_i(u_i(l)). The decoder thus has 10 unit-wise parameters a, compared to the two parameters (β and γ [13]) in the encoder.\nIt is worth noting that a simple special case of the decoder is a model where λ_l = 0 when l < L. This corresponds to a denoising cost only on the top layer and means that most of the decoder can be omitted. This model, which we call the Γ-model due to the shape of the graph, is useful as it can easily be plugged into any feedforward network without decoder implementation.\nFurther implementation details of the model can be found in the supplementary material or Ref. [8].\n\n4 Experiments\n\nWe ran experiments both with the MNIST and CIFAR-10 datasets, where we attached the decoder both to fully-connected MLP networks and to convolutional neural networks. We also compared the performance of the simpler Γ-model (Sec. 3) to the full Ladder network.\nWith convolutional networks, our focus was exclusively on semi-supervised learning. We make claims neither about the optimality nor the statistical significance of the supervised baseline results.\nWe used the Adam optimization algorithm [14]. The initial learning rate was 0.002 and it was decreased linearly to zero during a final annealing phase. The minibatch size was 100. The source code for all the experiments is available at https://github.com/arasmus/ladder.\n\nTable 1: A collection of previously reported MNIST test errors in the permutation invariant setting followed by the results with the Ladder network. * = SVM. Standard deviation in parenthesis.\n\nTest error % with # of used labels | 100 | 1000 | All\nSemi-sup. Embedding [15] | 16.86 | 5.73 | 1.5\nTransductive SVM [from 15] | 16.81 | 5.38 | 1.40*\nMTC [16] | 12.03 | 3.64 | 0.81\nPseudo-label [17] | 10.49 | 3.46 | \nAtlasRBF [18] | 8.10 (± 0.95) | 3.68 (± 0.12) | 1.31\nDGN [19] | 3.33 (± 0.14) | 2.40 (± 0.02) | 0.96\nDBM, Dropout [20] | | | 0.79\nAdversarial [21] | | | 0.78\nVirtual Adversarial [22] | 2.12 | 1.32 | 0.64 (± 0.03)\nBaseline: MLP, BN, Gaussian noise | 21.74 (± 1.77) | 5.70 (± 0.20) | 0.80 (± 0.03)\nΓ-model (Ladder with only top-level cost) | 3.06 (± 1.44) | 1.53 (± 0.10) | 0.78 (± 0.03)\nLadder, only bottom-level cost | 1.09 (± 0.32) | 0.90 (± 0.05) | 0.59 (± 0.03)\nLadder, full | 1.06 (± 0.37) | 0.84 (± 0.08) | 0.57 (± 0.02)\n\n4.1 MNIST dataset\n\nFor evaluating semi-supervised learning, we randomly split the 60 000 training samples into a 10 000-sample validation set and used M = 50 000 samples as the training set. From the training set, we randomly chose N = 100, 1000, or all labels for the supervised cost.1 All the samples were used for the decoder which does not need the labels. The validation set was used for evaluating the model structure and hyperparameters. We also balanced the classes to ensure that no particular class was over-represented. We repeated the training 10 times varying the random seed for the splits.\nAfter optimizing the hyperparameters, we performed the final test runs using all the M = 60 000 training samples with 10 different random initializations of the weight matrices and data splits. 
We trained all the models for 100 epochs followed by 50 epochs of annealing.\n\n4.1.1 Fully-connected MLP\n\nA useful test for general learning algorithms is the permutation invariant MNIST classification task. We chose the layer sizes of the baseline model to be 784-1000-500-250-250-250-10.\nThe hyperparameters we tuned for each model are the noise level that is added to the inputs and to each layer, and the denoising cost multipliers λ(l). We also ran the supervised baseline model with various noise levels. For models with just one cost multiplier, we optimized them with a search grid {. . ., 0.1, 0.2, 0.5, 1, 2, 5, 10, . . .}. Ladder networks with a cost function on all layers have a much larger search space and we explored it much more sparsely. For the complete set of selected denoising cost multipliers and other hyperparameters, please refer to the code.\nThe results presented in Table 1 show that the proposed method outperforms all the previously reported results. Encouraged by the good results, we also tested with N = 50 labels and got a test error of 1.62 % (± 0.65 %).\nThe simple Γ-model also performed surprisingly well, particularly for N = 1000 labels. With N = 100 labels, all models sometimes failed to converge properly. With bottom level or full cost in Ladder, around 5 % of runs result in a test error of over 2 %. In order to be able to estimate the average test error reliably in the presence of such random outliers, we ran 40 instead of 10 test runs with random initializations.\n\n1In all the experiments, we were careful not to optimize any parameters, hyperparameters, or model choices based on the results on the held-out test samples. As is customary, we used 10 000 labeled validation samples even for those settings where we only used 100 labeled samples for training. Obviously this is not something that could be done in a real case with just 100 labeled samples. 
However, MNIST classification is such an easy task even in the permutation invariant case that 100 labeled samples there correspond to a far greater number of labeled samples in many other datasets.\n\nTable 2: CNN results for MNIST\n\nTest error without data augmentation % with # of used labels | 100 | all\nEmbedCNN [15] | 7.75 | \nSWWAE [24] | 9.17 | 0.71\nBaseline: Conv-Small, supervised only | 6.43 (± 0.84) | 0.36\nConv-FC | 0.99 (± 0.15) | \nConv-Small, Γ-model | 0.89 (± 0.50) | \n\n4.1.2 Convolutional networks\n\nWe tested two convolutional networks for the general MNIST classification task and focused on the 100-label case. The first network was a straightforward extension of the fully-connected network tested in the permutation invariant case. We turned the first fully connected layer into a convolution with 26-by-26 filters, resulting in a 3-by-3 spatial map of 1000 features. Each of the 9 spatial locations was processed independently by a network with the same structure as in the previous section, finally resulting in a 3-by-3 spatial map of 10 features. These were pooled with a global mean-pooling layer. We used the same hyperparameters that were optimal for the permutation invariant task. In Table 2, this model is referred to as Conv-FC.\nWith the second network, which was inspired by ConvPool-CNN-C from Springenberg et al. [23], we only tested the Γ-model. The exact architecture of this network is detailed in the supplementary material or Ref. [8]. It is referred to as Conv-Small since it is a smaller version of the network used for the CIFAR-10 dataset.\nThe results in Table 2 confirm that even the single convolution on the bottom level improves the results over the fully connected network. More convolutions improve the Γ-model significantly although the variance is still high. The Ladder network with denoising targets on every level converges much more reliably. 
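The shape arithmetic behind Conv-FC can be checked directly: a "valid" 26-by-26 convolution over a 28-by-28 input leaves a 3-by-3 spatial map (28 − 26 + 1 = 3), and global mean pooling reduces the final 3-by-3 map of 10 features to the class scores. A minimal NumPy sketch, where random filters and a single random linear map stand in for the trained convolution and per-location MLP (both hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# One 28x28 MNIST-sized input and 1000 random 26x26 "valid" convolution filters
x = rng.standard_normal((28, 28))
filters = rng.standard_normal((1000, 26, 26))

# Valid convolution: output side = 28 - 26 + 1 = 3
out = 28 - 26 + 1
patches = np.stack([x[i:i + 26, j:j + 26] for i in range(out) for j in range(out)])
fmap = np.einsum('pij,fij->pf', patches, filters).reshape(out, out, 1000)

# Each of the 9 locations is processed by the same per-location network;
# here a random 1000 -> 10 linear map stands in for the paper's MLP
head = rng.standard_normal((1000, 10))
logits_map = fmap @ head                      # 3x3 spatial map of 10 features
class_scores = logits_map.mean(axis=(0, 1))   # global mean pooling over locations
```

The resulting `class_scores` has shape (10,), one score per MNIST class, confirming the dimensions described above.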
Taken together, these results suggest that combining the generalization ability of convolutional networks2 and efficient unsupervised learning of the full Ladder network would have resulted in even better performance, but this was left for future work.\n\n4.2 Convolutional networks on CIFAR-10\n\nThe CIFAR-10 dataset consists of small 32-by-32 RGB images from 10 classes. There are 50 000 labeled samples for training and 10 000 for testing. We decided to test the simple Γ-model with the convolutional architecture ConvPool-CNN-C by Springenberg et al. [23]. The main differences to ConvPool-CNN-C are the use of Gaussian noise instead of dropout and the convolutional per-channel batch normalization following Ioffe and Szegedy [25]. For a more detailed description of the model, please refer to model Conv-Large in the supplementary material.\nThe hyperparameters (noise level, denoising cost multipliers and number of epochs) for all models were optimized using M = 40 000 samples for training and the remaining 10 000 samples for validation. After the best hyperparameters were selected, the final model was trained with these settings on all the M = 50 000 samples. All experiments were run with 4 different random initializations of the weight matrices and data splits. We applied global contrast normalization and whitening following Goodfellow et al. [26], but no data augmentation was used.\nThe results are shown in Table 3. The supervised reference was obtained with a model closer to the original ConvPool-CNN-C in the sense that dropout rather than additive Gaussian noise was used for regularization.3 We spent some time in tuning the regularization of our fully supervised baseline model for N = 4 000 labels and indeed, its results exceed the previous state of the art. 
This tuning was important to make sure that the improvement offered by the denoising target of the Γ-model is not a sign of a poorly regularized baseline model. Although the improvement is not as dramatic as with the MNIST experiments, it came with a very simple addition to standard supervised training.\n\n2In general, convolutional networks excel in the MNIST classification task. The performance of the fully supervised Conv-Small with all labels is in line with the literature and is provided as a rough reference only (only one run, no attempts to optimize, not available in the code package).\n3Same caveats hold for this fully supervised reference result for all labels as with MNIST: only one run, no attempts to optimize, not available in the code package.\n\nTable 3: Test results for CNN on CIFAR-10 dataset without data augmentation\n\nTest error % with # of used labels | 4 000 | All\nAll-Convolutional ConvPool-CNN-C [23] | | 9.31\nSpike-and-Slab Sparse Coding [27] | 31.9 | \nBaseline: Conv-Large, supervised only | 23.33 (± 0.61) | 9.27\nConv-Large, Γ-model | 20.40 (± 0.47) | \n\n5 Related Work\n\nEarly works in semi-supervised learning [28, 29] proposed an approach where inputs x are first assigned to clusters, and each cluster has its class label. Unlabeled data would affect the shapes and sizes of the clusters, and thus alter the classification result. Label propagation methods [30] estimate P(y | x), but adjust probabilistic labels q(y(n)) based on the assumption that nearest neighbors are likely to have the same label. Weston et al. [15] explored deep versions of label propagation.\nThere is an interesting connection between our Γ-model and the contractive cost used by Rifai et al. [16]: a linear denoising function ẑ_i(L) = a_i z̃_i(L) + b_i, where a_i and b_i are parameters, turns the denoising cost into a stochastic estimate of the contractive cost. 
In other words, our Γ-model seems to combine clustering and label propagation with regularization by contractive cost.\nRecently Miyato et al. [22] achieved impressive results with a regularization method that is similar to the idea of contractive cost. They required the output of the network to change as little as possible close to the input samples. As this requires no labels, they were able to use unlabeled samples for regularization.\nThe Multi-prediction deep Boltzmann machine (MP-DBM) [5] is a way to train a DBM with backpropagation through variational inference. The targets of the inference include both supervised targets (classification) and unsupervised targets (reconstruction of missing inputs) that are used in training simultaneously. The connections through the inference network are somewhat analogous to our lateral connections. Specifically, there are inference paths from observed inputs to reconstructed inputs that do not go all the way up to the highest layers. Compared to our approach, MP-DBM requires an iterative inference with some initialization for the hidden activations, whereas in our case, the inference is a simple single-pass feedforward procedure.\nKingma et al. [19] proposed deep generative models for semi-supervised learning, based on variational autoencoders. Their models can be trained with the variational EM algorithm, stochastic gradient variational Bayes, or stochastic backpropagation. Compared with the Ladder network, an interesting point is that the variational autoencoder computes the posterior estimate of the latent variables with the encoder alone, while the Ladder network uses the decoder too to compute an implicit posterior approximation (the encoder provides the likelihood part which gets combined with the prior).\nZeiler et al. [31] train deep convolutional autoencoders in a manner comparable to ours. 
They define max-pooling operations in the encoder to feed the max function upwards to the next layer, while the argmax function is fed laterally to the decoder. The network is trained one layer at a time using a cost function that includes a pixel-level reconstruction error, and a regularization term to promote sparsity. Zhao et al. [24] use a similar structure and call it the stacked what-where autoencoder (SWWAE). Their network is trained simultaneously to minimize a combination of the supervised cost and reconstruction errors on each level, just like ours.\n\n6 Discussion\n\nWe showed how a simultaneous unsupervised learning task improves CNN and MLP networks, reaching the state of the art in various semi-supervised learning tasks. In particular, the performance obtained with very small numbers of labels is much better than previously published results, which shows that the method is capable of making good use of unsupervised learning. However, the same model also achieves state-of-the-art results and a significant improvement over the baseline model with full labels in permutation invariant MNIST classification, which suggests that the unsupervised task does not disturb supervised learning.\nThe proposed model is simple and easy to implement with many existing feedforward architectures, as the training is based on backpropagation from a simple cost function. It is quick to train and the convergence is fast, thanks to batch normalization.\nNot surprisingly, the largest improvements in performance were observed in models which have a large number of parameters relative to the number of available labeled samples. With CIFAR-10, we started with a model which was originally developed for a fully supervised task. 
This has the benefit of building on existing experience, but it may well be that the best results will be obtained with models which have far more parameters than fully supervised approaches could handle.\nAn obvious future line of research will therefore be to study what kinds of encoders and decoders are best suited for the Ladder network. In this work, we made very few modifications to the encoders, whose structure has been optimized for supervised learning, and we designed the parametrization of the vertical mappings of the decoder to mirror the encoder: the flow of information is just reversed. There is nothing preventing the decoder from having a different structure than the encoder.\nAn interesting future line of research will be the extension of Ladder networks to the temporal domain. While there exist datasets with millions of labeled samples for still images, it is prohibitively costly to label thousands of hours of video streams. Ladder networks can be scaled up easily and therefore offer an attractive approach for semi-supervised learning in such large-scale problems.\n\nAcknowledgements\n\nWe have received comments and help from a number of colleagues who would all deserve to be mentioned but we wish to thank especially Yann LeCun, Diederik Kingma, Aaron Courville, Ian Goodfellow, Søren Sønderby, Jim Fan and Hugo Larochelle for their helpful comments and suggestions. The software for the simulations for this paper was based on Theano [32] and Blocks [33]. We also acknowledge the computational resources provided by the Aalto Science-IT project. The Academy of Finland has supported Tapani Raiko.\n\nReferences\n[1] Harri Valpola. From neural PCA to deep unsupervised learning. In Adv. in Independent Component Analysis and Learning Machines, pages 143–171. Elsevier, 2015. arXiv:1411.7783.\n[2] Steven C Suddarth and YL Kergosien. 
Rule-injection hints as a means of improving network performance and learning time. In Proceedings of the EURASIP Workshop 1990 on Neural Networks, pages 120–129. Springer, 1990.\n[3] Marc'Aurelio Ranzato and Martin Szummer. Semi-supervised learning of compact document representations with deep networks. In Proc. of ICML 2008, pages 792–799. ACM, 2008.\n[4] Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 766–774, 2014.\n[5] Ian Goodfellow, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Multi-prediction deep Boltzmann machines. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 548–556, 2013.\n[6] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.\n[7] Antti Rasmus, Tapani Raiko, and Harri Valpola. Denoising autoencoder with modulated lateral connections learns invariant representations of natural images. arXiv:1412.7210, 2015.\n[8] Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund, and Tapani Raiko. Semi-supervised learning with ladder networks. arXiv preprint arXiv:1507.02672, 2015.\n[9] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 899–907, 2013.\n[10] Jocelyn Sietsma and Robert JF Dow. Creating artificial neural networks that generalize. Neural Networks, 4(1):67–79, 1991.\n[11] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. 
JMLR, 11:3371–3408, 2010.

[12] Jaakko Särelä and Harri Valpola. Denoising source separation. JMLR, 6:233–272, 2005.

[13] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456, 2015.

[14] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In the International Conference on Learning Representations (ICLR 2015), San Diego, 2015. arXiv:1412.6980.

[15] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.

[16] Salah Rifai, Yann N Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. The manifold tangent classifier. In Advances in Neural Information Processing Systems 24 (NIPS 2011), pages 2294–2302, 2011.

[17] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML 2013, 2013.

[18] Nikolaos Pitelis, Chris Russell, and Lourdes Agapito. Semi-supervised learning using an unsupervised atlas. In Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2014), pages 565–580. Springer, 2014.

[19] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 3581–3589, 2014.

[20] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.

[21] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples.
In the International Conference on Learning Representations (ICLR 2015), 2015. arXiv:1412.6572.

[22] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing by virtual adversarial examples. arXiv:1507.00677, 2015.

[23] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. Striving for simplicity: The all convolutional net. arXiv:1412.6806, 2014.

[24] Junbo Zhao, Michael Mathieu, Ross Goroshin, and Yann LeCun. Stacked what-where auto-encoders. arXiv:1506.02351, 2015.

[25] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.

[26] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Proc. of ICML 2013, 2013.

[27] Ian Goodfellow, Yoshua Bengio, and Aaron C Courville. Large-scale feature learning with spike-and-slab sparse coding. In Proc. of ICML 2012, pages 1439–1446, 2012.

[28] G. McLachlan. Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. J. American Statistical Association, 70:365–369, 1975.

[29] D. Titterington, A. Smith, and U. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley Series in Probability and Mathematical Statistics. Wiley, 1985.

[30] Martin Szummer and Tommi Jaakkola. Partially labeled classification with Markov random walks. Advances in Neural Information Processing Systems 15 (NIPS 2002), 14:945–952, 2003.

[31] Matthew D Zeiler, Graham W Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV 2011, pages 2018–2025. IEEE, 2011.

[32] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J.
Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

[33] Bart van Merriënboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio. Blocks and fuel: Frameworks for deep learning. CoRR, abs/1506.00619, 2015. URL http://arxiv.org/abs/1506.00619.