{"title": "Recurrent Ladder Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6009, "page_last": 6019, "abstract": "We propose a recurrent extension of the Ladder networks whose structure is motivated by the inference required in hierarchical latent variable models. We demonstrate that the recurrent Ladder is able to handle a wide variety of complex learning tasks that benefit from iterative inference and temporal modeling. The architecture shows close-to-optimal results on temporal modeling of video data, competitive results on music modeling, and improved perceptual grouping based on higher order abstractions, such as stochastic textures and motion cues. We present results for fully supervised, semi-supervised, and unsupervised tasks. The results suggest that the proposed architecture and principles are powerful tools for learning a hierarchy of abstractions, learning iterative inference and handling temporal information.", "full_text": "Recurrent Ladder Networks\n\nIsabeau Pr\u00e9mont-Schwarz, Alexander Ilin, Tele Hotloo Hao,\n\nAntti Rasmus, Rinu Boney, Harri Valpola\n\n{isabeau,alexilin,hotloo,antti,rinu,harri}@cai.fi\n\nThe Curious AI Company\n\nAbstract\n\nWe propose a recurrent extension of the Ladder networks [22] whose structure\nis motivated by the inference required in hierarchical latent variable models. We\ndemonstrate that the recurrent Ladder is able to handle a wide variety of complex\nlearning tasks that bene\ufb01t from iterative inference and temporal modeling. The\narchitecture shows close-to-optimal results on temporal modeling of video data,\ncompetitive results on music modeling, and improved perceptual grouping based\non higher order abstractions, such as stochastic textures and motion cues. We\npresent results for fully supervised, semi-supervised, and unsupervised tasks. 
The\nresults suggest that the proposed architecture and principles are powerful tools\nfor learning a hierarchy of abstractions, learning iterative inference and handling\ntemporal information.\n\n1\n\nIntroduction\n\nMany cognitive tasks require learning useful representations on multiple abstraction levels. Hier-\narchical latent variable models are an appealing approach for learning a hierarchy of abstractions.\nThe classical way of learning such models is by postulating an explicit parametric model for the\ndistributions of random variables. The inference procedure, which evaluates the posterior distribution\nof the unknown variables, is then derived from the model \u2013 an approach adopted in probabilistic\ngraphical models (see, e.g., [5]).\nThe success of deep learning can, however, be explained by the fact that popular deep models focus\non learning the inference procedure directly. For example, a deep classi\ufb01er like AlexNet [19] is\ntrained to produce the posterior probability of the label for a given data sample. The representations\nthat the network computes at different layers are related to the inference in an implicit latent variable\nmodel but the designer of the model does not need to know about them.\nHowever, it is actually tremendously valuable to understand what kind of inference is required by\ndifferent types of probabilistic models in order to design an ef\ufb01cient network architecture. Ladder\nnetworks [22, 28] are motivated by the inference required in a hierarchical latent variable model. By\ndesign, the Ladder networks aim to emulate a message passing algorithm, which includes a bottom-up\npass (from input to label in classi\ufb01cation tasks) and a top-down pass of information (from label to\ninput). 
The results of the bottom-up and top-down computations are combined in a carefully selected manner.\nThe original Ladder network implements only one iteration of the inference algorithm, but complex models are likely to require iterative inference. In this paper, we propose a recurrent extension of the Ladder network for iterative inference and show that the same architecture can be used for temporal modeling. We also show how to use the proposed architecture as an inference engine in more complex models which can handle multiple independent objects in the sensory input. Thus, the proposed architecture is suitable for the type of inference required by rich models: those that can learn a hierarchy of abstractions, can handle temporal information and can model multiple objects in the input.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n[Figure 1 diagrams omitted: (a) the RLadder computational graph over layers l, l+1 with encoder states s_l, encoder outputs e_l and decoder outputs d_l at iterations t-1 and t; (b) a static hierarchical latent variable model over x~, x, y, z; (c) a fragment of a temporal model over x_t, y_t, z_t.]\n\nFigure 1: (a): The structure of the Recurrent Ladder networks. The encoder is shown in red, the decoder is shown in blue, the decoder-to-encoder connections are shown in green. The dashed line separates two iterations t-1 and t. (b)-(c): The type of hierarchical latent variable models for which RLadder is designed to emulate message passing. (b): A graph of a static model. (c): A fragment of a graph of a temporal model. White circles are unobserved latent variables, gray circles represent observed variables. The arrows represent the directions of message passing during inference.\n\n2 Recurrent Ladder\n\nRecurrent Ladder networks\n\nIn this paper, we present a recurrent extension of the Ladder networks which is conducive to iterative inference and temporal modeling.
Recurrent Ladder (RLadder) is a recurrent neural network whose units resemble the structure of the original Ladder networks [22, 28] (see Fig. 1a). At every iteration t, the information first flows from the bottom (the input level) to the top through a stack of encoder cells. Then, the information flows back from the top to the bottom through a stack of decoder cells. Both the encoder and decoder cells also use the information that is propagated horizontally. Thus, at every iteration t, an encoder cell in the l-th layer receives three inputs: 1) the output e_{l-1}(t) of the encoder cell from the level below, 2) the output d_l(t-1) of the decoder cell from the same level from the previous iteration, 3) the encoder state s_l(t-1) from the same level from the previous iteration. It updates its state value s_l(t) and passes the same output e_l(t) both vertically and horizontally:\n\ns_l(t) = f_{s,l}(e_{l-1}(t), d_l(t-1), s_l(t-1))   (1)\ne_l(t) = f_{e,l}(e_{l-1}(t), d_l(t-1), s_l(t-1)).   (2)\n\nThe encoder cell in the bottom layer typically sends observed data (possibly corrupted by noise) as its output e_1(t). Each decoder cell is stateless: it receives two inputs (the output of the decoder cell from one level above and the output of the encoder cell from the same level) and produces one output\n\nd_l(t) = g_l(e_l(t), d_{l+1}(t)),   (3)\n\nwhich is passed both vertically and horizontally. The exact computations performed in the cells can be tuned depending on the task at hand. In practice, we have used LSTM [15] or GRU [8] cells in the encoder and cells inspired by the original Ladder networks in the decoder (see Appendix A).\nSimilarly to Ladder networks, the RLadder is usually trained with multiple tasks at different abstraction levels. Tasks at the highest abstraction level (like classification) are typically formulated at the highest layer. Conversely, the output of the decoder cell in the bottom level is used to formulate a low-level task which corresponds to abstractions close to the input. The low-level task can be denoising (reconstruction of a clean input from the corrupted one); other possibilities include object detection [21], segmentation [3, 23] or, in a temporal setting, prediction. A weighted sum of the costs at different levels is optimized during training.\n\nConnection to hierarchical latent variables and message passing\n\nThe RLadder architecture is designed to mimic the computational structure of an inference procedure in probabilistic hierarchical latent variable models. In an explicit probabilistic graphical model, inference can be done by an algorithm which propagates information (messages) between the nodes of a graphical model so as to compute the posterior distribution of the latent variables (see, e.g., [5]).\nFor static graphical models implicitly assumed by the RLadder (see Fig. 1b), messages need to be propagated from the input level up the hierarchy to the highest level and from the top to the bottom, as shown in Fig. 1a. In Appendix B, we present a derived iterative inference procedure for a simple static hierarchical model to give an example of a message-passing algorithm. We also show how that inference procedure can be implemented in the RLadder computational graph.\nIn the case of temporal modeling, the type of a graphical model assumed by the RLadder is shown in Fig. 1c. If the task is to do next step prediction of observations x, an online inference procedure should update the knowledge about the latent variables y_t, z_t using observed data x_t and compute the predictive distributions for the input x_{t+1}.
Assuming that the distributions of the latent variables at previous time instances (tau < t) are kept fixed, the inference can be done by propagating messages from the observed variables x_t and the latent variables y, z bottom-up, top-down and from the past to the future, as shown in Fig. 1c. The architecture of the RLadder (Fig. 1a) is designed so as to emulate such a message-passing procedure; that is, the information can propagate in all the required directions: bottom-up, top-down and from the past to the future. In Appendix C, we present an example of the message-passing algorithm derived for a temporal hierarchical model to show how it is related to the RLadder's computation graph.\nEven though the motivation of the RLadder architecture is to emulate a message-passing procedure, the nodes of the RLadder do not directly correspond to nodes of any specific graphical model.1 The RLadder directly learns an inference procedure and the corresponding model is never formulated explicitly. Note also that using stateful encoder cells is not strictly motivated by the message-passing argument, but in practice these skip connections facilitate training of a deep network.\nAs we mentioned previously, the RLadder is usually trained with multiple tasks formulated at different representation levels. The purpose of the tasks is to encourage the RLadder to learn the right inference procedure, and hence formulating the right kind of tasks is crucial for the success of training. For example, the task of denoising encourages the network to learn important aspects of the data distribution [1, 2]. For temporal modeling, the task of next step prediction plays a similar role.
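To make the computational structure concrete, one full RLadder iteration (the cell updates (1)-(3)) can be sketched in code. This is only an illustrative toy: simple dense GRU-like encoder cells and tanh decoder cells stand in for the LSTM/GRU and Ladder-style cells of Appendix A, and layer widths, gating and initialization are assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(W, b, *inputs):
    """Affine map applied to the concatenation of the inputs."""
    return W @ np.concatenate(inputs) + b

class RLadderSketch:
    """Toy RLadder with L layers of equal width. Encoder cells are stateful
    and GRU-like; decoder cells are stateless gated combinations of
    e_l(t) and d_{l+1}(t) -- simplified stand-ins for the cells in the paper."""

    def __init__(self, num_layers, width):
        self.L, self.n = num_layers, width
        k = lambda m: rng.normal(0, 0.1, size=(width, m * width))
        # per-layer encoder params over the concatenation [e_{l-1}, d_l, s_l]
        self.Wz = [k(3) for _ in range(num_layers)]
        self.Wh = [k(3) for _ in range(num_layers)]
        # per-layer decoder params over [e_l, d_{l+1}]
        self.Wd = [k(2) for _ in range(num_layers)]
        self.b = np.zeros(width)

    def step(self, x, s_prev, d_prev):
        """One iteration t: bottom-up encoder pass, then top-down decoder pass."""
        e, s = [], []
        below = x                                   # e_0(t): observed data
        for l in range(self.L):                     # bottom-up
            z = 1 / (1 + np.exp(-dense(self.Wz[l], self.b,
                                       below, d_prev[l], s_prev[l])))
            h = np.tanh(dense(self.Wh[l], self.b, below, d_prev[l], s_prev[l]))
            s_l = (1 - z) * s_prev[l] + z * h       # s_l(t), cf. eq. (1)
            s.append(s_l)
            e.append(s_l)                           # here e_l(t) = s_l(t), cf. eq. (2)
            below = s_l
        d = [None] * self.L
        above = np.zeros(self.n)                    # d_{L+1}(t) := 0
        for l in reversed(range(self.L)):           # top-down
            d[l] = np.tanh(dense(self.Wd[l], self.b, e[l], above))  # cf. eq. (3)
            above = d[l]
        return s, d

net = RLadderSketch(num_layers=3, width=8)
s = [np.zeros(8)] * 3
d = [np.zeros(8)] * 3
for t in range(5):                                  # iterative inference, 5 iterations
    s, d = net.step(rng.normal(size=8), s, d)
```

Note how each encoder cell consumes the decoder output of the previous iteration (the green decoder-to-encoder links), so top-down expectations can influence the next bottom-up pass.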
The RLadder is most useful in problems that require accurate inference on multiple abstraction levels, which is supported by the experiments presented in this paper.\n\nRelated work\n\nThe RLadder architecture is similar to that of other recently proposed models for temporal modeling [10, 11, 9, 27, 20]. In [9], the recurrent connections (from time t-1 to time t) are placed in the lateral links between the encoder and the decoder. This can make it easier to extend an existing feed-forward network architecture to the case of temporal data, as the recurrent units do not participate in the bottom-up computations. On the other hand, the recurrent units do not receive information from the top, which makes it impossible for higher layers to influence the dynamics of lower layers. The architectures in [10, 11, 27] are quite similar to ours but they could potentially derive further benefit from the decoder-to-encoder connections between successive time instances (green links in Fig. 1a). The aforementioned connections are well justified from the message-passing point of view: when updating the posterior distribution of a latent variable, one should combine the latest information from the top and from the bottom, and it is the decoder that contains the latest information from the top. We show empirical evidence for the importance of those connections in Section 3.1.\n\n3 Experiments with temporal data\n\nIn this section, we demonstrate that the RLadder can learn an accurate inference algorithm in tasks that require temporal modeling. We consider datasets in which passing information both in time and in the abstraction hierarchy is important for achieving good performance.\n\n3.1 Occluded Moving MNIST\n\nWe use a dataset where we know how to do optimal inference in order to be able to compare the results of the RLadder to the optimal ones.
To this end we designed the Occluded Moving MNIST dataset. It consists of MNIST digits downscaled to 14 x 14 pixels flying on a 32 x 32 white background with white vertical and horizontal occlusion bars (4 pixels in width, and spaced 8 visible pixels apart) which, when the digit flies behind them, occlude the pixels of the digit (see Fig. 2). We also restrict the velocities to be randomly chosen from the set of eight discrete velocities {(+-1, +-2), (+-2, +-1)} pixels/frame, so that apart from the bouncing, the movement is deterministic. The digits are split into training, validation, and test sets according to the original MNIST split. The primary task is then to classify the digit, which is only partially observable at any given moment, at the end of five time steps.\n\n1To emphasize this, we used different shapes for the nodes of the RLadder network (Fig. 1a) and the nodes of graphical models that inspired the RLadder architecture (Figs. 1b-c).\n\n[Figure 2 panels omitted: frames t = 1 to t = 5; rows: observed frames; frames with occlusion visualized; optimal temporal reconstruction.]\n\nFigure 2: The Occluded Moving MNIST dataset. Bottom row: Optimal temporal reconstruction for a sequence of occluded frames from the dataset.\n\nIn order to do optimal classification, one would need to assimilate information about the digit identity (which is only partially visible at any given time instance) by keeping track of the observed pixels (see the bottom row of Fig. 2) and then feeding the resultant reconstruction to a classifier.\nIn order to encourage optimal inference, we add a next step prediction task to the RLadder at the bottom of the decoder: the RLadder is trained to predict the next occluded frame, that is, the network never sees the un-occluded digit. This mimics a realistic scenario where the ground truth is not known.
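A minimal sketch of the frame generation follows. The digit size, occluder layout and velocity set are taken from the description above; everything else (the random blob standing in for a real MNIST digit, treating occluded pixels as zero intensity, the bouncing rule) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Velocities from the set {(+-1, +-2), (+-2, +-1)} pixels/frame (8 choices).
VELOCITIES = [(sy * a, sx * b) for a, b in [(1, 2), (2, 1)]
              for sy in (-1, 1) for sx in (-1, 1)]

def occlusion_mask(size=32, bar=4, gap=8):
    """True where an occluding bar hides the digit:
    4-px vertical and horizontal bars spaced 8 visible pixels apart."""
    idx = np.arange(size) % (bar + gap)
    line = idx < bar
    return line[:, None] | line[None, :]

def make_sequence(digit, steps=5, size=32):
    """Move a 14x14 digit bitmap over a size x size canvas, occluding each frame."""
    h, w = digit.shape
    y, x = rng.integers(0, size - h), rng.integers(0, size - w)
    vy, vx = VELOCITIES[rng.integers(len(VELOCITIES))]
    mask = occlusion_mask(size)
    frames = []
    for _ in range(steps):
        canvas = np.zeros((size, size))
        canvas[y:y + h, x:x + w] = digit
        canvas[mask] = 0.0              # bars hide whatever is behind them
        frames.append(canvas)
        # bounce off the canvas borders; otherwise the motion is deterministic
        if not (0 <= y + vy <= size - h):
            vy = -vy
        if not (0 <= x + vx <= size - w):
            vx = -vx
        y, x = y + vy, x + vx
    return np.stack(frames)

seq = make_sequence(rng.random((14, 14)), steps=5)
```

Because the bars are fixed and motion is deterministic apart from bounces, an optimal observer can accumulate the visible pixels over the five frames, which is exactly the temporal reconstruction the comparison below relies on.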
To assess the importance of the features of the RLadder, we also do an ablation study.\nIn addition, we compare it to three other networks. In the \ufb01rst comparison network, the optimal\nreconstruction of the digit from the \ufb01ve frames (as shown in Fig. 2) is fed to a static feed-forward\nnetwork from which the encoder of the RLadder was derived. This is our gold standard, and obtaining\nsimilar results to it implies doing close to optimal temporal inference. The second, a temporal baseline,\nis a deep feed-forward network (the one on which the encoder is based) with a recurrent neural\nnetwork (RNN) at the top only so that, by design the network can propagate temporal information\nonly at a high level, and not at a low level. The third, a hierarchical RNN, is a stack of convolutional\nLSTM units with a few convolutional layers in between, which is the RLadder amputated of its\ndecoder. See Fig. 3 and Appendix D.1 for schematics and details of the architectures.\n\nFully supervised learning results. The results are presented in Table 1. The \ufb01rst thing to notice\nis that the RLadder reaches (up to uncertainty levels) the classi\ufb01cation accuracy obtained by the\nnetwork which was given the optimal reconstruction of the digit. Furthermore, if the RLadder does\nnot have a decoder or the decoder-to-encoder connections, or if it is trained without the auxiliary\nprediction task, we see the classi\ufb01cation error rise almost to the level of the temporal baseline. 
This means that even if a network has RNNs at the lowest levels (like the encoder-only hierarchical RNN), or if it does not have a task which encourages it to develop a good world model (like the RLadder without the next-frame prediction task), or if the information cannot travel from the decoder to the encoder, the high-level task cannot truly benefit from lower-level temporal modeling.\n\n[Figure 3 diagrams omitted: the three architectures processing frames x_{t-1}, x_t; the RLadder additionally outputs predictions of x_t and x_{t+1}.]\n\nFigure 3: Architectures used for modeling occluded Moving MNIST. Temporal baseline network is a convolutional network with a fully connected RNN on top.\n\nTable 1: Performance on Occluded Moving MNIST\n\nModel | Classification error (%) | Prediction error, x10^-5\nOptimal reconstruction and static classifier | 0.71 +- 0.03 | -\nTemporal baseline | 2.02 +- 0.16 | -\nHierarchical RNN (encoder only) | 1.60 +- 0.05 | -\nRLadder w/o prediction task | 1.51 +- 0.21 | -\nRLadder w/o decoder-to-encoder conn. | 1.24 +- 0.05 | 156.7 +- 0.4\nRLadder w/o classification task | - | 155.2 +- 2.5\nRLadder | 0.74 +- 0.09 | 150.1 +- 0.1\n\nNext, one notices from Table 1 that the top-level classification cost helps the low-level prediction cost in the RLadder (which in turn helps the top-level cost in a mutually beneficial cycle). This mutually supportive relationship between high-level and low-level inferences is nicely illustrated by the example in Fig. 4. Up until time step t = 3 inclusively, the network believes the digit to be a five (Fig. 4a). As such, at t = 3, the network predicts that the top right part of the five which has been occluded so far will stick out from behind the occlusions as the digit moves up and right at the next time step (Fig. 4b).
Using the decoder-to-encoder connections, the decoder can relay this expectation to the encoder at t = 4. At t = 4, the encoder can compare this expectation with the actual input, where the top right part of the five is absent (Fig. 4c). Without the decoder-to-encoder connections this comparison would have been impossible. Using the upward path of the encoder, the network can relay this discrepancy to the higher classification layers. These higher layers with a large receptive field can then conclude that since it is not a five, it must be a three (Fig. 4d). Now, thanks to the decoder, the higher classification layers can relay this information to the lower prediction layers so that they can change their prediction of what will be seen at t = 5 appropriately (Fig. 4e). Without a decoder which would bring this high-level information back down to the low level, this drastic update of the prediction would be impossible. With this information the lower prediction layer can now predict that the top-left part of the three (which it has never seen before) will appear at the next time step from behind the occlusion, which is indeed what happens at t = 5 (Fig. 4f).\n\nSemi-supervised learning results. In the following experiment, we test the RLadder in the semi-supervised scenario where the training data set contains 1,000 labeled sequences and 59,000 unlabeled ones. To make use of the unlabeled data, we added an extra auxiliary task at the top level, which was the consistency cost with the targets provided by the Mean Teacher (MT) model [26]. Thus, the RLadder was trained with three tasks: 1) next step prediction at the bottom, 2) classification at the top, 3) consistency with the MT outputs at the top. As shown in Table 2, the RLadder improves dramatically by learning a better model with the help of unlabeled data, independently of and in addition to other semi-supervised learning methods.
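The Mean Teacher consistency target can be sketched as follows. This is a generic EMA-of-weights teacher in the spirit of [26], not the exact setup used here: the linear "network", the 0.99 decay and the squared-error consistency cost are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class MeanTeacher:
    """Teacher weights are an exponential moving average (EMA) of student
    weights; the consistency cost pulls student predictions toward teacher
    predictions and needs no labels, so unlabeled data can be used."""

    def __init__(self, dim, num_classes, decay=0.99):
        self.student = rng.normal(0, 0.1, size=(num_classes, dim))
        self.teacher = self.student.copy()
        self.decay = decay

    @staticmethod
    def predict(W, x):
        """Softmax over linear logits (a stand-in for the real classifier)."""
        z = W @ x
        z -= z.max()
        p = np.exp(z)
        return p / p.sum()

    def consistency_cost(self, x):
        """Mean squared error between student and teacher class probabilities.
        In training, gradients would flow only into the student."""
        p_s = self.predict(self.student, x)
        p_t = self.predict(self.teacher, x)   # target: treated as a constant
        return np.mean((p_s - p_t) ** 2)

    def update_teacher(self):
        """After each student update, move the teacher a small step toward it."""
        self.teacher = self.decay * self.teacher + (1 - self.decay) * self.student

mt = MeanTeacher(dim=16, num_classes=10)
x = rng.normal(size=16)
cost0 = mt.consistency_cost(x)   # exactly 0 at init: teacher == student
mt.student[0] += 0.1             # pretend a gradient step changed the student
mt.update_teacher()
cost1 = mt.consistency_cost(x)   # now non-zero until the EMA catches up
```

The consistency cost is simply added (with a weight) to the prediction and classification costs, which is how the three tasks above are combined.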
The temporal baseline model also improves the classification accuracy by using the consistency cost, but it is clearly outperformed by the RLadder.\n\n3.2 Polyphonic Music Dataset\n\nIn this section, we evaluate the RLadder on the MIDI dataset converted to piano rolls [6]. The dataset consists of piano rolls (the notes played at every time step, where a time step is, in this case, an eighth note) of various piano pieces. We train an 18-layer RLadder containing five convolutional LSTMs and one fully-connected LSTM. More details can be found in Appendix D.2. Table 3 shows the\n\n[Figure 4 panels omitted: frames t = 1 to t = 5 with red annotations a-f; rows: ground-truth unoccluded digits; observed frames; predicted frames; probe of internal representations.]\n\nFigure 4: Example prediction of an RLadder on the occluded moving MNIST dataset. First row: the ground truth of the digit, which the network never sees and does not train on. Second row: the actual five frames seen by the network and on which it trains. Third row: the predicted next frames of a trained RLadder. Fourth row: a stopped-gradient (gradient does not flow into the RLadder) readout of the bottom layer of the decoder trained on the ground truth to probe what aspects of the digit are represented by the neurons which predict the next frame. Notice how at t = 1, the network does not yet know in which direction the digit will move and so it predicts a superposition of possible movements.
Notice further (red annotations a-f) that until t = 3, the network thought the digit was a five, but when the top bar of the supposed five did not materialize on the other side of the occlusion as expected at t = 4, the network immediately concluded correctly that it was actually a three.\n\nTable 2: Classification error (%) on semi-supervised Occluded Moving MNIST\n\nModel | 1k labeled | 1k labeled & 59k unlabeled, w/o MT | 1k labeled & 59k unlabeled, MT\nOptimal reconstruction and static classifier | 3.50 +- 0.28 | 3.50 +- 0.28 | 1.34 +- 0.04\nTemporal baseline | 10.86 +- 0.43 | 10.86 +- 0.43 | 3.14 +- 0.16\nRLadder | 10.49 +- 0.81 | 5.20 +- 0.77 | 1.69 +- 0.14\n\nnegative log-likelihoods of the next-step prediction obtained on the music dataset, where our results are reported as mean plus or minus standard deviation over 10 seeds. We see that the RLadder is competitive with the best results, and gives the best results amongst models outputting the marginal distribution of notes at each time step.\nThe fact that the RLadder did not beat [16] on the MIDI datasets shows one of the limitations of the RLadder. Most of the models in Table 3 output a joint probability distribution of notes, unlike the RLadder, which outputs the marginal probability for each note. That is to say, those models, to output the probability of a note, take as input the notes at previous time instances, but also the ground truth of the notes to the left at the same time instance. The RLadder does not do that: it only takes as input the past notes played. Even so, the example in Section 3.1, where the digit five turned into a three after only one expected dot failed to appear, shows that internally the RLadder models the joint distribution.\n\n4 Experiments with perceptual grouping\n\nIn this section, we show that the RLadder can be used as an inference engine in a complex model which benefits from iterative inference and temporal modeling.
We consider the task of perceptual grouping, that is, identifying which parts of the sensory input belong to the same higher-level perceptual components (objects).\n\nTable 3: Negative log-likelihood (smaller is better) on the polyphonic music datasets\n\nModel | Piano-midi.de | Nottingham | Muse | JSB Chorales\nModels outputting a joint distribution of notes:\nNADE masked [4] | 7.42 | 3.32 | 6.48 | 8.51\nNADE [4] | 7.05 | 2.89 | 5.54 | 7.59\nRNN-RBM [6] | 7.09 | 2.39 | 6.01 | 6.27\nRNN-NADE (HF) [6] | 7.05 | 2.31 | 5.60 | 5.56\nLSTM-NADE [16] | 7.39 | 2.06 | 5.03 | 6.10\nTP-LSTM-NADE [16] | 5.49 | 1.64 | 4.34 | 5.92\nBALSTM [16] | 5.00 | 1.62 | 3.90 | 5.86\nModels outputting marginal probabilities for each note:\nRNN [4] | 7.88 | 3.87 | 7.43 | 8.76\nLSTM [17] | 6.866 | 3.492 | - | -\nMUT1 [17] | 6.792 | 3.254 | - | -\nRLadder | 6.19 +- 0.02 | 2.42 +- 0.03 | 5.69 +- 0.02 | 5.64 +- 0.02\n\nWe enhance the previously developed model for perceptual grouping called Tagger [13] by replacing the originally used Ladder engine with the RLadder. For another perspective on the problem, see [14], which also extends Tagger to a recurrent neural network, but does so from an expectation maximization point of view.\n\n4.1 Recurrent Tagger\n\nTagger is a model designed for perceptual grouping. When applied to images, the modeling assumption is that each pixel x~_i belongs to one of the K objects, which is described by binary variables z_{i,k}: z_{i,k} = 1 if pixel i belongs to object k and z_{i,k} = 0 otherwise. The reconstruction of the whole image using object k only is mu_k, which is a vector with as many elements mu_{i,k} as there are pixels.
Thus, the assumed probabilistic model can be written as follows:\n\np(x~, mu, z, h) = [ prod_{i,k} N(x~_i | mu_{i,k}, sigma_k^2)^{z_{i,k}} ] prod_{k=1}^{K} p(z_k, mu_k | h_k) p(h_k),   (4)\n\nwhere z_k is a vector of elements z_{i,k} and h_k is (a hierarchy of) latent variables which define the shape and the texture of the objects. See Fig. 5a for a graphical representation of the model and Fig. 5b for possible values of model variables for the textured MNIST dataset used in the experiments of Section 4.2. The model in (4) is defined for the noisy image x~ because Tagger is trained with an auxiliary low-level task of denoising. The inference procedure in model (4) should evaluate the posterior distributions of the latent variables z_k, mu_k, h_k for each of the K groups given corrupted data x~. Making the approximation that the variables of each of the K groups are independent a posteriori,\n\np(z, mu, h | x~) ~= prod_k q(z_k, mu_k, h_k),   (5)\n\nthe inference procedure could be implemented by iteratively updating each of the K approximate distributions q(z_k, mu_k, h_k), if the model (4) and the approximation (5) were defined explicitly.\nTagger does not explicitly define a probabilistic model (4) but learns the inference procedure directly. The iterative inference procedure is implemented by a computational graph with K copies of the same Ladder network, each doing inference for one of the groups (see Fig. 5c). At the end of every iteration, the inference procedure produces the posterior probabilities pi_{i,k} that pixel i belongs to object k and the point estimates of the reconstructions mu_k (see Fig. 5c). Those outputs are used to form the low-level cost and the inputs for the next iteration (see more details in [13]). In this paper, we replace the original Ladder engine of Tagger with the RLadder.
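For intuition, under a pixel-wise Gaussian mixture of the form (4), the per-pixel group responsibilities pi_{i,k} have a closed form given point estimates of mu_k and sigma_k. The sketch below assumes a uniform prior over groups for illustration; the learned inference in Tagger/RTagger is, of course, not this explicit computation.

```python
import numpy as np

def group_posteriors(x, mu, sigma):
    """pi[i, k]: posterior probability that pixel i belongs to group k,
    under per-group Gaussians N(x_i | mu[k, i], sigma[k]^2) and a uniform
    prior over groups (an illustrative simplification)."""
    # log N(x_i | mu_{k,i}, sigma_k^2), shape (K, num_pixels)
    log_lik = (-0.5 * ((x[None, :] - mu) / sigma[:, None]) ** 2
               - np.log(sigma)[:, None] - 0.5 * np.log(2 * np.pi))
    log_lik -= log_lik.max(axis=0, keepdims=True)   # numerical stabilization
    pi = np.exp(log_lik)
    return (pi / pi.sum(axis=0)).T                  # shape (num_pixels, K)

# Two groups: a flat "background" and a brighter "object" reconstruction.
x = np.concatenate([np.zeros(50), np.ones(50)])     # a 100-pixel toy image
mu = np.stack([np.zeros(100), np.ones(100)])        # mu[k]: group k's reconstruction
sigma = np.array([0.3, 0.3])
pi = group_posteriors(x, mu, sigma)
# Pixels 0-49 get assigned to group 0, pixels 50-99 to group 1.
```

These pi_{i,k} are exactly the kind of soft pixel-to-group assignments the inference engine produces at the end of every iteration.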
We refer to the new model as RTagger.\n\n4.2 Experiments on grouping using texture information\n\nThe goal of the following experiment is to test the efficiency of the RTagger in grouping objects using texture information. To this end, we created a dataset that contains thickened MNIST digits with 20 textures from the Brodatz dataset [7]. An example of a generated image is shown in Fig. 6a. To create a greater diversity of textures (to avoid over-fitting), we randomly rotated and scaled the 20 Brodatz textures when producing the training data.\n\n[Figure 5 diagrams omitted: (a) the graphical model over h_k, mu_k, z_k, x~, x with plate notation over the K groups; (b) example values of the model variables for the textured MNIST dataset; (c) the RTagger computational graph producing pi, mu over two iterations.]\n\nFigure 5: (a): Graphical model for perceptual grouping. White circles are unobserved latent variables, gray circles represent observed variables. (b): Examples of possible values of model variables for the textured MNIST dataset. (c): Computational graph that implements iterative inference in the perceptual grouping task (RTagger). Two graph iterations are drawn. The plate notation represents K copies of the same graph.\n\nFigure 6: (a): Example image from the Brodatz-textured MNIST dataset. (b): The image reconstruction by the group that learned the background. (c): The image reconstruction by the group that learned the digit. (d): The original image colored using the found grouping pi_k.\n\nThe network trained on the textured MNIST dataset has the architecture presented in Fig. 5c with three iterations. The number of groups was set to K = 3. The details of the RLadder architecture are presented in Appendix D.3. The network was trained on two tasks: the low-level segmentation task was formulated around denoising, the same way as in the Tagger model [13].
The top-level cost was the log-likelihood of the digit class at the last iteration.\nTable 4 presents the obtained performance on the textured MNIST dataset in both fully supervised and semi-supervised settings. All experiments were run over 5 seeds. We report our results as mean plus or minus standard deviation. In some runs, the Tagger experiments did not converge to a reasonable solution (because of unstable or too slow convergence), so we did not include those runs in our evaluations. Following [13], the segmentation accuracy was computed using the adjusted mutual information (AMI) score [29], which is the mutual information between the ground truth segmentation and the estimated segmentation pi_k, scaled to give one when the segmentations are identical and zero when the output segmentation is random.\nFor comparison, we trained the Tagger model [13] on the same dataset. The other comparison method was a feed-forward convolutional network which had an architecture resembling the bottom-up pass (encoder) of the RLadder and which was trained on the classification task only. One thing to notice is that the results obtained with the RTagger clearly improve over iterations, which supports the idea that iterative inference is useful in complex cognitive tasks. We also observe that the RTagger outperforms Tagger, and both approaches significantly outperform the convolutional network baseline, in which the classification task is not supported by the input-level task. We have also observed that the top-level classification task makes the RTagger faster to train in terms of the number of updates, which also supports the idea that the high-level and low-level tasks mutually benefit from each other: detecting object\n\nTable 4: Results on the Brodatz-textured MNIST. The i-th column corresponds to the intermediate results of the RTagger after the i-th iteration.
In the fully supervised case, Tagger was only trained successfully in 2 of the 5 seeds; the given results are for those 2 seeds. In the semi-supervised case, we were not able to train Tagger successfully.\n\n50k labeled:\nSegmentation accuracy (AMI) -- RTagger: 0.55 / 0.75 / 0.80 +- 0.01; Tagger: 0.73 +- 0.02\nClassification error (%) -- RTagger: 18.2 / 8.0 / 5.9 +- 0.2; Tagger: 12.15 +- 0.1; ConvNet: 14.3 +- 0.46\n\n1k labeled + 49k unlabeled:\nSegmentation accuracy (AMI) -- RTagger: 0.56 / 0.74 / 0.80 +- 0.03\nClassification error (%) -- RTagger: 63.8 / 28.2 / 22.6 +- 6.2; ConvNet: 88 +- 0.30\n\nboundaries using textures helps classify a digit, while knowing the class of the digit helps detect the object boundaries. Figs. 6b-d show the reconstructed textures and the segmentation results for the image from Fig. 6a.\n\nFigure 7: Example of segmentation and generation by the RTagger trained on Moving MNIST. First row: frames 0-9 are the input sequence, frames 10-15 are the ground truth future. Second row: next step prediction for frames 1-9 and future frame generation (frames 10-15) by the RTagger; the colors represent the grouping performed by the RTagger.\n\n4.3 Experiments on grouping using movement information\n\nThe same RTagger model can perform perceptual grouping in video sequences using motion cues. To demonstrate this, we applied the RTagger to the Moving MNIST [25]2 sequences of length 20, and the low-level task was prediction of the next frame. When applied to temporal data, the RTagger assumes the existence of K objects whose dynamics are independent of each other. Using this assumption, the RTagger can separate the two moving digits into different groups.
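Grouping quality of this kind is typically scored against a ground-truth segmentation with mutual-information-based measures such as AMI. As a minimal, self-contained sketch, the unadjusted normalized mutual information between two labelings can be computed as follows; this is a simplified stand-in for the AMI score (which additionally corrects for chance agreement).

```python
import numpy as np

def normalized_mutual_info(a, b):
    """NMI between two integer labelings of the same pixels: 1.0 for identical
    groupings (up to relabeling), ~0 for independent ones. A simplified,
    unadjusted stand-in for the adjusted mutual information (AMI) score."""
    a, b = np.asarray(a), np.asarray(b)
    n = a.size
    # contingency table of joint label counts
    cont = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(cont, (a, b), 1)
    p = cont / n
    pa, pb = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    mi = np.sum(p[nz] * np.log(p[nz] / np.outer(pa, pb)[nz]))
    ha = -np.sum(pa[pa > 0] * np.log(pa[pa > 0]))
    hb = -np.sum(pb[pb > 0] * np.log(pb[pb > 0]))
    return mi / max(np.sqrt(ha * hb), 1e-12)

seg = np.array([0, 0, 1, 1, 2, 2])
assert np.isclose(normalized_mutual_info(seg, (seg + 1) % 3), 1.0)  # relabeling
```

Because the score is invariant to relabeling of the groups, it does not matter which group index the model happens to assign to which digit.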
We assessed the segmentation quality by the AMI score, which was computed similarly to [13, 12]: the background was ignored in the case of a uniform zero-valued background, as were overlap regions where different objects have the same color. The achieved average AMI score was 0.75. An example of segmentation is shown in Fig. 7. When we tried to use Tagger on the same dataset, we were only able to train it successfully in a single seed out of three. This is possibly because speed is an intermediate level of abstraction not represented at the pixel level. Due to its recurrent connections, the RTagger can keep such representations from one time step to the next and segment accordingly, something more difficult for Tagger to do, which might explain the training instability.

5 Conclusions

In this paper, we presented recurrent Ladder networks. The proposed architecture is motivated by the computations required in a hierarchical latent variable model. We empirically validated that the recurrent Ladder is able to learn accurate inference in challenging tasks which require modeling dependencies on multiple abstraction levels, iterative inference and temporal modeling. The proposed model outperformed strong baseline methods on two challenging classification tasks. It also produced competitive results on a temporal music dataset. We envision that the proposed recurrent Ladder will be a powerful building block for solving difficult cognitive tasks.

2For this experiment, in order to have the ground truth segmentation, we reimplemented the dataset ourselves.

Acknowledgments

We would like to thank Klaus Greff and our colleagues from The Curious AI Company for their contribution to the presented work, especially Vikram Kamath and Matti Herranen.

References

[1] Alain, G., Bengio, Y., and Rifai, S. (2012). Regularized auto-encoders estimate local statistics. CoRR, abs/1211.4246.

[2] Arponen, H., Herranen, M., and Valpola, H. (2017).
On the exact relationship between the denoising function and the data distribution. arXiv preprint arXiv:1709.02797.

[3] Badrinarayanan, V., Kendall, A., and Cipolla, R. (2015). SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561.

[4] Berglund, M., Raiko, T., Honkala, M., Kärkkäinen, L., Vetek, A., and Karhunen, J. T. (2015). Bidirectional recurrent neural networks as generative models. In Advances in Neural Information Processing Systems.

[5] Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

[6] Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1159-1166.

[7] Brodatz, P. (1966). Textures: A Photographic Album for Artists and Designers. Dover Publications.

[8] Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

[9] Cricri, F., Honkala, M., Ni, X., Aksu, E., and Gabbouj, M. (2016). Video Ladder networks. arXiv preprint arXiv:1612.01756.

[10] Eyjolfsdottir, E., Branson, K., Yue, Y., and Perona, P. (2016). Learning recurrent representations for hierarchical behavior modeling. arXiv preprint arXiv:1611.00094.

[11] Finn, C., Goodfellow, I. J., and Levine, S. (2016). Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems 29.

[12] Greff, K., Srivastava, R. K., and Schmidhuber, J. (2015). Binding via reconstruction clustering. CoRR, abs/1511.06418.

[13] Greff, K., Rasmus, A., Berglund, M., Hao, T., Valpola, H., and Schmidhuber, J. (2016). Tagger: Deep unsupervised perceptual grouping. In Advances in Neural Information Processing Systems 29.

[14] Greff, K., van Steenkiste, S., and Schmidhuber, J. (2017). Neural expectation maximization. In ICLR Workshop.

[15] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.

[16] Johnson, D. D. (2017). Generating polyphonic music using tied parallel networks. In International Conference on Evolutionary and Biologically Inspired Music and Art.

[17] Jozefowicz, R., Zaremba, W., and Sutskever, I. (2015). An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15).

[18] Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. In The International Conference on Learning Representations (ICLR), San Diego.

[19] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.

[20] Laukien, E., Crowder, R., and Byrne, F. (2016). Feynman machine: The universal dynamical systems computer. arXiv preprint arXiv:1609.03971.

[21] Newell, A., Yang, K., and Deng, J. (2016). Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision. Springer.

[22] Rasmus, A., Berglund, M., Honkala, M., Valpola, H., and Raiko, T. (2015). Semi-supervised learning with Ladder networks. In Advances in Neural Information Processing Systems.

[23] Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention.

[24] Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.

[25] Srivastava, N., Mansimov, E., and Salakhudinov, R. (2015). Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pages 843-852.

[26] Tarvainen, A. and Valpola, H. (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems.

[27] Tietz, M., Alpay, T., Twiefel, J., and Wermter, S. (2017). Semi-supervised phoneme recognition with recurrent ladder networks. In International Conference on Artificial Neural Networks 2017.

[28] Valpola, H. (2015). From neural PCA to deep unsupervised learning. In Advances in Independent Component Analysis and Learning Machines.

[29] Vinh, N. X., Epps, J., and Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct), 2837-2854.