{"title": "Variational Autoencoder for Deep Learning of Images, Labels and Captions", "book": "Advances in Neural Information Processing Systems", "page_first": 2352, "page_last": 2360, "abstract": "A novel variational autoencoder is developed to model images, as well as associated labels or captions. The Deep Generative Deconvolutional Network (DGDN) is used as a decoder of the latent image features, and a deep Convolutional Neural Network (CNN) is used as an image encoder; the CNN is used to approximate a distribution for the latent DGDN features/code. The latent code is also linked to generative models for labels (Bayesian support vector machine) or captions (recurrent neural network). When predicting a label/caption for a new image at test, averaging is performed across the distribution of latent codes; this is computationally efficient as a consequence of the learned CNN-based encoder. Since the framework is capable of modeling the image in the presence/absence of associated labels/captions, a new semi-supervised setting is manifested for CNN learning with images; the framework even allows unsupervised CNN learning, based on images alone.", "full_text": "Variational Autoencoder for Deep Learning\n\nof Images, Labels and Captions\n\nYunchen Pu\u2020, Zhe Gan\u2020, Ricardo Henao\u2020, Xin Yuan\u2021, Chunyuan Li\u2020, Andrew Stevens\u2020\n\n\u2020Department of Electrical and Computer Engineering, Duke University\n\n{yp42, zg27, r.henao, cl319, ajs104, lcarin}@duke.edu\n\nand Lawrence Carin\u2020\n\n\u2021Nokia Bell Labs, Murray Hill\nxyuan@bell-labs.com\n\nAbstract\n\nA novel variational autoencoder is developed to model images, as well as associated\nlabels or captions. The Deep Generative Deconvolutional Network (DGDN) is used\nas a decoder of the latent image features, and a deep Convolutional Neural Network\n(CNN) is used as an image encoder; the CNN is used to approximate a distribution\nfor the latent DGDN features/code. 
The latent code is also linked to generative\nmodels for labels (Bayesian support vector machine) or captions (recurrent neural\nnetwork). When predicting a label/caption for a new image at test, averaging is\nperformed across the distribution of latent codes; this is computationally ef\ufb01cient as\na consequence of the learned CNN-based encoder. Since the framework is capable\nof modeling the image in the presence/absence of associated labels/captions, a\nnew semi-supervised setting is manifested for CNN learning with images; the\nframework even allows unsupervised CNN learning, based on images alone.\n\n1\n\nIntroduction\n\nConvolutional neural networks (CNNs) [1] are effective tools for image analysis [2], with most CNNs\ntrained in a supervised manner [2, 3, 4]. In addition to being used in image classi\ufb01ers, image features\nlearned by a CNN have been used to develop models for image captions [5, 6, 7]. Most recent work\non image captioning employs a CNN for image encoding, with a recurrent neural network (RNN)\nemployed as a decoder of the CNN features, generating a caption.\nWhile large sets of labeled and captioned images have been assembled, in practice one typically\nencounters far more images without labels or captions. To leverage the vast quantity of these latter\nimages (and to tune a model to the speci\ufb01c unlabeled/uncaptioned images of interest at test), semi-\nsupervised learning of image features is of interest. To account for unlabeled/uncaptioned images,\nit is useful to employ a generative image model, such as the recently developed Deep Generative\nDeconvolutional Network (DGDN) [8, 9]. However, while the CNN is a feedforward model for image\nfeatures (and is therefore fast at test time), the original DGDN implementation required relatively\nexpensive inference of the latent image features. 
Specifically, in [8] parameter learning and inference are performed with Gibbs sampling or Monte Carlo Expectation-Maximization (MCEM).\nWe develop a new variational autoencoder (VAE) [10] setup to analyze images. The DGDN [8] is used as a decoder, and the encoder for the distribution of latent DGDN parameters is based on a CNN (termed a \u201crecognition model\u201d [10, 11]). Since a CNN is used within the recognition model, test-time speed is much faster than that achieved in [8]. The VAE framework manifests a novel means of semi-supervised CNN learning: a Bayesian SVM [12] leverages available image labels, the DGDN models the images (with or without labels), and the CNN manifests a fast encoder for the distribution of latent codes. For image-caption modeling, latent codes are shared between the CNN encoder, DGDN decoder, and RNN caption model; the VAE learns all model parameters jointly. These models are also applicable to images alone, yielding an unsupervised method for CNN learning.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nOur DGDN-CNN model for images is related to but distinct from prior convolutional variational auto-encoder networks [13, 14, 15]. In those models the pooling process in the encoder network is deterministic (max-pooling), as is the unpooling process in the decoder [14] (related to upsampling [13]). 
Our model uses stochastic unpooling, in which the unpooling map (upsampling) is inferred from the data, by maximizing a variational lower bound.\nSummarizing, the contributions of this paper include: (i) a new VAE-based method for deep deconvolutional learning, with a CNN employed within a recognition model (encoder) for the posterior distribution of the parameters of the image generative model (decoder); (ii) demonstration that the fast CNN-based encoder applied to the DGDN yields accuracy comparable to that provided by Gibbs-sampling and MCEM-based inference, while being much faster at test time; (iii) the first semi-supervised CNN classification results, applied to large-scale image datasets; and (iv) extensive experiments on image-caption modeling, in which we demonstrate the advantages of jointly learning the image features and caption model (we also present semi-supervised experiments for image captioning).\n\n2 Variational Autoencoder Image Model\n\n2.1 Image Decoder: Deep Deconvolutional Generative Model\n\nConsider N images {X(n)}_{n=1}^{N}, with X(n) \u2208 R^{Nx\u00d7Ny\u00d7Nc}; Nx and Ny represent the number of pixels in each spatial dimension, and Nc denotes the number of color bands in the image (Nc = 1 for gray-scale images and Nc = 3 for RGB images).\nTo introduce the image decoder (generative model) in its simplest form, we first consider a decoder with L = 2 layers. 
The code {S(n,k2,2)}_{k2=1}^{K2} feeds the decoder at the top (layer 2), and at the bottom (layer 1) the image X(n) is generated:\n\nLayer 2: \u02dcS(n,2) = \u2211_{k2=1}^{K2} D(k2,2) \u2217 S(n,k2,2)   (1)\nUnpool: S(n,1) \u223c unpool(\u02dcS(n,2))   (2)\nLayer 1: \u02dcS(n,1) = \u2211_{k1=1}^{K1} D(k1,1) \u2217 S(n,k1,1)   (3)\nData Generation: X(n) \u223c N(\u02dcS(n,1), \u03b10^{-1} I)   (4)\n\nEquation (4) is meant to indicate that E(X(n)) = \u02dcS(n,1), and each element of X(n) \u2212 E(X(n)) is iid zero-mean Gaussian with precision \u03b10.\nConcerning notation, expressions with two superscripts, D(kl,l), S(n,l) and \u02dcS(n,l), for layer l \u2208 {1, 2} and image n \u2208 {1, . . . , N}, are 3D tensors. Expressions with three superscripts, S(n,kl,l), are 2D activation maps, representing the kl-th \u201cslice\u201d of 3D tensor S(n,l); S(n,kl,l) is the spatially-dependent activation map for image n, dictionary element kl \u2208 {1, . . . , Kl}, at layer l of the model. Tensor S(n,l) is formed by spatially aligning and \u201cstacking\u201d the {S(n,kl,l)}_{kl=1}^{Kl}. Convolution D(kl,l) \u2217 S(n,kl,l) between 3D D(kl,l) and 2D S(n,kl,l) indicates that each of the K_{l\u22121} 2D \u201cslices\u201d of D(kl,l) is convolved with the spatially-dependent S(n,kl,l); upon aligning and \u201cstacking\u201d these convolutions, a tensor output is manifested for D(kl,l) \u2217 S(n,kl,l) (that tensor has K_{l\u22121} 2D slices).\nAssuming the dictionary elements {D(kl,l)} and the precision \u03b10 are known, we now discuss the generative process of the decoder. The layer-2 activation maps {S(n,k2,2)}_{k2=1}^{K2} are the code that enters the decoder. Activation map S(n,k2,2) is spatially convolved with D(k2,2), yielding a 3D tensor; summing over the K2 such tensors manifested at layer 2 yields the pooled 3D tensor \u02dcS(n,2). Stochastic unpooling (discussed below) is employed to go from \u02dcS(n,2) to S(n,1). 
Slice k1 of S(n,1), S(n,k1,1), is convolved with D(k1,1), and summing over k1 yields E(X(n)).\nFor the stochastic unpooling, S(n,k1,1) is partitioned into contiguous px \u00d7 py pooling blocks (analogous to pooling blocks in CNN-based activation maps [1]). Let z(n,k1,1)_{i,j} \u2208 {0, 1}^{px py} be a vector of px py \u2212 1 zeros and a single one; z(n,k1,1)_{i,j} corresponds to pooling block (i, j) in S(n,k1,1). The location of the non-zero element of z(n,k1,1)_{i,j} identifies the location of the single non-zero element in the corresponding pooling block of S(n,k1,1). The non-zero element in pooling block (i, j) of S(n,k1,1) is set to \u02dcS(n,k1,2)_{i,j}, i.e., element (i, j) in slice k1 of \u02dcS(n,2). Within the prior of the decoder, we impose z(n,k1,1)_{i,j} \u223c Mult(1; 1/(px py), . . . , 1/(px py)). Both \u02dcS(n,2) and S(n,1) are 3D tensors with K1 2D slices; as a result of the unpooling, the 2D slices in the sparse S(n,1) have px py times more elements than the corresponding slices in the dense \u02dcS(n,2).\nThe above model may be replicated to constitute L > 2 layers. The decoder is represented concisely as p\u03b1(X|s, z), where vector s denotes the \u201cunwrapped\u201d set of top-layer features {S(\u00b7,kL,L)}, and vector z denotes the unpooling maps at all L layers. The model parameters \u03b1 are the set of dictionary elements at the L layers, as well as the precision \u03b10. The prior over the code is p(s) = N(0, I).\n\n2.2 Image Encoder: Deep CNN\n\nTo make explicit the connection between the proposed CNN-based encoder and the above decoder, we also initially illustrate the encoder with an L = 2 layer model. 
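Before turning to the encoder, the stochastic unpooling just described can be illustrated with a minimal sketch (plain Python; the function and variable names are ours, not from the paper — each dense entry is placed at a uniformly drawn location of its px-by-py block, matching the decoder's multinomial prior):

```python
import random

def stochastic_unpool(dense, px, py, rng=random):
    """Unpool a 2D map `dense` (H x W) into an (H*px) x (W*py) sparse map.

    For every pooling block (i, j), one location is drawn uniformly (the
    prior Mult(1; 1/(px*py), ..., 1/(px*py)) of the decoder), and the dense
    value dense[i][j] is placed there; all other entries remain zero.
    """
    H, W = len(dense), len(dense[0])
    sparse = [[0.0] * (W * py) for _ in range(H * px)]
    for i in range(H):
        for j in range(W):
            k = rng.randrange(px * py)   # index of the single 1 in z_{i,j}
            di, dj = divmod(k, py)       # (row, col) inside the block
            sparse[i * px + di][j * py + dj] = dense[i][j]
    return sparse
```

Each block of the output carries exactly one (possibly zero-valued) copy of the corresponding dense entry, so the sparse map has px*py times as many elements as the dense one, as stated above.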
While the two-layer decoder in (1)-(4) is top-down, starting at layer 2, the encoder is bottom-up, starting at layer 1 with image X(n):\n\nLayer 1: \u02dcC(n,k1,1) = X(n) \u2217s F(k1,1), k1 = 1, . . . , K1   (5)\nPool: C(n,1) \u223c pool(\u02dcC(n,1))   (6)\nLayer 2: \u02dcC(n,k2,2) = C(n,1) \u2217s F(k2,2), k2 = 1, . . . , K2   (7)\nCode Generation: sn \u223c N(\u00b5\u03c6(\u02dcC(n,2)), diag(\u03c32_\u03c6(\u02dcC(n,2))))   (8)\n\nImage X(n) and filter F(k1,1) are each tensors, composed of Nc stacked 2D images (\u201cslices\u201d). To implement X(n) \u2217s F(k1,1), the respective spatial slices of X(n) and F(k1,1) are convolved; the results of the Nc convolutions are aligned spatially and summed, yielding a single 2D spatially-dependent filter output \u02dcC(n,k1,1) (hence notation \u2217s, to distinguish \u2217 in (1)-(4)).\nThe 2D maps {\u02dcC(n,k1,1)}_{k1=1}^{K1} are aligned spatially and \u201cstacked\u201d to constitute the 3D tensor \u02dcC(n,1). Each contiguous px \u00d7 py pooling region in \u02dcC(n,1) is stochastically pooled to constitute C(n,1); the posterior pooling statistics in (6) are detailed below. Finally, the pooled tensor C(n,1) is convolved with K2 layer-2 filters {F(k2,2)}_{k2=1}^{K2}, each of which yields the 2D feature map \u02dcC(n,k2,2); the K2 feature maps {\u02dcC(n,k2,2)}_{k2=1}^{K2} are aligned and \u201cstacked\u201d to manifest \u02dcC(n,2).\nConcerning the pooling in (6), let \u02dcC(n,k1,1)_{i,j} reflect the px py components in pooling block (i, j) of \u02dcC(n,k1,1). Using a multi-layered perceptron (MLP), this is mapped to the px py-dimensional real vector \u03b7(n,k1,1)_{i,j}. The pooling vector is drawn z(n,k1,1)_{i,j} \u223c Mult(1; Softmax(\u03b7(n,k1,1)_{i,j})); as a recognition model, Mult(1; Softmax(\u03b7(n,k1,1)_{i,j})) is also treated as the posterior distribution for the DGDN unpooling in (2). 
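The stochastic pooling draw in (6) can likewise be sketched: per-block probabilities come from Softmax(η), and one element of the block is kept. This is an illustrative stand-in (names ours, η supplied externally rather than by the recognition MLP):

```python
import math
import random

def softmax(eta):
    """Numerically stable softmax over a list of reals."""
    m = max(eta)
    exps = [math.exp(v - m) for v in eta]
    total = sum(exps)
    return [e / total for e in exps]

def stochastic_pool_block(block_vals, eta, rng=random):
    """Draw z ~ Mult(1; Softmax(eta)) over a px*py pooling block and keep
    the selected element; returns (pooled value, sampled index)."""
    probs = softmax(eta)
    u, acc = rng.random(), 0.0
    for idx, p in enumerate(probs):
        acc += p
        if u <= acc:
            return block_vals[idx], idx
    return block_vals[-1], len(block_vals) - 1  # guard against rounding
```

Because the same Mult(1; Softmax(η)) distribution is reused as the posterior over the decoder's unpooling map, a single sampled index serves both the pooling (encoder) and unpooling (decoder) directions.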
Similarly, to constitute functions \u00b5\u03c6(\u02dcC(n,2)) and \u03c32_\u03c6(\u02dcC(n,2)) in (8), each layer of \u02dcC(n,2) is fed through a distinct MLP. Details are provided in the Supplementary Material (SM).\nParameters \u03c6 of q\u03c6(s, z|X) correspond to the filter banks {F(kl,l)}, as well as the parameters of the MLPs. The encoder is a CNN (yielding fast testing), utilized in a novel manner to manifest a posterior distribution on the parameters of the decoder. As discussed in Section 4, the CNN is trained in a novel manner, allowing semi-supervised and even unsupervised CNN learning. (Concerning the pooling MLP: z(n,k1,1)_{i,j} \u223c Mult(1; Softmax(\u03b7(n,k1,1)_{i,j})), with \u03b7(n,k1,1)_{i,j} = MLP(\u02dcC(n,k1,1)_{i,j}) defined as \u03b7(n,k1,1)_{i,j} = W1 h, with h = tanh(W2 vec(\u02dcC(n,k1,1)_{i,j})).)\n\n3 Leveraging Labels and Captions\n\n3.1 Generative Model for Labels: Bayesian SVM\n\nAssume a label \u2113n \u2208 {1, . . . , C} is associated with training image X(n); in the discussion that follows, labels are assumed available for each image (for notational simplicity), but in practice only a subset of the N training images need have labels. We design C one-versus-all binary SVM classifiers [16], responsible for mapping top-layer image features sn to label \u2113n; sn is the same image code as in (8), from the top DGDN layer. For the \u2113-th classifier, with \u2113 \u2208 {1, . . . , C}, the problem may be posed as training with {sn, y(\u2113)_n}_{n=1}^{N}, with y(\u2113)_n \u2208 {\u22121, 1}. If \u2113n = \u2113 then y(\u2113)_n = 1, and y(\u2113)_n = \u22121 otherwise. 
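The one-versus-all encoding just described — C binary ±1 label sets derived from the multiclass labels — can be written directly (an illustrative helper; the name is ours):

```python
def one_versus_all_labels(class_labels, C):
    """Map class labels l_n in {1, ..., C} to C binary label lists:
    y^(l)_n = +1 if l_n == l, and -1 otherwise."""
    return {l: [1 if ln == l else -1 for ln in class_labels]
            for l in range(1, C + 1)}
```

Each of the C resulting lists is the training target for one binary SVM; images without labels simply contribute no entry to any list, which is how the semi-supervised setting arises.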
Henceforth we consider the Bayesian SVM for each one of the binary learning tasks, with labeled data {sn, yn}_{n=1}^{N}.\nGiven a feature vector s, the goal of the SVM is to find an f(s) that minimizes the objective function \u03b3 \u2211_{n=1}^{N} max(1 \u2212 yn f(sn), 0) + R(f(s)), where max(1 \u2212 yn f(sn), 0) is the hinge loss, R(f(s)) is a regularization term that controls the complexity of f(s), and \u03b3 is a tuning parameter controlling the trade-off between error penalization and the complexity of the classification function. Recently, [12] showed that for the linear classifier f(s) = \u03b2^T s, minimizing the SVM objective function is equivalent to estimating the mode of the pseudo-posterior of \u03b2: p(\u03b2|S, y, \u03b3) \u221d \u220f_{n=1}^{N} L(yn|sn, \u03b2, \u03b3) p(\u03b2|\u00b7), where y = [y1 . . . yN]^T, S = [s1 . . . sN], L(yn|sn, \u03b2, \u03b3) is the pseudo-likelihood function, and p(\u03b2|\u00b7) is the prior distribution for the vector of coefficients \u03b2. In [12] it was shown that L(yn|sn, \u03b2, \u03b3) admits a location-scale mixture of normals representation by introducing latent variables \u03bbn:\n\nL(yn|sn, \u03b2, \u03b3) = e^{\u22122\u03b3 max(1\u2212yn \u03b2^T sn, 0)} = \u222b_0^\u221e (\u221a\u03b3 / \u221a(2\u03c0\u03bbn)) exp(\u2212(1 + \u03bbn \u2212 yn \u03b2^T sn)\u00b2 / (2\u03b3^{\u22121}\u03bbn)) d\u03bbn.   (9)\n\nNote that (9) is a mixture of Gaussian distributions w.r.t. random variable yn \u03b2^T sn, where the mixture is formed with respect to \u03bbn, which controls the mean and variance of the Gaussians. This representation enables data augmentation via the variable \u03bbn, permitting efficient Bayesian inference (see [12, 17] for details).\nParameters {\u03b2\u2113}_{\u2113=1}^{C} for the C binary SVM classifiers are analogous to the fully connected parameters of a softmax classifier connected to the top of a traditional CNN [2]. 
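The location-scale mixture identity in (9) can be checked numerically: the sketch below compares the hinge pseudo-likelihood against a midpoint-rule evaluation of the λn integral (illustrative code; the grid bounds, step size, and function names are our choices, with u standing in for yn β^T sn):

```python
import math

def hinge_pseudo_likelihood(u, gamma):
    """Left-hand side of (9): exp(-2*gamma*max(1 - u, 0))."""
    return math.exp(-2.0 * gamma * max(1.0 - u, 0.0))

def mixture_integral(u, gamma, lam_max=60.0, dlam=1e-3):
    """Midpoint-rule approximation of the lambda integral in (9):
    int_0^inf sqrt(gamma)/sqrt(2*pi*lam) *
              exp(-gamma*(1 + lam - u)**2 / (2*lam)) dlam."""
    total = 0.0
    lam = dlam / 2.0
    while lam < lam_max:
        density = (math.sqrt(gamma) / math.sqrt(2.0 * math.pi * lam)
                   * math.exp(-gamma * (1.0 + lam - u) ** 2 / (2.0 * lam)))
        total += density * dlam
        lam += dlam
    return total
```

Note the exponent uses 2γ⁻¹λn as in (9), i.e., −γ(1 + λn − u)²/(2λn); for u > 1 the hinge is inactive and both sides reduce to 1.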
If desired, the pseudo-likelihood of the SVM-based classifier can be replaced by a softmax-based likelihood. In Section 5 we compare performance of the SVM-based and softmax-based classifiers.\n\n3.2 Generative Model for Captions\n\nThe caption for image n is the word sequence y(n)_1, . . . , y(n)_{Tn}, modeled as p(y(n)_1 |sn) \u220f_{t=2}^{Tn} p(y(n)_t |y(n)_{<t}); the t-th word, y(n)_t, is embedded into an M-dimensional vector w(n)_t = We y(n)_t