{"title": "Separating Style and Content", "book": "Advances in Neural Information Processing Systems", "page_first": 662, "page_last": 668, "abstract": null, "full_text": "Separating Style and Content \n\nJoshua B. Tenenbaum \n\nDept. of Brain and Cognitive Sciences \nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \njbtGpsyche.mit.edu \n\nWilliam T. Freeman \n\nMERL, Mitsubishi Electric Res. Lab. \n\n201 Broadway \n\nCambridge, MA 02139 \n\nfreemanOmerl.com \n\nAbstract \n\nWe seek to analyze and manipulate two factors, which we call style \nand content, underlying a set of observations. We fit training data \nwith bilinear models which explicitly represent the two-factor struc(cid:173)\nture. These models can adapt easily during testing to new styles or \ncontent, allowing us to solve three general tasks: extrapolation of a \nnew style to unobserved content; classification of content observed \nin a new style; and translation of new content observed in a new \nstyle. For classification, we embed bilinear models in a probabilistic \nframework, Separable Mixture Models (SMMsj, which generalizes \nearlier work on factorial mixture models [7, 3]. Significant per(cid:173)\nformance improvement on a benchmark speech dataset shows the \nbenefits of our approach. \n\n1 \n\nIntroduction \n\nIn many pattern analysis or synthesis tasks, the observed data are generated from \nthe interaction of two underlying factors which we will generically call \"style\" and \n\"content.\" For example, in a character recognition task, we might observe different \nletters in different fonts (see Fig. 1); with handwriting, different words in different \nwriting styles; with speech, different phonemes in different accents; with visual \nimages, the faces of different people under different lighting conditions. \n\nSuch data raises a number of learning problems. 
Extracting a hidden two-factor structure given only the raw observations has received significant attention [7, 3], but unsupervised factorial learning of this kind has yet to prove tractable for our focus: real-world data with subtly interacting factors. We work in a more supervised setting, where labels for style or content may be available during training or testing. Figure 1 shows three problems we want to solve. Given a labelled training set of observations in multiple styles, we want to extrapolate a new style to unobserved content classes (Fig. 1a), classify content observed in a new style (Fig. 1b), and translate new content observed in a new style (Fig. 1c).

This paper treats these problems in a common framework, by fitting the training data with a separable model that can easily adapt during testing to new styles or content classes. We write an observation vector in style s and content class c as y^{sc}. We seek to fit these observations with some model

y^{sc} = f(a^s, b^c; W),    (1)

where a particular functional form of f is assumed. We must estimate parameter vectors a^s and b^c describing style s and content c, respectively, and W, parameters for f that are independent of style and content but govern their interaction.

[Figure 1 shows a grid of letters (content classes, rows of the same alphabet) rendered in several fonts (styles), with cells marked "?" to be filled in for each of the three tasks: (a) extrapolation, (b) classification, (c) translation.]

Figure 1: Given observations of content (letters) in different styles (fonts), we want to extrapolate, classify, and translate observations from a new style or content class.

In terms of Fig.
1 (and in the spirit of [8]), the model represents what the elements of each row have in common independent of column (a^s), what the elements of each column have in common independent of row (b^c), and what all elements have in common independent of row and column (W). With these three modular components, we can solve problems like those illustrated in Fig. 1. For example, we can extrapolate a new style to unobserved content classes (Fig. 1a) by combining content and interaction parameters learned during training with style parameters estimated from available data in the new style.

2 Bilinear models

We propose to separate style and content using bilinear models - two-factor models that are linear in either factor when the other is held constant. These simple models are still complex enough to model subtle interactions of style and content. The empirical success of linear models in many pattern recognition applications with single-factor data (e.g. "eigenface" models of faces under varying identity but constant illumination and pose [15], or under varying illumination but constant identity and pose [5]) makes bilinear models a natural choice when two such factors vary independently across the data set. Also, many of the computationally desirable properties of linear models extend to bilinear models. Model fitting (discussed in Section 3 below) is easy, based on efficient and well-known techniques such as the singular value decomposition (SVD) and the expectation-maximization (EM) algorithm. Model complexity can be controlled by varying model dimensionality to achieve a compromise between reproduction of the training data and generalization during testing. Finally, the approach extends to multilinear models [10], for data generated by three or more interacting factors.

We have explored two bilinear models for Eq. 1.
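The defining property - linearity in either factor when the other is held fixed - can be checked numerically. A minimal sketch with made-up numbers (the tensor W and the vectors below are illustrative, not learned from data):

```python
def bilinear(a, b, W):
    # y[k] = sum_ij a[i] * W[k][i][j] * b[j]: a bilinear form in a and b,
    # where W[k] is the k-th "basis" matrix coupling the two factors.
    return [sum(a[i] * W[k][i][j] * b[j]
                for i in range(len(a)) for j in range(len(b)))
            for k in range(len(W))]

# Toy interaction tensor: 2 output dimensions, 2-dim style, 2-dim content.
W = [[[1.0, 2.0], [0.5, -1.0]],
     [[0.0, 1.0], [3.0, 2.0]]]
a1, a2, b = [1.0, 2.0], [0.5, -1.0], [2.0, 1.0]

# Linearity in the style factor a, with the content factor b held fixed:
lhs = bilinear([x + y for x, y in zip(a1, a2)], b, W)
rhs = [x + y for x, y in zip(bilinear(a1, b, W), bilinear(a2, b, W))]
assert all(abs(l - r) < 1e-9 for l, r in zip(lhs, rhs))
```

The same check with the roles of a and b swapped passes as well, which is exactly what "bilinear" means.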
In the symmetric model (so called because it treats the two factors symmetrically), we assume f is a bilinear mapping given by

y_k^{sc} = a^{sT} W_k b^c = sum_{ij} a_i^s b_j^c w_{ijk}.    (2)

The w_{ijk} parameters represent a set of basis functions independent of style and content, which characterize the interaction between these two factors. Observations in style s and content c are generated by mixing these basis functions with coefficients given by the tensor product of the a^s and b^c vectors. The model exactly reproduces the observations when the dimensionalities of a^s and b^c equal the number of styles N_s and content classes N_c observed. It finds coarser but more compact representations as these dimensionalities are decreased.

Sometimes it may not be practical to represent both style and content with low-dimensional vectors. For example, a linear combination of a few basis styles learned during training may not describe new styles well. We can obtain more flexible, asymmetric bilinear models by letting the basis functions w_{ijk} themselves depend on style or content. For example, if the basis functions are allowed to depend on style, the bilinear model from Eq. 2 becomes y_k^{sc} = sum_{ij} a_i^s b_j^c w_{ijk}^s. This simplifies to y_k^{sc} = sum_j A_{kj}^s b_j^c, by summing out the i index and identifying A_{kj}^s = sum_i a_i^s w_{ijk}^s. In vector notation, we have

y^{sc} = A^s b^c,    (3)

where A^s is a matrix of basis functions specific to style s (independent of content), and b^c is a vector of coefficients specific to content c (independent of style). Alternatively, the basis functions may depend on content, which gives

y^{sc} = B^c a^s.    (4)

Asymmetric models do not parameterize the rendering function f independently of style and content, and so cannot translate across both factors simultaneously (Fig. 1c). Further, a matrix representation of style or content may be too flexible and overfit the training data.
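The asymmetric rendering of Eq. 3 is just a matrix-vector product: each style owns a basis matrix, each content class a coefficient vector. A minimal sketch with hypothetical styles, people, and dimensions (all names and numbers below are illustrative):

```python
def render(A_s, b_c):
    # y^{sc} = A^s b^c: mix the style-specific basis vectors (columns of A^s,
    # a K x J matrix) with the content-specific coefficients b^c (length J).
    K, J = len(A_s), len(A_s[0])
    return [sum(A_s[k][j] * b_c[j] for j in range(J)) for k in range(K)]

# Two hypothetical styles over a 3-dimensional observation, J = 2 basis vectors.
A = {"pose_left":  [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
     "pose_right": [[0.0, 1.0], [1.0, 0.0], [2.0, 0.0]]}
b = {"person_A": [1.0, 2.0], "person_B": [3.0, 0.5]}

y = render(A["pose_left"], b["person_A"])  # -> [1.0, 2.0, 3.0]
```

Because the same b^{person} is reused across every A^{pose}, rendering a known person in a new pose only requires estimating the new pose's basis matrix, which is the adaptation idea developed in Section 3.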
But if overfitting is not a problem or can be controlled by some additional constraint, asymmetric models may solve extrapolation and classification tasks using less training data than symmetric models.

Figure 2 illustrates an example of an asymmetric model used to separate style and content. We have collected a small database of face images, with 11 different people (content classes) in 15 different head poses (styles). The images are 22 x 32 pixels, which we treat as 704-dimensional vectors. A subset of the data is shown in Fig. 2a. Fig. 2b schematically depicts an asymmetric bilinear model of the data, with each pose represented by a set of basis vectors A^{pose} (shown as images) and each person represented by a set of coefficients b^{person}. To render an image of a particular person in a particular pose, the pose-specific basis vectors are mixed according to the person-specific coefficients. Note that the basis vectors for each pose look like eigenfaces [15] in the appropriate style of each pose. However, the bilinear structure of the model ensures that corresponding basis vectors play corresponding roles across poses (e.g. the first vector holds (roughly) the mean face for that pose, the second may modulate overall head size, the third may modulate head thickness, etc.), which is crucial for adapting to new styles or content classes.

3 Model fitting

All the tasks shown in Fig. 1 break down into a training phase and a testing phase; both involve some model fitting. In the training phase (corresponding to the first 5 rows and columns of Figs. 1a-c), we learn all the parameters of a bilinear model from a complete matrix of observations of N_c content classes in N_s styles. In the testing phase (corresponding to the final rows of Figs. 1a, b and the final row and last 3 columns of Fig.
1c), we adapt the same model to data in a new style or content class (or both), estimating new parameters for the new style or content while clamping the other parameters. Then new and old parameters are combined to accomplish the desired classification, extrapolation, or translation task. This section focuses on the asymmetric model and its use in extrapolation and classification. Training and adaptation procedures for the symmetric model are similar and based on algorithms in [10, 11]. In [2], we describe these procedures and their application to extrapolation and translation tasks.

[Figure 2 shows (a) a subset of the face database and (b) faces rendered from y = A^{pose} b^{person}, with basis images A^{up-right}, A^{up-straight}, and A^{up-left}.]

Figure 2: An illustration of the asymmetric model, with faces varying in identity and head pose.

3.1 Training

Let n^{sc} be the number of observations in style s and content c, and let m^{sc} = sum y^{sc} be the sum of these observations. Then estimates of A^s and b^c that minimize the sum-of-squared-errors for the asymmetric model in Eq. 3 can be found by iterating the fixed point equations

Â^s = [sum_c m^{sc} b^{cT}] [sum_c n^{sc} b^c b^{cT}]^{-1},
b̂^c = [sum_s n^{sc} A^{sT} A^s]^{-1} [sum_s A^{sT} m^{sc}],    (5)

obtained by setting derivatives of the error equal to 0. To ensure stability, we update the parameters according to A^s = (1 - eta) A^s + eta Â^s and b^c = (1 - eta) b^c + eta b̂^c, typically using a stepsize 0.2 < eta < 0.5. Replacing A^s with B^c and b^c with a^s yields the analogous procedure for training the model in Eq. 4.

If the same number of observations are available for all style-content pairs, there exists a closed-form procedure to fit the asymmetric model using the SVD. Let the K-dimensional vector ybar^{sc} denote the mean of the observed data generated by style s and content c, and stack these vectors into a single (K x N_s) x N_c matrix

Y = [ ybar^{11}     ...  ybar^{1 N_c}   ]
    [    ...        ...      ...        ]
    [ ybar^{N_s 1}  ...  ybar^{N_s N_c} ].    (6)

We compute the SVD of Y = USV^T, and define the (K x N_s) x J matrix A to be the first J columns of U, and the J x N_c matrix B to be the first J rows of SV^T.
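As a concrete sketch of the iterative fit, the following toy implementation specializes the fixed point updates to model dimensionality J = 1 (each A^s collapses to a single basis vector and each b^c to a scalar) with one observation per style-content pair, so each matrix inverse reduces to a scalar division; the data layout and names are illustrative, not the procedure used in the paper's experiments:

```python
def fit_asymmetric(ybar, n_iter=50):
    # ybar[s][c] = mean observation vector for style s, content c.
    # Alternating least squares for y^{sc} ~ A^s * b^c with J = 1:
    # each update is the fixed-point estimate with the other factor clamped.
    Ns, Nc, K = len(ybar), len(ybar[0]), len(ybar[0][0])
    A = [[1.0] * K for _ in range(Ns)]   # one basis vector per style
    b = [1.0] * Nc                       # one scalar coefficient per content class
    for _ in range(n_iter):
        denom = sum(bc * bc for bc in b)             # J=1 analogue of sum_c b b^T
        for s in range(Ns):
            A[s] = [sum(b[c] * ybar[s][c][k] for c in range(Nc)) / denom
                    for k in range(K)]
        for c in range(Nc):
            denom = sum(sum(ak * ak for ak in A[s])  # J=1 analogue of sum_s A^T A
                        for s in range(Ns))
            b[c] = sum(A[s][k] * ybar[s][c][k]
                       for s in range(Ns) for k in range(K)) / denom
    return A, b

# Rank-1 synthetic data y^{sc} = a^s * b^c is recovered exactly (up to scale).
a_true, b_true = [[1.0, 2.0], [3.0, 1.0]], [1.0, 2.0, 0.5]
ybar = [[[a_true[s][k] * b_true[c] for k in range(2)] for c in range(3)]
        for s in range(2)]
A, b = fit_asymmetric(ybar)
assert abs(A[0][0] * b[1] - ybar[0][1][0]) < 1e-6
```

For J > 1 each update becomes a J x J linear solve, and on real data the damped update with stepsize eta described in the text would be used in place of the raw fixed-point step.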
\nFinally, we identify A and B as the desired parameter estimates in stacked form \n\n\f666 \n\n(see also [9, 14]), \n\nJ. B. Tenenbaum and W T. Freeman \n\n(7) \n\nThe model dimensionality J can be chosen in various standard ways: by a priori \nconsiderations, by requiring a sufficiently good approximation to the data (as mea(cid:173)\nsured by mean squared error or some more subjective metric), or by looking for a \ngap in the singular value spectrum. \n\n3.2 Testing \n\nIt is straightforward to adapt the asymmetric model to an incomplete new style s* , \nin order to extrapolate that style to unseen content. We simply estimate A 3 \nfrom \nEq. 5, using b C values learned during training and restricting the sums over e to \nthose content classes observed in the new style. Then data in content e and style s* \ncan be synthesized from A $- b C \u2022 Extrapolating incomplete new content to unseen \nstyles is done similarly. \nAdapting the asymmetric model for classification in new styles is more involved, \nbecause the content class of the new data (and possibly its style as well) is unla(cid:173)\nbeled. To deal with this uncertainty, we embed the bilinear model within a gaussian \nmixture model to yield a separable mixture model (SMM), which can then be fit ef(cid:173)\nficiently to data in new styles using the EM algorithm. Specifically, we assume that \nthe probability of a new, unlabeled observation y being generated by style sand \ncontent c is given by a spherical gaussian centered at the prediction of the asymmet(cid:173)\nric bilinear model: p(Yls, e)