{"title": "Learning to Learn with Compound HD Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2061, "page_last": 2069, "abstract": "We introduce HD (or ``Hierarchical-Deep'') models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian models. Specifically we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM). This compound HDP-DBM model learns to learn novel concepts from very few training examples, by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets.", "full_text": "Learning to Learn with Compound HD Models

Ruslan Salakhutdinov
Department of Statistics, University of Toronto
rsalakhu@utstat.toronto.edu

Joshua B. Tenenbaum
Brain and Cognitive Sciences, MIT
jbt@mit.edu

Antonio Torralba
CSAIL, MIT
torralba@mit.edu

Abstract

We introduce HD (or "Hierarchical-Deep") models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian models. Specifically we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM).
This compound HDP-DBM model learns to learn novel concepts from very few training examples, by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets.

1 Introduction

"Learning to learn", or the ability to learn abstract representations that support transfer to novel but related tasks, lies at the core of many problems in computer vision, natural language processing, cognitive science, and machine learning. In typical applications of machine classification algorithms today, learning curves are measured in tens, hundreds or thousands of training examples. For human learners, however, just one or a few examples are often sufficient to grasp a new category and make meaningful generalizations to novel instances [25, 16]. The architecture we describe here takes a step towards this "one-shot learning" ability by learning several forms of abstract knowledge that support transfer of useful representations from previously learned concepts to novel ones.

We call our architectures compound HD models, where "HD" stands for "Hierarchical-Deep", because they are derived by composing hierarchical nonparametric Bayesian models with deep networks, two influential approaches from the recent unsupervised learning literature with complementary strengths.
Recently introduced deep learning models, including Deep Belief Networks [5], Deep Boltzmann Machines [14], deep autoencoders [10], and others [12, 11], have been shown to learn useful distributed feature representations for many high-dimensional datasets. The ability to automatically learn in multiple layers allows deep models to construct sophisticated domain-specific features without the need to rely on precise human-crafted input representations, which is increasingly important with the proliferation of data sets and application domains.

While the features learned by deep models can enable more rapid and accurate classification learning, deep networks themselves are not well suited to one-shot learning of novel classes. All units and parameters at all levels of the network are engaged in representing any given input and are adjusted together during learning. In contrast, we argue that one-shot learning of new classes will be easier in architectures that can explicitly identify only a small number of degrees of freedom (latent variables and parameters) that are relevant to the new concept being learned, and thereby achieve more appropriate and flexible transfer of learned representations to new tasks. This ability is the hallmark of hierarchical Bayesian (HB) models, recently proposed in computer vision, statistics, and cognitive science [7, 25, 4, 13] for learning to learn from few examples. Unlike deep networks, these HB models explicitly represent category hierarchies that admit sharing the appropriate abstract knowledge about the new class's parameters via a prior abstracted from related classes. HB approaches, however, have complementary weaknesses relative to deep networks. They typically rely on domain-specific hand-crafted features [4, 1] (e.g. GIST or SIFT features in computer vision, MFCC features in speech perception domains).
Committing to a-priori defined feature representations, instead of learning them from data, can be detrimental. Moreover, many HB approaches often assume a fixed hierarchy for sharing parameters [17, 3] instead of learning the hierarchy in an unsupervised fashion.

In this work we investigate compound HD (hierarchical-deep) architectures that integrate these deep models with structured hierarchical Bayesian models. In particular, we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM), coming to represent both a layered hierarchy of increasingly abstract features and a tree-structured hierarchy of classes. Our model depends minimally on domain-specific representations and achieves state-of-the-art one-shot learning performance by unsupervised discovery of three components: (a) low-level features that abstract from the raw high-dimensional sensory input (e.g. pixels, or 3D joint angles); (b) high-level part-like features that express the distinctive perceptual structure of a specific class, in terms of class-specific correlations over low-level features; and (c) a hierarchy of super-classes for sharing abstract knowledge among related classes. We evaluate the compound HDP-DBM model on three different perceptual domains.
We also illustrate the advantages of having a full generative model, extending from highly abstract concepts all the way down to sensory inputs: we can not only generalize class labels but also synthesize new examples in novel classes that look reasonably natural, and we can significantly improve classification performance by learning parameters at all levels jointly by maximizing a joint log-probability score.

2 Deep Boltzmann Machines (DBMs)

A Deep Boltzmann Machine is a network of symmetrically coupled stochastic binary units. It contains a set of visible units v \in \{0,1\}^D, and a sequence of layers of hidden units h^1 \in \{0,1\}^{F_1}, h^2 \in \{0,1\}^{F_2}, ..., h^L \in \{0,1\}^{F_L}. There are connections only between hidden units in adjacent layers, as well as between visible and hidden units in the first hidden layer. Consider a DBM with three hidden layers^1 (i.e. L = 3). The probability of a visible input v is:

P(v; \psi) = \frac{1}{Z(\psi)} \sum_h \exp\Big( \sum_{ij} W^{(1)}_{ij} v_i h^1_j + \sum_{jl} W^{(2)}_{jl} h^1_j h^2_l + \sum_{lm} W^{(3)}_{lm} h^2_l h^3_m \Big),    (1)

where h = \{h^1, h^2, h^3\} is the set of hidden units, and \psi = \{W^{(1)}, W^{(2)}, W^{(3)}\} are the model parameters, representing visible-to-hidden and hidden-to-hidden symmetric interaction terms.

Approximate Learning: Exact maximum likelihood learning in this model is intractable, but efficient approximate learning of DBMs can be carried out by using mean-field inference to estimate data-dependent expectations, and an MCMC-based stochastic approximation procedure to approximate the model's expected sufficient statistics [14].
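To make the mean-field step concrete, the fixed-point updates for a three-layer DBM can be sketched in a few lines of NumPy. This is our own toy illustration, not the authors' implementation; the function name `mean_field` and the dimensions are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, W3, n_iter=50):
    # Fixed-point updates for the fully factorized posterior Q(h|v):
    # each unit's posterior mean is a logistic function of the mean
    # activities of the adjacent layers.
    mu1 = np.full(W1.shape[1], 0.5)
    mu2 = np.full(W2.shape[1], 0.5)
    mu3 = np.full(W3.shape[1], 0.5)
    for _ in range(n_iter):
        mu1 = sigmoid(v @ W1 + W2 @ mu2)    # input from v below and h^2 above
        mu2 = sigmoid(mu1 @ W2 + W3 @ mu3)  # input from h^1 below and h^3 above
        mu3 = sigmoid(mu2 @ W3)             # top layer only sees h^2
    return mu1, mu2, mu3

rng = np.random.default_rng(0)
D, F1, F2, F3 = 6, 5, 4, 3                  # hypothetical toy sizes
W1, W2, W3 = (rng.normal(0, 0.1, s) for s in [(D, F1), (F1, F2), (F2, F3)])
v = rng.integers(0, 2, D).astype(float)
mu1, mu2, mu3 = mean_field(v, W1, W2, W3)
```

With small weights the updates converge quickly; the resulting means would then serve both in the variational bound and as data-dependent statistics for the stochastic approximation step.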
In particular, consider approximating the true posterior P(h|v; \psi) with a fully factorized distribution over the three sets of hidden units:

Q(h|v; \mu) = \prod_{j=1}^{F_1} q(h^1_j|v) \prod_{k=1}^{F_2} q(h^2_k|v) \prod_{m=1}^{F_3} q(h^3_m|v),

where \mu = \{\mu^1, \mu^2, \mu^3\} are the mean-field parameters with q(h^l_i = 1) = \mu^l_i for l = 1, 2, 3. In this case, we can write down the variational lower bound on the log-probability of the data, which takes a particularly simple form:

\log P(v; \psi) \ge v^\top W^{(1)} \mu^1 + \mu^{1\top} W^{(2)} \mu^2 + \mu^{2\top} W^{(3)} \mu^3 - \log Z(\psi) + H(Q),    (2)

where H(\cdot) is the entropy functional. Learning proceeds by finding the value of \mu that maximizes this lower bound for the current value of model parameters \psi, which results in a set of mean-field fixed-point equations. Given the variational parameters \mu, the model parameters \psi are then updated to maximize the variational bound using stochastic approximation (for details see [14, 22, 26]).

Multinomial DBMs: To allow DBMs to express more information and introduce more structured hierarchical priors, we will use a conditional multinomial distribution to model activities of the top-level units. Specifically, we will use M softmax units, each with "1-of-K" encoding (so that each

^1 For clarity, we use three hidden layers. The extension to models with more than three layers is trivial.

Figure 1: Left: Multinomial DBM model: the top layer represents M softmax hidden units h^3, which share the same set of weights. Middle: A different interpretation: M softmax units are replaced by a single multinomial unit which is sampled M times.
Right: Hierarchical Dirichlet Process prior over the states of h^3.

unit contains a set of K weights). All M separate softmax units will share the same set of weights, connecting them to binary hidden units at the lower level (Fig. 1). A key observation is that M separate copies of softmax units that all share the same set of weights can be viewed as a single multinomial unit that is sampled M times [15, 19]. A pleasing property of using softmax units is that the mathematics underlying the learning algorithm for binary-binary DBMs remains the same.

3 Compound HDP-DBM model

After a DBM model has been learned, we have an undirected model that defines the joint distribution P(v, h^1, h^2, h^3). One way to express what has been learned is the conditional model P(v, h^1, h^2|h^3) and a prior term P(h^3). We can therefore rewrite the variational bound as:

\log P(v) \ge \sum_{h^1,h^2,h^3} Q(h|v; \mu) \log P(v, h^1, h^2|h^3) + H(Q) + \sum_{h^3} Q(h^3|v; \mu) \log P(h^3).    (3)

This particular decomposition lies at the core of the greedy recursive pretraining algorithm: we keep the learned conditional model P(v, h^1, h^2|h^3), but maximize the variational lower bound of Eq. 3 with respect to the last term [5]. Instead of adding an additional undirected layer (e.g. a restricted Boltzmann machine) to model P(h^3), we can place a hierarchical Dirichlet process prior over h^3, which will allow us to learn category hierarchies and, more importantly, useful representations of classes that contain few training examples.
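The equivalence noted above between M tied softmax units and a single multinomial unit sampled M times is easy to check numerically; this is an illustrative sketch with arbitrary toy numbers, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)
K, M = 5, 100_000                       # K states per softmax unit, M replicas
logits = rng.normal(size=K)             # tied weights -> one shared distribution
p = np.exp(logits) / np.exp(logits).sum()

# View 1: M separate softmax units with tied weights -> M independent draws.
counts_tied = np.bincount(rng.choice(K, size=M, p=p), minlength=K)

# View 2: a single multinomial unit sampled M times.
counts_multi = rng.multinomial(M, p)

# Both count vectors are Multinomial(M, p) distributed, so their empirical
# frequencies both converge to p as M grows.
print(np.abs(counts_tied / M - p).max() < 0.01)  # True
```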
The part we keep, P(v, h^1, h^2|h^3), represents a conditional DBM model, which can be viewed as a two-layer DBM but with bias terms given by the states of h^3:

P(v, h^1, h^2|h^3) = \frac{1}{Z(\psi, h^3)} \exp\Big( \sum_{ij} W^{(1)}_{ij} v_i h^1_j + \sum_{jl} W^{(2)}_{jl} h^1_j h^2_l + \sum_{lm} W^{(3)}_{lm} h^2_l h^3_m \Big).    (4)

3.1 A Hierarchical Bayesian Topic Prior

In a typical hierarchical topic model, we observe a set of N documents, each of which is modeled as a mixture over topics that are shared among documents. Let there be K words in the vocabulary. A topic t is a discrete distribution over K words with probability vector \phi_t. Each document n has its own distribution over topics given by probabilities \theta_n.

In our compound HDP-DBM model, we will use a hierarchical topic model as a prior over the activities of the DBM's top-level features. Specifically, the term "document" will refer to the top-level multinomial unit h^3, and the M "words" in the document will represent the M samples, or active DBM top-level features, generated by this multinomial unit. Words in each document are drawn by choosing a topic t with probability \theta_{nt}, and then choosing a word w with probability \phi_{tw}. We will often refer to topics as our learned higher-level features, each of which defines a topic-specific distribution over the DBM's h^3 features.
Let h^3_{in} be the i-th word in document n, and x_{in} be its topic:

\theta_n|\pi \sim Dir(\alpha\pi), \quad \phi_t|\tau \sim Dir(\beta\tau), \quad x_{in}|\theta_n \sim Mult(\theta_n), \quad h^3_{in}|x_{in}, \phi_{x_{in}} \sim Mult(\phi_{x_{in}}),    (5)

where \pi is the global distribution over topics, \tau is the global distribution over K words, and \alpha and \beta are concentration parameters.

Let us further assume that we are presented with a fixed two-level category hierarchy. Suppose that N documents, or objects, are partitioned into C basic-level categories (e.g. cow, sheep, car). We represent such a partition by a vector z^b of length N, each entry of which is z^b_n \in \{1, ..., C\}. We also assume that our C basic-level categories are partitioned into S super-categories (e.g. animal, vehicle), represented by a vector z^s of length C, with z^s_c \in \{1, ..., S\}. These partitions define a fixed two-level tree hierarchy (see Fig. 1). We will relax this assumption later.

The hierarchical topic model can be readily extended to modeling the above hierarchy. For each document n that belongs to the basic category c, we place a common Dirichlet prior over \theta_n with parameters \pi^{(1)}_c. The Dirichlet parameters \pi^{(1)} are themselves drawn from a Dirichlet prior with parameters \pi^{(2)}, and so on (see Fig. 1).
Specifically, we define the following prior over h^3:

\pi^{(2)}_s | \pi^{(3)}_g \sim Dir(\alpha^{(3)} \pi^{(3)}_g),    for each super-category s = 1, ..., S
\pi^{(1)}_c | \pi^{(2)}_{z^s_c} \sim Dir(\alpha^{(2)} \pi^{(2)}_{z^s_c}),    for each basic category c = 1, ..., C
\theta_n | \pi^{(1)}_{z^b_n} \sim Dir(\alpha^{(1)} \pi^{(1)}_{z^b_n}),    for each document n = 1, ..., N
x_{in} | \theta_n \sim Mult(\theta_n),    for each word i = 1, ..., M
\phi_t | \tau \sim Dir(\beta\tau),
h^3_{in} | x_{in}, \phi_{x_{in}} \sim Mult(\phi_{x_{in}}),    (6)

where \pi^{(3)}_g is the global distribution over topics, \pi^{(2)}_s is the super-category specific and \pi^{(1)}_c is the class-specific distribution over topics, or higher-level features. These high-level features, in turn, define topic-specific distributions over h^3 features, or "words" in the DBM model.

For a fixed number of topics T, the above model represents a hierarchical extension of LDA. We typically do not know the number of topics a-priori. It is therefore natural to consider a nonparametric extension based on the HDP model [21], which allows for a countably infinite number of topics. In the standard hierarchical Dirichlet process notation, we have:

G^{(3)}_g \sim DP(\gamma, Dir(\beta\tau)), \quad G^{(2)}_s \sim DP(\alpha^{(3)}, G^{(3)}_g), \quad G^{(1)}_c \sim DP(\alpha^{(2)}, G^{(2)}_{z^s_c}),
G_n \sim DP(\alpha^{(1)}, G^{(1)}_{z^b_n}), \quad \phi^*_{in}|G_n \sim G_n, \quad h^3_{in}|\phi^*_{in} \sim Mult(\phi^*_{in}),    (7)

where Dir(\beta\tau) is the base distribution, and each \phi^* is a factor associated with a single observation h^3_{in}. Making use of topic index variables x_{in}, we denote \phi^*_{in} = \phi_{x_{in}} (see Eq. 6).
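For a fixed number of topics T, ancestral sampling from the prior of Eq. 6 is straightforward. The sketch below is our own illustration of this finite (LDA-style) version under hypothetical toy sizes; the small epsilon only guards against zero-valued Dirichlet parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, M = 10, 50, 20        # topics, h^3 states ("words"), words per document
S, C, N = 2, 4, 8           # super-categories, basic categories, documents
a3, a2, a1, beta, eps = 5.0, 5.0, 5.0, 1.0, 1e-6

zs = rng.integers(S, size=C)   # fixed tree: basic category -> super-category
zb = rng.integers(C, size=N)   # document -> basic category

pi3 = rng.dirichlet(np.full(T, 1.0))                                      # global topic weights
pi2 = np.array([rng.dirichlet(a3 * pi3 + eps) for _ in range(S)])         # per super-category
pi1 = np.array([rng.dirichlet(a2 * pi2[zs[c]] + eps) for c in range(C)])  # per basic category
theta = np.array([rng.dirichlet(a1 * pi1[zb[n]] + eps) for n in range(N)])  # per document
phi = rng.dirichlet(np.full(K, beta), size=T)      # topic-specific word distributions

docs = []
for n in range(N):
    x = rng.choice(T, size=M, p=theta[n])                          # topic per word
    docs.append(np.array([rng.choice(K, p=phi[t]) for t in x]))    # active h^3 features
```

Each level reuses its parent's topic proportions as the mean of a Dirichlet, which is what lets sparsely observed classes borrow statistical strength from their super-category.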
Using a stick-breaking representation we can write:

G^{(3)}_g(\phi) = \sum_{t=1}^\infty \pi^{(3)}_{gt} \delta_{\phi_t}, \quad G^{(2)}_s(\phi) = \sum_{t=1}^\infty \pi^{(2)}_{st} \delta_{\phi_t}, \quad G^{(1)}_c(\phi) = \sum_{t=1}^\infty \pi^{(1)}_{ct} \delta_{\phi_t}, \quad G_n(\phi) = \sum_{t=1}^\infty \theta_{nt} \delta_{\phi_t},

which represent sums of point masses. We also place Gamma priors over the concentration parameters as in [21].

The overall generative model is shown in Fig. 1. To generate a sample we first draw M words, or activations of the top-level features, from the HDP prior over h^3 given by Eq. 7. Conditioned on h^3, we sample the states of v from the conditional DBM model given by Eq. 4.

3.2 Modeling the number of super-categories

So far we have assumed that our model is presented with a two-level partition z = \{z^s, z^b\}. If, however, we are not given any level-1 or level-2 category labels, we need to infer the distribution over the possible category structures. We place a nonparametric two-level nested Chinese Restaurant Prior (CRP) [2] over z, which defines a prior over tree structures and is flexible enough to learn arbitrary hierarchies. The main building block of the nested CRP is the Chinese restaurant process, a distribution on partitions of integers. Imagine a process by which customers enter a restaurant with an unbounded number of tables, where the n-th customer occupies a table k drawn from:

P(z_n = k | z_1, ..., z_{n-1}) = \frac{n_k}{n - 1 + \eta} if n_k > 0;  \frac{\eta}{n - 1 + \eta} if k is new,    (8)

where n_k is the number of previous customers at table k and \eta is the concentration parameter. The nested CRP, nCRP(\eta), extends the CRP to a nested sequence of partitions, one for each level of the tree. In this case each observation n is first assigned to the super-category z^s_n using Eq. 8. Its assignment to the basic-level category z^b_n, which is placed under the super-category z^s_n, is again recursively drawn from Eq. 8. We also place a Gamma prior \Gamma(1, 1) over \eta.
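The seating process of Eq. 8 can be sketched directly; `crp` is a hypothetical helper written for illustration, not the authors' code:

```python
import random

def crp(n_customers, eta, seed=0):
    # Eq. 8: customer n joins occupied table k with prob n_k/(n-1+eta)
    # and opens a new table with prob eta/(n-1+eta).
    rng = random.Random(seed)
    counts = []                       # n_k: customers at each table
    z = []                            # table index of each customer
    for _ in range(n_customers):
        r = rng.random() * (len(z) + eta)
        acc = 0.0
        for k, nk in enumerate(counts):
            acc += nk
            if r < acc:
                counts[k] += 1
                z.append(k)
                break
        else:                         # no existing table chosen: open a new one
            counts.append(1)
            z.append(len(counts) - 1)
    return z, counts

z, counts = crp(1000, eta=1.0)
print(len(counts))  # number of occupied tables; grows roughly as eta * log(n)
```

The nested CRP simply runs one such process at the super-category level, then an independent one among the basic categories under each chosen super-category.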
The proposed model allows for both a nonparametric prior over a potentially unbounded number of global topics, or higher-level features, and a nonparametric prior that allows learning an arbitrary tree taxonomy.

4 Inference

Inferences about model parameters at all levels of the hierarchy can be performed by MCMC. When the tree structure z of the model is not given, the inference process alternates between fixing z while sampling the space of model parameters, and vice versa.

Sampling HDP parameters: Given category assignment vectors z and the states of the top-level DBM features h^3, we use the posterior representation sampler of [20]. In particular, the HDP sampler maintains the stick-breaking weights \{\theta_n\}_{n=1}^N and \{\pi^{(1)}_c, \pi^{(2)}_s, \pi^{(3)}_g\}, and topic indicator variables x (parameters \phi can be integrated out). The sampler alternates between: (a) sampling cluster indices x_{in} using Gibbs updates in the Chinese restaurant franchise (CRF) representation of the HDP; (b) sampling the weights at all three levels conditioned on x using the usual posterior of a DP^2.

Sampling category assignments z: Given the current instantiation of the stick-breaking weights, using a defining property of a DP, for each input n we have:

(\theta_{1,n}, ..., \theta_{T,n}, \theta_{new,n}) \sim Dir(\alpha^{(1)}\pi^{(1)}_{z_n,1}, ..., \alpha^{(1)}\pi^{(1)}_{z_n,T}, \alpha^{(1)}\pi^{(1)}_{z_n,new}).    (9)

Combining the above likelihood term with the CRP prior (Eq. 8), the posterior over the category assignment can be calculated as follows:

p(z_n|\theta_n, z_{-n}, \pi^{(1)}) \propto p(\theta_n|\pi^{(1)}, z_n) p(z_n|z_{-n}),    (10)

where z_{-n} denotes the variables z for all observations other than n.
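Eq. 10 combines a Dirichlet likelihood for the document's stick-breaking weights with the CRP prior. Below is a minimal sketch of scoring the existing categories; all quantities are hypothetical toy values, and the new-category case (whose parameters are sampled from the prior) is omitted for brevity:

```python
import numpy as np
from math import lgamma

def log_dir_pdf(theta, alpha):
    # log density of Dir(alpha) evaluated at the point theta
    return (lgamma(alpha.sum()) - sum(lgamma(a) for a in alpha)
            + np.sum((alpha - 1.0) * np.log(theta)))

rng = np.random.default_rng(0)
T, C = 5, 3
a1, counts = 10.0, np.array([4, 3, 2])        # documents already in each category
pi1 = rng.dirichlet(np.full(T, 2.0), size=C)  # per-category topic weights pi^(1)_c
theta_n = rng.dirichlet(a1 * pi1[0])          # weights of the document to place

# p(z_n = c | theta_n, z_-n, pi^(1)) ∝ p(theta_n | pi^(1)_c) * p(z_n = c | z_-n)
log_post = np.array([log_dir_pdf(theta_n, a1 * pi1[c]) + np.log(counts[c])
                     for c in range(C)])
post = np.exp(log_post - log_post.max())      # subtract max for stability
post /= post.sum()
```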
When computing the probability of placing \theta_n under a newly created category, its parameters are sampled from the prior.

Sampling DBM's hidden units: Given the states of the DBM's top-level multinomial unit h^3_n, conditional samples from P(h^1_n, h^2_n | h^3_n, v_n) can be obtained by running a Gibbs sampler that alternates between sampling the states of h^1_n independently given h^2_n, and vice versa. Conditioned on the topic assignments x_{in} and h^2_n, the states of the multinomial unit h^3_n for each input n are sampled using Gibbs conditionals:

P(h^3_{in} | h^2_n, h^3_{-in}, x_n) \propto P(h^2_n | h^3_n) P(h^3_{in} | x_{in}),    (11)

where the first term is given by the product of logistic functions (see Eq. 4):

P(h^2|h^3) = \prod_l P(h^2_l|h^3), \quad with \quad P(h^2_l = 1|h^3) = \frac{1}{1 + \exp(-\sum_m W^{(3)}_{lm} h^3_m)},    (12)

and the second term P(h^3_{in}|x_{in}) is given by the multinomial Mult(\phi_{x_{in}}) (see Eq. 7; in our conjugate setting, parameters \phi can be further integrated out).

Fine-tuning DBM: More importantly, conditioned on h^3, we can further fine-tune the low-level DBM parameters \psi = \{W^{(1)}, W^{(2)}, W^{(3)}\} by applying approximate maximum likelihood learning (see Section 2) to the conditional DBM model of Eq. 4. For the stochastic approximation algorithm, as the partition function depends on the states of h^3, we maintain one "persistent" Markov chain per data point (for details see [22, 14]).

Making predictions: Given a test input v_t, we can quickly infer the approximate posterior over h^3_t using the mean-field of Eq. 2, followed by running the full Gibbs sampler to get approximate samples from the posterior over the category assignments.
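One Gibbs step of Eq. 11 for a single "word" can be sketched as follows, treating h^3_in as a 1-of-K indicator so that the likelihood term is the product of logistics of Eq. 12. All names and toy sizes here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
F2, K = 4, 6                               # h^2 units; K possible h^3 states
W3 = rng.normal(0, 0.5, (F2, K))
phi = rng.dirichlet(np.ones(K), size=3)    # topic-specific distributions over states
h2 = rng.integers(0, 2, F2)                # current h^2 sample
x_in = 1                                   # topic assignment of this word

def log_p_h2_given_state(k):
    # log prod_l P(h2_l | h^3 = e_k), with P(h2_l = 1 | e_k) = sigmoid(W3[l, k]);
    # log sigmoid(a) = a - log(1 + e^a) and log(1 - sigmoid(a)) = -log(1 + e^a)
    a = W3[:, k]
    return np.sum(h2 * a - np.log1p(np.exp(a)))

# Eq. 11: P(h^3_in = k | h^2, x_in) ∝ P(h^2 | h^3 = e_k) * phi[x_in, k]
logits = np.array([log_p_h2_given_state(k) + np.log(phi[x_in, k]) for k in range(K)])
post = np.exp(logits - logits.max())
post /= post.sum()
k_new = rng.choice(K, p=post)              # one Gibbs draw for this word's state
```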
In practice, for faster inference, we fix the learned topics \phi_t and approximate the marginal likelihood that h^3_t belongs to category z_t by assuming that the document-specific DP can be well approximated by the class-specific DP^3, G_t \approx G^{(1)}_{z_t} (see Fig. 1):

P(h^3_t | z_t, G^{(1)}, \phi) = \int_{G_t} P(h^3_t | \phi, G_t) P(G_t | G^{(1)}_{z_t}) dG_t \approx P(h^3_t | \phi, G^{(1)}_{z_t}).    (13)

Combining this likelihood term with the nCRP prior P(z_t|z_{-t}) (Eq. 8) allows us to efficiently infer the approximate posterior over category assignments^4.

^2 Conditioned on the draw of the super-class DP G^{(2)}_s and the state of the CRF, the posteriors over G^{(1)}_c become independent. We can easily speed up inference by sampling from these conditionals in parallel.
^3 We note that G^{(1)}_{z_t} = E[G_t | G^{(1)}_{z_t}].
^4 In all of our experimental results, computing this approximate posterior takes a fraction of a second.

Figure 2: A random subset of the 1st and 2nd layer DBM features, and higher-level class-sensitive HDP features/topics.

Figure 3: A typical partition of the 100 basic-level categories:
1. bed, chair, clock, couch, dinosaur, lawn mower, table, telephone, television, wardrobe
2. bus, house, pickup truck, streetcar, tank, tractor, train
3. crocodile, kangaroo, lizard, snake, spider, squirrel
4. hamster, mouse, rabbit, raccoon, possum, bear
5. apple, orange, pear, sunflower, sweet pepper
6. baby, boy, girl, man, woman
7. dolphin, ray, shark, turtle, whale
8. otter, porcupine, shrew, skunk
9. beaver, camel, cattle, chimpanzee, elephant
10. fox, leopard, lion, tiger, wolf
11. maple tree, oak tree, pine tree, willow tree
12. flatfish, seal, trout, worm
13. butterfly, caterpillar, snail
14. bee, crab, lobster
15. bridge, castle, road, skyscraper
16. bicycle, keyboard, motorcycle, orchid, palm tree
17. bottle, bowl, can, cup, lamp
18. cloud, plate, rocket
19. mountain, plain, sea
20. poppy, rose, tulip
21. aquarium fish, mushroom
22. beetle, cockroach
23. forest

5 Experiments

We present experimental results on the CIFAR-100 [8], handwritten character [9], and human motion capture recognition datasets. For all datasets, we first pretrain a DBM model in unsupervised fashion on raw sensory input (e.g. pixels, or 3D joint angles), followed by fitting an HDP prior, which is run for 200 Gibbs sweeps. We further run 200 additional Gibbs steps in order to fine-tune parameters of the entire compound HDP-DBM model. This was sufficient to reach convergence and obtain good performance. Across all datasets, we also assume that the basic-level category labels are given, but no super-category labels are available. The training set includes many examples of familiar categories but only a few examples of a novel class. Our goal is to generalize well on the novel class.

In all experiments we compare the performance of HDP-DBM to the following alternative models: stand-alone Deep Boltzmann Machines, Deep Belief Networks [5], a "Flat HDP-DBM" model that always uses a single super-category, SVMs, and k-NN. The Flat HDP-DBM approach could potentially identify a set of useful high-level features common to all categories. Finally, to evaluate the performance of DBMs (and DBNs), we follow [14]. Note that using HDPs on top of raw sensory input (i.e. pixels, or even image-specific GIST features) performs far worse compared to HDP-DBM.

5.1 CIFAR-100 dataset

The CIFAR-100 image dataset [8] contains 50,000 training and 10,000 test images of 100 object categories (100 per class), with 32 × 32 × 3 RGB pixels.
Extreme variability in scale, viewpoint, illumination, and cluttered background makes the object recognition task for this dataset quite difficult. Similar to [8], in order to learn good generic low-level features, we first train a two-layer DBM in completely unsupervised fashion using 4 million tiny images^5 [23]. We use a conditional Gaussian distribution to model observed pixel values [8, 6]. The first DBM layer contained 10,000 binary hidden units, and the second layer contained M = 1000 softmax units, each defining a distribution over 10,000 second-layer features^6. We then fit an HDP prior over h^2 to the 100 object classes.

Fig. 2 displays a random subset of the 1st and 2nd layer DBM features, as well as higher-level class-sensitive features, or topics, learned by the HDP model. To visualize a particular higher-level feature, we first sample M words from a fixed topic \phi_t, followed by sampling RGB pixel values from the conditional DBM model. While DBM features capture mostly low-level structure, including edges and corners, the HDP features tend to capture higher-level structure, including contours, shapes, color components, and surface boundaries. More importantly, features at all levels of the hierarchy evolve without incorporating any image-specific priors. Fig. 3 shows a typical partition over the 100 classes that our model learns, with many super-categories containing semantically similar classes.

We next illustrate the ability of the HDP-DBM to generalize from a single training example of a "pear" class. We trained the model on 99 classes containing 500 training images each, but only one training example of the "pear" class. Fig. 4 shows the kind of transfer our model is performing: the model discovers that pears are like apples and oranges, and not like other classes of images, such as dolphins, that reside in very different parts of the hierarchy.
Hence the novel category can inherit the prior distribution over similar high-level shape and color features, allowing the HDP-DBM to generalize considerably better to new instances of the "pear" class.

^5 The dataset contains random images of natural scenes downloaded from the web.
^6 We also experimented with a 3-layer DBM model, as well as various softmax parameters: M = 500 and M = 2000. The difference in performance was not significant.

Figure 4: Left: Training examples along with the eight most probable topics \phi_t (shared HDP high-level shape and color features), ordered by hand. Right: Performance of HDP-DBM, DBM, and SVMs for all object classes when learning with 3 examples, on the CIFAR and Characters datasets. Object categories are sorted by their performance.

Table 1: Classification performance on the test set using 2*AUROC-1. The results in bold correspond to ROCs that are statistically indistinguishable from the best (the difference is not statistically significant).

                 |        CIFAR Dataset         | Handwritten Characters |       Motion Capture
                 |      Number of examples      |   Number of examples   |     Number of examples
Model            | 1     3     5     10    50   | 1     3     5     10   | 1     3     5     10    50
Tuned HDP-DBM    | 0.36  0.41  0.46  0.53  0.62 | 0.67  0.78  0.87  0.93 | 0.67  0.84  0.90  0.93  0.96
HDP-DBM          | 0.34  0.39  0.45  0.52  0.61 | 0.65  0.76  0.85  0.92 | 0.66  0.82  0.88  0.93  0.96
Flat HDP-DBM     | 0.27  0.37  0.42  0.50  0.61 | 0.58  0.73  0.82  0.89 | 0.63  0.79  0.86  0.91  0.96
DBM              | 0.26  0.36  0.41  0.48  0.61 | 0.57  0.72  0.81  0.89 | 0.61  0.79  0.85  0.91  0.95
DBN              | 0.25  0.33  0.37  0.45  0.60 | 0.51  0.72  0.81  0.89 | 0.61  0.79  0.84  0.92  0.96
SVM              | 0.18  0.27  0.31  0.38  0.61 | 0.41  0.66  0.77  0.86 | 0.54  0.78  0.84  0.91  0.96
1-NN             | 0.17  0.18  0.19  0.20  0.32 | 0.43  0.65  0.73  0.81 | 0.58  0.75  0.81  0.88  0.93
GIST             | 0.27  0.31  0.33  0.39  0.58 | -     -     -     -    | -     -     -     -     -

Table 1 quantifies performance using the area under the ROC curve (AUROC) for classifying 10,000 test images as belonging to the novel vs. all other 99 classes (we report 2*AUROC-1, so zero corresponds to a classifier that makes random predictions). The results are averaged over 100 classes using the "leave-one-out" test format. Based on a single example, the HDP-DBM model achieves an AUROC of 0.36, significantly outperforming DBMs, DBNs, SVMs, and 1-NN using standard image-specific GIST features [24], which achieve AUROCs of 0.26, 0.25, 0.18, and 0.27 respectively. Table 1 also shows that fine-tuning parameters of all layers jointly, as well as learning a super-category hierarchy, significantly improves model performance. As the number of training examples increases, the HDP-DBM model still consistently outperforms alternative methods. Fig. 4 further displays the performance of the HDP-DBM, DBM, and SVM models for all object categories when learning with only three examples. Observe that over 40 classes benefit in varying degrees from learning a hierarchy.

5.2 Handwritten Characters

The handwritten characters dataset [9] can be viewed as the "transpose" of MNIST. Instead of containing 60,000 images of 10 digit classes, the dataset contains 30,000 images of 1500 characters (20 examples each) with 28 × 28 pixels. These characters are from 50 alphabets from around the world, including Bengali, Cyrillic, Arabic, Sanskrit, and Tagalog (see Fig. 5). We split the dataset into 15,000 training and 15,000 test images (10 examples of each class).
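The 2*AUROC-1 score used in Table 1 is a rescaled Mann-Whitney statistic: the probability that a random novel-class example outscores a random example from the other classes, mapped so that 0 is chance and 1 is perfect. A small self-contained sketch of computing it (a hypothetical helper, not the paper's evaluation code):

```python
import numpy as np

def auroc(scores_pos, scores_neg):
    # AUROC via the rank-sum (Mann-Whitney U) statistic; ties count 1/2.
    s = np.concatenate([scores_pos, scores_neg])
    order = np.argsort(s, kind="mergesort")
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    for val in np.unique(s):               # average the ranks of tied scores
        tie = s == val
        ranks[tie] = ranks[tie].mean()
    n_pos, n_neg = len(scores_pos), len(scores_neg)
    U = ranks[:n_pos].sum() - n_pos * (n_pos + 1) / 2
    return U / (n_pos * n_neg)

pos = np.array([0.9, 0.8, 0.7])            # scores of novel-class test images
neg = np.array([0.6, 0.4, 0.8])            # scores of all other test images
print(round(2 * auroc(pos, neg) - 1, 4))   # 0.6667: 0 = chance, 1 = perfect
```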
Similar to the CIFAR dataset, we pretrain a two-layer DBM model, with the first layer containing 1000 hidden units and the second layer containing M = 100 softmax units, each defining a distribution over 1000 second layer features.

[Figure 5 panels: Training samples; DBM features (1st layer, 2nd layer); HDP high-level features]
Figure 5: A random subset of the training images along with 1st and 2nd layer DBM features, as well as higher-level class-sensitive HDP features/topics.

Fig. 5 displays a random subset of training images, along with the 1st and 2nd layer DBM features, as well as higher-level class-sensitive HDP features. The HDP features tend to capture higher-level parts, many of which resemble pen "strokes". Table 1 further shows results for classifying 15,000 test images as belonging to the novel vs. all other 1,499 character classes. The HDP-DBM model significantly outperforms the other methods, particularly when learning characters with few training examples.

[Figure 6 panels: Learning with 3 examples — Learned Super-Classes (by row); Sampled Novel Characters; Training Examples; Conditional Samples]
Figure 6: Left: Learned super-classes along with examples of novel characters, generated by the model for the same super-class. Right: Three training examples along with 8 conditional samples.

Fig. 6 further displays learned super-classes along with examples of entirely novel characters that have been generated by the model for the same super-class, as well as conditional samples when learning with only three training examples. (We note that using Deep Belief Networks instead of DBMs produced far inferior generative samples.)
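For reference, the 1-NN baseline that the HDP-DBM is compared against in Table 1 can be sketched along the following lines. The use of raw pixel space and Euclidean distance is an assumption of this sketch; `novel_train` holds the few training examples of the novel class:

```python
import numpy as np

def one_nn_scores(novel_train, test_images):
    """Score each test image by the negative Euclidean distance to its
    nearest training example of the novel class (higher = more novel-like)."""
    train = np.asarray(novel_train, dtype=float).reshape(len(novel_train), -1)
    test = np.asarray(test_images, dtype=float).reshape(len(test_images), -1)
    # Pairwise squared distances via broadcasting: (n_test, n_train)
    d2 = ((test[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    return -np.sqrt(d2.min(axis=1))
```

Ranking these scores against those of the other-class test images traces out the ROC curve from which the baseline's AUROC is computed.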
Remarkably, many samples look realistic, containing coherent, long-range structure, while at the same time being different from the existing training images (see Supplementary Materials for a much richer class of generated samples).

5.3 Motion capture
We next applied our model to human motion capture data consisting of sequences of 3D joint angles plus body orientation and translation [18]. The dataset contains 10 walking styles, including normal, drunk, graceful, gangly, sexy, dinosaur, chicken, old person, cat, and strong. There are 2500 frames of each style at 60fps, where each time step is represented by a vector of 58 real-valued numbers. The dataset was split at random into 1500 training and 1000 test frames per style. We further preprocessed the data by treating each window of 10 consecutive frames as a single 58 × 10 = 580-dimensional data vector. For the two-layer DBM model, the first layer contained 500 hidden units, with the second layer containing M = 50 softmax units, each defining a distribution over 500 second layer features. As expected, Table 1 shows that the HDP-DBM model performs much better than the other models when discriminating between the nine existing walking styles and a novel walking style. The difference is particularly large in the regime where we observe only a handful of training examples of the novel walking style.

6 Conclusions
We developed a compositional architecture that learns an HDP prior over the activities of the top-level features of the DBM model. The resulting compound HDP-DBM model is able to learn low-level features from raw sensory input, high-level features, as well as a category hierarchy for parameter sharing. Our experimental results show that the proposed model can acquire new concepts from very few examples in a diverse set of application domains. The compositional model considered in this paper was directly inspired by the architecture of the DBM and HDP, but it need not be.
Indeed, any other deep learning module, such as a Deep Belief Network or a sparse auto-encoder, and any other hierarchical Bayesian model could be adapted in its place. This perspective opens a space of compositional models that may be more suitable for capturing the human-like ability to learn from few examples.

Acknowledgments: This research was supported by NSERC, ONR (MURI Grant 1015GNA126), ONR N00014-07-1-0937, ARO W911NF-08-1-0242, and Qualcomm.

References
[1] E. Bart, I. Porteous, P. Perona, and M. Welling. Unsupervised learning of visual taxonomies. In CVPR, pages 1-8, 2008.
[2] D. M. Blei, T. L. Griffiths, and M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM, 57(2), 2010.
[3] K. R. Canini and T. L. Griffiths. Modeling human transfer learning with the hierarchical Dirichlet process. In NIPS 2009 Workshop: Nonparametric Bayes, 2009.
[4] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Trans. Pattern Analysis and Machine Intelligence, 28(4):594-611, April 2006.
[5] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
[6] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
[7] C. Kemp, A. Perfors, and J. Tenenbaum. Learning overhypotheses with hierarchical Bayesian models. Developmental Science, 10(3):307-321, 2006.
[8] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Dept. of Computer Science, University of Toronto, 2009.
[9] B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum. One-shot learning of simple visual concepts. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society, 2011.
[10] H. Larochelle, Y.
Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10:1-40, 2009.
[11] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th International Conference on Machine Learning, pages 609-616, 2009.
[12] M. A. Ranzato, Y. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems, 2008.
[13] A. Rodriguez, D. Dunson, and A. Gelfand. The nested Dirichlet process. Journal of the American Statistical Association, 103:1131-1144, 2008.
[14] R. R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 12, 2009.
[15] R. R. Salakhutdinov and G. E. Hinton. Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems, volume 22, 2010.
[16] L. B. Smith, S. S. Jones, B. Landau, L. Gershkoff-Stowe, and L. Samuelson. Object name learning provides on-the-job training for attention. Psychological Science, pages 13-19, 2002.
[17] E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky. Describing visual scenes using transformed objects and parts. International Journal of Computer Vision, 77(1-3):291-330, 2008.
[18] G. Taylor, G. E. Hinton, and S. T. Roweis. Modeling human motion using binary latent variables. In Advances in Neural Information Processing Systems. MIT Press, 2006.
[19] Y. W. Teh and G. E. Hinton. Rate-coded restricted Boltzmann machines for face recognition. In Advances in Neural Information Processing Systems, volume 13, 2001.
[20] Y. W. Teh and M. I. Jordan. Hierarchical Bayesian nonparametric models with applications. In Bayesian Nonparametrics: Principles and Practice.
Cambridge University Press, 2010.
[21] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581, 2006.
[22] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML. ACM, 2008.
[23] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958-1970, 2008.
[24] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[25] F. Xu and J. B. Tenenbaum. Word learning as Bayesian inference. Psychological Review, 114(2), 2007.
[26] L. Younes. On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates, March 17, 2000.