{"title": "Discriminative Transfer Learning with Tree-based Priors", "book": "Advances in Neural Information Processing Systems", "page_first": 2094, "page_last": 2102, "abstract": "This paper proposes a way of improving classification performance for classes which have very few training examples. The key idea is to discover classes which are similar and transfer knowledge among them. Our method organizes the classes into a tree hierarchy. The tree structure can be used to impose a generative prior over classification parameters. We show that these priors can be combined with discriminative models such as deep neural networks. Our method benefits from the power of discriminative training of deep neural networks, at the same time using tree-based generative priors over classification parameters. We also propose an algorithm for learning the underlying tree structure. This gives the model some flexibility to tune the tree so that the tree is pertinent to task being solved. We show that the model can transfer knowledge across related classes using fixed semantic trees. Moreover, it can learn new meaningful trees usually leading to improved performance. Our method achieves state-of-the-art classification results on the CIFAR-100 image data set and the MIR Flickr multimodal data set.", "full_text": "Discriminative Transfer Learning with\n\nTree-based Priors\n\nNitish Srivastava\n\nRuslan Salakhutdinov\n\nDepartment of Computer Science\n\nDepartment of Computer Science and Statistics\n\nUniversity of Toronto\n\nnitish@cs.toronto.edu\n\nUniversity of Toronto\n\nrsalakhu@cs.toronto.edu\n\nAbstract\n\nHigh capacity classi\ufb01ers, such as deep neural networks, often struggle on classes\nthat have very few training examples. We propose a method for improving clas-\nsi\ufb01cation performance for such classes by discovering similar classes and trans-\nferring knowledge among them. Our method learns to organize the classes into\na tree hierarchy. 
This tree structure imposes a prior over the classifier's parameters. We show that the performance of deep neural networks can be improved by applying these priors to the weights in the last layer. Our method combines the strength of discriminatively trained deep neural networks, which typically require large amounts of training data, with tree-based priors, making deep neural networks work well on infrequent classes as well. We also propose an algorithm for learning the underlying tree structure. Starting from an initial pre-specified tree, this algorithm modifies the tree to make it more pertinent to the task being solved, for example, removing semantic relationships in favour of visual ones for an image classification task. Our method achieves state-of-the-art classification results on the CIFAR-100 image data set and the MIR Flickr image-text data set.\n\n1 Introduction\nLearning classifiers that generalize well is a hard problem when only few training examples are available. For example, if we had only 5 images of a cheetah, it would be hard to train a classifier to be good at distinguishing cheetahs against hundreds of other classes, working off pixels alone. Any powerful enough machine learning model would severely overfit the few examples, unless it is held back by strong regularizers. This paper is based on the idea that performance can be improved using the natural structure inherent in the set of classes. For example, we know that cheetahs are related to tigers, lions, jaguars and leopards. Having labeled examples from these related classes should make the task of learning from 5 cheetah examples much easier. Knowing class structure should allow us to borrow "knowledge" from relevant classes so that only the distinctive features specific to cheetahs need to be learned. 
At the very least, the model should confuse cheetahs with these animals rather than with completely unrelated classes, such as cars or lamps. Our aim is to develop methods for transferring knowledge from related tasks towards learning a new task. In the endeavour to scale machine learning algorithms towards AI, it is imperative that we have good ways of transferring knowledge across related problems.\nFinding relatedness is also a hard problem. This is because, in the absence of any prior knowledge, in order to find which classes are related, we should first know what the classes are - i.e., have a good model for each one of them. But to learn a good model, we need to know which classes are related. This creates a cyclic dependency. One way to circumvent it is to use an external knowledge source, such as a human, to specify the class structure by hand. Another way to resolve this dependency is to iteratively learn a model of what the classes are and what relationships exist between them, using one to improve the other. In this paper, we follow this bootstrapping approach.\nThis paper proposes a way of learning class structure and classifier parameters in the context of deep neural networks. The aim is to improve classification accuracy for classes with few examples. Deep neural networks trained discriminatively with back propagation have achieved state-of-the-art performance on difficult classification problems with large amounts of labeled data [2, 14, 15]. The case of smaller amounts of data, or of datasets which contain rare classes, has been relatively less studied. To address this shortcoming, our model augments neural networks with a tree-based prior over the last layer of weights. We structure the prior so that related classes share the same prior. 
This shared prior captures the features that are common across all members of any particular superclass. Therefore, a class with few examples, for which the model would otherwise be unable to learn good features, can now have access to good features just by virtue of belonging to the superclass.\nLearning a hierarchical structure over classes has been extensively studied in the machine learning, statistics, and vision communities. A large class of models based on hierarchical Bayesian models has been used for transfer learning [20, 4, 1, 3, 5]. The hierarchical topic model for image features of Bart et al. [1] can discover visual taxonomies in an unsupervised fashion from large datasets but was not designed for rapid learning of new categories. Fei-Fei et al. [5] also developed a hierarchical Bayesian model for visual categories, with a prior on the parameters of new categories that was induced from other categories. However, their approach is not well-suited as a generic approach to transfer learning because they learned a single prior shared across all categories. A number of models based on hierarchical Dirichlet processes have also been used for transfer learning [23, 17]. However, almost all of the above-mentioned models are generative by nature. These models typically resort to MCMC approaches for inference, which are hard to scale to large datasets. Furthermore, they tend to perform worse than discriminative approaches, particularly as the number of labeled examples increases.\nA large class of discriminative models [12, 25, 11] has also been used for transfer learning, enabling the discovery and sharing of information among related classes. Most similar to our work is [18], which defined a generative prior over the classifier parameters and a prior over tree structures to identify relevant categories. 
However, this work focused on a very specific object detection task and used an SVM model with pre-defined HOG features as its input. In this paper, we demonstrate our method on two different deep architectures: (1) convolutional nets with pixels as input and single-label softmax outputs, and (2) fully connected nets pretrained using deep Boltzmann machines with image features and text tokens as input and multi-label logistic outputs. Our model improves performance over strong baselines in both cases, lending some measure of universality to the approach. In essence, our model learns low-level features, high-level features, as well as a hierarchy over classes in an end-to-end way.\n\n2 Model Description\nLet X = {x1, x2, . . . , xN} be a set of N data points and Y = {y1, y2, . . . , yN} be the set of corresponding labels, where each label yi is a K-dimensional vector of targets. These targets could be binary, one-of-K, or real-valued. In our setting, it is useful to think of each xi as an image and yi as a one-of-K encoding of the label. The model is a multi-layer neural network (see Fig. 1a). Let w denote the set of all parameters of this network (weights and biases for all the layers), excluding the top-level weights, which we denote separately as β ∈ R^{D×K}. Here D represents the number of hidden units in the last hidden layer. The conditional distribution over Y can be expressed as\n\nP(Y|X) = ∫ P(Y|X, w, β) P(w) P(β) dw dβ.    (1)\n\nIn general, this integral is intractable, and we typically resort to MAP estimation to determine the values of the model parameters w and β that maximize\n\nlog P(Y|X, w, β) + log P(w) + log P(β).\n\nHere, log P(Y|X, w, β) is the log-likelihood function and the other terms are priors over the model's parameters. A typical choice of prior is a Gaussian distribution with diagonal covariance:\n\nβk ∼ N(0, (1/λ) I_D),  ∀k ∈ {1, . . . , K}.\n\nHere βk ∈ R^D denotes the classifier parameters for class k. Note that this prior assumes that each βk is independent of all other βi's. In other words, a priori, the weights for label k are not related to any other label's weights.\n\nFigure 1: Our model: A deep neural network with priors over the classification parameters. The priors are derived from a hierarchy over classes. (a) The network maps input x to features fw(x), and top-level weights β produce predictions ŷ. (b) Class weights such as βcar, βtruck and βcheetah, βtiger deviate from the superclass vectors θvehicle and θanimal.\n\nThis is a reasonable assumption when nothing is known about the labels. It works quite well for most applications with a large number of labeled examples per class. However, if we know that the classes are related to one another, priors which respect these relationships may be more suitable. Such priors would be crucial for classes that only have a handful of training examples, since the effect of the prior would be more pronounced. In this work, we focus on developing such a prior.\n\n2.1 Learning With a Fixed Tree Hierarchy\nLet us first assume that the classes have been organized into a fixed tree hierarchy which is available to us. We will relax this assumption later by placing a hierarchical non-parametric prior over the tree structures. For ease of exposition, consider a two-level hierarchy (the model can be easily generalized to deeper hierarchies), as shown in Fig. 1b. There are K leaf nodes corresponding to the K classes. They are connected to S super-classes which group together similar basic-level classes. 
Each leaf node k is associated with a weight vector βk ∈ R^D. Each super-class node s is associated with a vector θs ∈ R^D, s = 1, . . . , S. We define the following generative model for β\n\nθs ∼ N(0, (1/λ1) I_D),    βk ∼ N(θ_parent(k), (1/λ2) I_D).    (2)\n\nThis prior expresses relationships between classes. For example, it asserts that βcar and βtruck are both deviations from θvehicle. Similarly, βcat and βdog are deviations from θanimal. Eq. 1 can now be re-written to include θ as follows\n\nP(Y|X) = ∫ P(Y|X, w, β) P(w) P(β|θ) P(θ) dw dβ dθ.    (3)\n\nWe can perform MAP inference to determine the values of {w, β, θ} that maximize\n\nlog P(Y|X, w, β) + log P(w) + log P(β|θ) + log P(θ).\n\nIn terms of a loss function, we wish to minimize\n\nL(w, β, θ) = − log P(Y|X, w, β) − log P(w) − log P(β|θ) − log P(θ)\n           = − log P(Y|X, w, β) + (λ2/2) ||w||^2 + (λ2/2) Σ_{k=1}^{K} ||βk − θ_parent(k)||^2 + (λ1/2) ||θ||^2.    (4)\n\nNote that by fixing the value of θ = 0, this loss function recovers our standard loss function. The choice of normal distributions in Eq. 2 leads to a nice property: maximization over θ, given β, can be done in closed form. It just amounts to taking a (scaled) average of all βk's which are children of θs. Let Cs = {k | parent(k) = s}. Then\n\nθ*_s = (1 / (|Cs| + λ1/λ2)) Σ_{k∈Cs} βk.    (5)\n\nAlgorithm 1: Procedure for learning the tree.\n1: Given: X, Y, classes K, superclasses S, initial z, L, M.\n2: Initialize w, β.\n3: repeat\n4:   // Optimize w, β with fixed z.\n5:   w, β ← SGD(X, Y, w, β, z) for L steps.\n6:   // Optimize z, β with fixed w.\n7:   for k in RandomPermute(K) do\n8:     for s in S ∪ {s_new} do\n9:       zk ← s\n10:      βs ← SGD(fw(X), Y, β, z) for M steps.\n11:    end for\n12:    s′ ← ChooseBestSuperclass(β1, β2, . . .)\n13:    β ← βs′, zk ← s′, S ← S ∪ {s′}\n14:  end for\n15: until convergence\n(The accompanying illustration shows a leaf k = van being proposed as a child of s = vehicle, s = animal, and a new superclass s_new.)\n\nTherefore, the loss function in Eq. 4 can be optimized by iteratively performing the following two steps. In the first step, we maximize over w and β keeping θ fixed. This can be done using standard stochastic gradient descent (SGD). Then, we maximize over θ keeping β fixed. This can be done in closed form using Eq. 5. In practical terms, the second step is almost instantaneous and only needs to be performed after every T gradient descent steps, where T is around 10-100. Therefore, learning is almost identical to standard gradient descent. It allows us to exploit the structure over labels at a very nominal cost in terms of computational time.\n\n2.2 Learning the Tree Hierarchy\nSo far we have assumed that our model is given a fixed tree hierarchy. Now, we show how the tree structure can be learned during training. 
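As a quick sanity check of the alternating optimization in Section 2.1, the closed-form superclass update of Eq. 5 can be verified against the stationarity condition of the quadratic penalty. The following is a minimal NumPy sketch; the variable names, shapes, and constants are our own illustrative choices, not from the paper.

```python
import numpy as np

# Check the closed-form superclass update (Eq. 5): minimizing
#   (lambda2/2) * sum_{k in C_s} ||beta_k - theta_s||^2 + (lambda1/2) * ||theta_s||^2
# over theta_s gives theta_s* = (sum_{k in C_s} beta_k) / (|C_s| + lambda1/lambda2).

rng = np.random.default_rng(0)
D, n_children = 5, 4            # hypothetical feature dimension and superclass size
lam1, lam2 = 0.5, 2.0           # hypothetical prior precisions
betas = rng.normal(size=(n_children, D))   # beta_k for each child k of superclass s

# Closed-form update from Eq. 5 (a scaled average of the children).
theta_star = betas.sum(axis=0) / (n_children + lam1 / lam2)

# The gradient of the penalty at theta_star should vanish:
#   lambda2 * sum_k (theta - beta_k) + lambda1 * theta = 0
grad = lam2 * (n_children * theta_star - betas.sum(axis=0)) + lam1 * theta_star
assert np.allclose(grad, 0.0)
```

Setting the gradient to zero and solving for θs reproduces Eq. 5, which is why the θ-step costs essentially nothing compared with the SGD step.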
Let z be a K-length vector that specifies the tree structure, that is, zk = s indicates that class k is a child of super-class s. We place a non-parametric Chinese Restaurant Process (CRP) prior over z. This prior P(z) gives the model the flexibility to have any number of superclasses. The CRP prior extends a partition of k classes to a new class by adding the new class either to one of the existing superclasses or to a new superclass. The probability of adding it to superclass s is cs/(k + γ), where cs is the number of children of superclass s. The probability of creating a new superclass is γ/(k + γ). In essence, it prefers to add a new node to an existing large superclass instead of spawning a new one. The strength of this preference is controlled by γ.\nEquipped with the CRP prior over z, the conditional over Y takes the following form\n\nP(Y|X) = Σ_z ( ∫ P(Y|X, w, β) P(w) P(β|θ, z) P(θ) dw dβ dθ ) P(z).    (6)\n\nMAP inference in this model leads to the following optimization problem\n\nmax_{w,β,θ,z} log P(Y|X, w, β) + log P(w) + log P(β|θ, z) + log P(θ) + log P(z).\n\nMaximization over z is problematic because the domain of z is a huge discrete set. Fortunately, this can be approximated using a simple and parallelizable search procedure.\nWe first initialize the tree sensibly. This can be done by hand or by extracting a semantic tree from WordNet [16]. Let the number of superclasses in the tree be S. We optimize over {w, β, θ} for L steps using this tree. Then, a leaf node is picked uniformly at random from the tree and S + 1 tree proposals are generated as follows. S proposals are generated by attaching this leaf node to each of the S superclasses. One additional proposal is generated by creating a new super-class and attaching the label to it. 
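The CRP attachment probabilities described above are simple to compute; the sketch below illustrates them for a toy configuration (this is our own illustration, not the paper's code, and the function name and inputs are hypothetical).

```python
import numpy as np

def crp_attachment_probs(children_counts, gamma):
    """CRP probabilities for where the next class attaches:
    c_s / (k + gamma) for each existing superclass s with c_s children,
    and gamma / (k + gamma) for a brand-new superclass, where k is the
    number of classes already placed in the tree."""
    counts = np.asarray(children_counts, dtype=float)
    k = counts.sum()                          # classes already in the tree
    return np.append(counts, gamma) / (k + gamma)   # last entry = new superclass

# Example: three superclasses with 5, 3 and 2 children, gamma = 1.
p = crp_attachment_probs([5, 3, 2], gamma=1.0)
assert np.isclose(p.sum(), 1.0)   # a valid distribution over S + 1 choices
assert p[0] > p[-1]               # large superclasses beat spawning a new one
```

The "rich get richer" behaviour is visible directly: the largest superclass receives probability 5/11 while a new superclass receives only γ/11, and increasing γ shifts mass towards creating new superclasses.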
This process is shown in Algorithm 1. We then re-estimate {β, θ} for each of these S + 1 trees for a few steps. Note that each of the S + 1 optimization problems can be performed independently, in parallel. The best tree is then picked using a validation set. This process is repeated by picking another node and again trying all possible locations for it. After each node has been picked once and potentially repositioned, we take the resulting tree and go back to optimizing w, β using this newly learned tree in place of the given tree. If the position of any class in the tree did not change during a full pass through all the classes, the hierarchy discovery was said to have converged. When training this model on CIFAR-100, this amounts to interrupting the stochastic gradient descent after every 10,000 steps to find a better tree. The amount of time spent in learning this tree is a small fraction of the total time (about 5%).\n\nFigure 2: Examples from CIFAR-100. Five randomly chosen examples from 8 of the 100 classes are shown (classes shown include whale, dolphin, willow tree, oak tree, lamp, clock, leopard, tiger, ray, flatfish). Classes in each row belong to the same superclass.\n\n3 Experiments on CIFAR-100\nThe CIFAR-100 dataset [13] consists of 32 × 32 color images belonging to 100 classes. These classes are divided into 20 groups of 5 each. For example, the superclass fish contains aquarium fish, flatfish, ray, shark and trout; and superclass flowers contains orchids, poppies, roses, sunflowers and tulips. Some examples from this dataset are shown in Fig. 2. We chose this dataset because it has a large number of classes with a few examples in each, making it ideal for demonstrating the utility of transfer learning. There are only 600 examples of each class, of which 500 are in the training set and 100 in the test set. 
We preprocessed the images by doing global contrast normalization followed\nby ZCA whitening.\n\n3.1 Model Architecture and Training Details\nWe used a convolutional neural network with 3 convolutional hidden layers followed by 2 fully con-\nnected hidden layers. All hidden units used a recti\ufb01ed linear activation function. Each convolutional\nlayer was followed by a max-pooling layer. Dropout [8] was applied to all the layers of the net-\nwork with the probability of retaining a hidden unit being p = (0.9, 0.75, 0.75, 0.5, 0.5, 0.5) for the\ndifferent layers of the network (going from input to convolutional layers to fully connected layers).\nMax-norm regularization [8] was used for weights in both convolutional and fully connected layers.\nThe initial tree was chosen based on the superclass structure given in the data set. We learned a\ntree using Algorithm 1 with L = 10, 000 and M = 100. The \ufb01nal learned tree is provided in the\nsupplementary material.\n\n3.2 Experiments with Few Examples per Class\nIn our \ufb01rst set of experiments, we worked in a scenario where each class has very few examples.\nThe aim was to assess whether the proposed model allows related classes to borrow information\nfrom each other. For a baseline, we used a standard convolutional neural network with the same\narchitecture as our model. This is an extremely strong baseline and already achieved excellent\nresults, outperforming all previously reported results on this dataset as shown in Table 1. We created\n5 subsets of the data by randomly choosing 10, 25, 50, 100 and 250 examples per class, and trained\nfour models on each subset. The \ufb01rst was the baseline. The second was our model using the given\ntree structure (100 classes grouped into 20 superclasses) which was kept \ufb01xed during training. The\nthird and fourth were our models with a learned tree structure. The third one was initialized with\na random tree and the fourth with the given tree. 
The random tree was constructed by drawing a sample from the CRP prior and randomly assigning classes to leaf nodes.\nThe test performance of these models is compared in Fig. 3a. We observe that if the number of examples per class is small, the fixed tree model already provides significant improvement over the baseline. The improvement diminishes as the number of examples increases and eventually the performance falls below the baseline (61.7% vs 62.8%). However, the learned tree model does better. Even with 10 examples per class, it gets an accuracy of 18.52% compared to the baseline model's 12.81% or the fixed tree model's 16.29%. Thus the model can get almost a 50% relative improvement when few examples are available. As the number of examples increases, the relative improvement decreases. However, even for 500 examples per class, the learned tree model improves upon the baseline, achieving a classification accuracy of 63.15%. Note that initializing the model with a random tree decreases model performance, as shown in Table 1.\n\nFigure 3: Classification results on CIFAR-100. Left: Test set classification accuracy for different number of training examples per class. Right: Improvement over the baseline when trained on 10 examples per class. The learned tree models were initialized at the given tree.\n\nTable 1: Classification results on CIFAR-100. All models were trained on the full training set.\nMethod | Test Accuracy %\nConv Net + max pooling | 56.62 ± 0.03\nConv Net + stochastic pooling [24] | 57.49\nConv Net + maxout [6] | 61.43\nConv Net + max pooling + dropout (Baseline) | 62.80 ± 0.08\nBaseline + fixed tree | 61.70 ± 0.06\nBaseline + learned tree (initialized randomly) | 61.20 ± 0.35\nBaseline + learned tree (initialized from given tree) | 63.15 ± 0.15\n\nNext, we analyzed the learned tree model to find the source of the improvements. 
We took the model\ntrained on 10 examples per class and looked at the classi\ufb01cation accuracy separately for each class.\nThe aim was to \ufb01nd which classes gain or suffer the most. Fig. 3b shows the improvement obtained\nby different classes over the baseline, where the classes are sorted by the value of the improvement\nover the baseline. Observe that about 70 classes bene\ufb01t in different degrees from learning a hierarchy\nfor parameter sharing, whereas about 30 classes perform worse as a result of transfer. For the learned\ntree model, the classes which improve most are willow tree (+26%) and orchid (+25%). The classes\nwhich lose most from the transfer are ray (-10%) and lamp (-10%).\nWe hypothesize that the reason why certain classes gain a lot is that they are very similar to other\nclasses within their superclass and thus stand to gain a lot by transferring knowledge. For example,\nthe superclass for willow tree contains other trees, such as maple tree and oak tree. However, ray\nbelongs to superclass \ufb01sh which contains more typical examples of \ufb01sh that are very dissimilar in\nappearance. With the \ufb01xed tree, such transfer hurts performance (ray did worse by -29%). However,\nwhen the tree was learned, this class split away from the \ufb01sh superclass to join a new superclass and\ndid not suffer as much. Similarly, lamp was under household electrical devices along with keyboard\nand clock. Putting different kinds of electrical devices under one superclass makes semantic sense\nbut does not help for visual recognition tasks. This highlights a key limitation of hierarchies based\non semantic knowledge and advocates the need to learn the hierarchy so that it becomes relevant to\nthe task at hand. 
The full learned tree is provided in the supplementary material.\n\nFigure 4: Results on CIFAR-100 with few examples for the dolphin class. Left: Test set classification accuracy for different number of examples. Right: Accuracy when classifying a dolphin as whale or shark is also considered correct.\n\nFigure 5: Some examples from the MIR-Flickr dataset. Each instance in the dataset is an image along with textual tags. Each image has multiple classes.\nClasses | Tags\nbaby, female, people, portrait | claudia\nplant life, river, water | (no text)\nclouds, sea, sky, transport, water | barco, pesca, boattosail, navegação\nanimals, dog, food | watermelon, dog, hilarious, chihuahua\nclouds, sky, structures | colors, cores, centro, commercial, building\n\n3.3 Experiments with Few Examples for One Class\nIn this set of experiments, we worked in a scenario where there are lots of examples for different classes, but only few examples of one particular class. The aim was to see whether the model transfers information from other classes that it has learned to this "rare" class. We constructed training sets by randomly drawing either 5, 10, 25, 50, 100, 250 or 500 examples from the dolphin class and all 500 training examples for the other 99 classes. We trained the baseline, fixed tree and learned tree models with each of these datasets. The objective was kept the same as before and no special attention was paid to the dolphin class. Fig. 4a shows the test accuracy for correctly predicting the dolphin class. We see that transfer learning helped tremendously. 
For example, with 10 cases, the baseline gets 0% accuracy whereas the transfer learning model can get around 3%. Even for 250 cases, the learned tree model gives significant improvements (31% to 34%). We repeated this experiment for classes other than dolphin as well and found similar improvements. See the supplementary material for a more detailed description.\nIn addition to performing well on the class with few examples, we would also want any errors to be sensible. To check if this was indeed the case, we evaluated the performance of the above models treating the classification of dolphin as shark or whale to also be correct, since we believe these to be reasonable mistakes. Fig. 4b shows the classification accuracy under this assumption for different models. Observe that the transfer learning methods provide significant improvements over the baseline. Even when we have just 1 example for dolphin, the accuracy jumps from 45% to 52%.\n\n4 Experiments on MIR Flickr\nThe Multimedia Information Retrieval Flickr Data set [9] consists of 1 million images collected from the social photography website Flickr along with their user-assigned tags. Among the 1 million images, 25,000 have been annotated using 38 labels. These labels include object categories, such as bird, tree and people, as well as scene categories, such as indoor, sky and night. Each image has multiple labels. Some examples are shown in Fig. 5.\nThis dataset is different from CIFAR-100 in many ways. In the CIFAR-100 dataset, our model was trained using image pixels as input and each image belonged to only one class. MIR-Flickr is a multimodal dataset for which we used standard computer vision image features and word counts as inputs. The CIFAR-100 models used a multi-layer convolutional network, whereas for this dataset we use a fully connected neural network initialized by unrolling a Deep Boltzmann Machine (DBM) [19]. 
Moreover, this dataset offers a more natural class distribution where some classes occur more often than others. For example, sky occurs in over 30% of the instances, whereas baby occurs in fewer than 0.4%. We also used 975,000 unlabeled images for unsupervised training of the DBM. We use the publicly available features and train-test splits from [21].\n\nFigure 6: Results on MIR Flickr. (a) Class-wise improvement: improvement in Average Precision over the baseline for different methods. (b) Improvement vs. number of examples: improvement of the learned tree model over the baseline for different classes, along with the fraction of test cases which contain that class. Each dot corresponds to a class. Classes with few examples (towards the left of the plot) usually get significant improvements.\n\nTable 2: Mean Average Precision obtained by different models on the MIR-Flickr data set.\nMethod | MAP\nLogistic regression on Multimodal DBM [21] | 0.609\nMultiple Kernel Learning SVMs [7] | 0.623\nTagProp [22] | 0.640\nMultimodal DBM + finetuning + dropout (Baseline) | 0.641 ± 0.004\nBaseline + fixed tree | 0.648 ± 0.004\nBaseline + learned tree (initialized from given tree) | 0.651 ± 0.005\n\n4.1 Model Architecture and Training Details\nIn order to make our results directly comparable to [21], we used the same network architecture as described therein. The authors of the dataset [10] provided a high-level categorization of the classes, which we use to create an initial tree. This tree structure and the one learned by our model are shown in the supplementary material. 
We used Algorithm 1 with L = 500 and M = 100.\n\n4.2 Classi\ufb01cation Results\nFor a baseline we used a Multimodal DBM model after \ufb01netuning it discriminatively with dropout.\nThis model already achieves state-of-the-art results, making it a very strong baseline. The results\nof the experiment are summarized in Table 2. The baseline achieved a MAP of 0.641, whereas our\nmodel with a \ufb01xed tree improved this to 0.647. Learning the tree structure further pushed this up to\n0.651. For this dataset, the learned tree was not signi\ufb01cantly different from the given tree. Therefore,\nwe expected the improvement from learning the tree to be marginal. However, the improvement over\nthe baseline was signi\ufb01cant, showing that transferring information between related classes helped.\nLooking closely at the source of gains, we found that similar to CIFAR-100, some classes gain\nand others lose as shown in Fig. 6a. It is encouraging to note that classes which occur rarely in\nthe dataset improve the most. This can be seen in Fig. 6b which plots the improvements of the\nlearned tree model over the baseline against the fraction of test instances that contain that class. For\nexample, the average precision for baby which occurs in only 0.4% of the test cases improves from\n0.173 (baseline) to 0.205 (learned tree). This class borrows from people and portrait both of which\noccur very frequently. The performance on sky which occurs in 31% of the test cases stays the same.\n\n5 Conclusion\nWe proposed a model that augments standard neural networks with tree-based priors over the classi-\n\ufb01cation parameters. These priors follow the hierarchical structure over classes and enable the model\nto transfer knowledge from related classes. We also proposed a way of learning the hierarchical\nstructure. 
Experiments show that the model achieves excellent results on two challenging datasets.

References
[1] E. Bart, I. Porteous, P. Perona, and M. Welling. Unsupervised learning of visual taxonomies. In CVPR, pages 1–8, 2008.
[2] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. Large-Scale Kernel Machines, 2007.
[3] Hal Daumé III. Bayesian multitask learning with latent hierarchies. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI '09), pages 135–142. AUAI Press, 2009.
[4] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In ACM SIGKDD, 2004.
[5] Li Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Trans. Pattern Analysis and Machine Intelligence, 28(4):594–611, April 2006.
[6] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1319–1327, 2013.
[7] M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semi-supervised learning for image classification. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 902–909, June 2010.
[8] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
[9] Mark J. Huiskes and Michael S. Lew. The MIR Flickr retrieval evaluation.
In MIR '08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval. ACM, 2008.
[10] Mark J. Huiskes, Bart Thomee, and Michael S. Lew. New trends and ideas in visual concept detection: the MIR Flickr retrieval evaluation initiative. In Multimedia Information Retrieval, pages 527–536, 2010.
[11] Zhuoliang Kang, Kristen Grauman, and Fei Sha. Learning with whom to share in multi-task feature learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 521–528. ACM, June 2011.
[12] Seyoung Kim and Eric P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In ICML, pages 543–550, 2010.
[13] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25. MIT Press, 2012.
[15] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th International Conference on Machine Learning, pages 609–616, 2009.
[16] George A. Miller. WordNet: a lexical database for English. Commun. ACM, 38(11):39–41, November 1995.
[17] R. Salakhutdinov, J. Tenenbaum, and A. Torralba. Learning to learn with compound hierarchical-deep models. In NIPS. MIT Press, 2011.
[18] R. Salakhutdinov, A. Torralba, and J. Tenenbaum. Learning to share visual appearance for multiclass object detection. In CVPR, 2011.
[19] R. R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines.
In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 12, 2009.
[20] Babak Shahbaba and Radford M. Neal. Improving classification when a class hierarchy is available using a hierarchy-based prior. Bayesian Analysis, 2(1):221–238, 2007.
[21] Nitish Srivastava and Ruslan Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems 25, pages 2231–2239. MIT Press, 2012.
[22] Jakob Verbeek, Matthieu Guillaumin, Thomas Mensink, and Cordelia Schmid. Image annotation with TagProp on the MIRFLICKR set. In 11th ACM International Conference on Multimedia Information Retrieval (MIR '10), pages 537–546. ACM Press, 2010.
[23] Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. Multi-task learning for classification with Dirichlet process priors. J. Mach. Learn. Res., 8:35–63, May 2007.
[24] Matthew D. Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolutional neural networks. CoRR, abs/1301.3557, 2013.
[25] Alon Zweig and Daphna Weinshall. Hierarchical regularization cascade for joint learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 37–45, May 2013.