{"title": "Zero-Shot Learning Through Cross-Modal Transfer", "book": "Advances in Neural Information Processing Systems", "page_first": 935, "page_last": 943, "abstract": "This work introduces a model that can recognize objects in images even if no training data is available for the object class. The only necessary knowledge about unseen categories comes from unsupervised text corpora.  Unlike previous zero-shot learning models, which can only differentiate between unseen classes, our model can operate on a mixture of objects, simultaneously obtaining state of the art performance on classes with thousands of training images and reasonable performance on unseen classes.  This is achieved by seeing the distributions of words in texts as a semantic space for understanding what objects look like. Our deep learning model does not require any manually defined semantic or visual features for either words or images.  Images are mapped to be close to semantic word vectors corresponding to their classes, and the resulting image embeddings can be used to distinguish whether an image is of a seen or unseen class. Then, a separate recognition model can be employed for each type. We demonstrate two strategies, the first gives high accuracy on unseen classes, while the second is conservative in its prediction of novelty and keeps the seen classes' accuracy high.", "full_text": "Zero-Shot Learning Through Cross-Modal Transfer\n\nRichard Socher, Milind Ganjoo, Christopher D. Manning, Andrew Y. Ng\nComputer Science Department, Stanford University, Stanford, CA 94305, USA\n\nrichard@socher.org, {mganjoo, manning}@stanford.edu, ang@cs.stanford.edu\n\nAbstract\n\nThis work introduces a model that can recognize objects in images even if no\ntraining data is available for the object class. The only necessary knowledge about\nunseen visual categories comes from unsupervised text corpora. 
Unlike previous\nzero-shot learning models, which can only differentiate between unseen classes,\nour model can operate on a mixture of seen and unseen classes, simultaneously\nobtaining state of the art performance on classes with thousands of training\nimages and reasonable performance on unseen classes. This is achieved by seeing\nthe distributions of words in texts as a semantic space for understanding what\nobjects look like. Our deep learning model does not require any manually defined\nsemantic or visual features for either words or images. Images are mapped to be\nclose to semantic word vectors corresponding to their classes, and the resulting\nimage embeddings can be used to distinguish whether an image is of a seen or\nunseen class. We then use novelty detection methods to differentiate unseen classes\nfrom seen classes. We demonstrate two novelty detection strategies; the first gives\nhigh accuracy on unseen classes, while the second is conservative in its prediction\nof novelty and keeps the seen classes\u2019 accuracy high.\n\n1 Introduction\nThe ability to classify instances of an unseen visual class, called zero-shot learning, is useful in\nseveral situations. There are many species and products without labeled data and new visual categories,\nsuch as the latest gadgets or car models, that are introduced frequently. In this work, we show how\nto make use of the vast amount of knowledge about the visual world available in natural language\nto classify unseen objects. We attempt to model people\u2019s ability to identify unseen objects even if\nthe only knowledge about that object came from reading about it. 
For instance, after reading the\ndescription of a two-wheeled self-balancing electric vehicle, controlled by a stick, with which you\ncan move around while standing on top of it, many would be able to identify a Segway, possibly after\nbeing brie\ufb02y perplexed because the new object looks different from previously observed classes.\nWe introduce a zero-shot model that can predict both seen and unseen classes. For instance, without\never seeing a cat image, it can determine whether an image shows a cat or a known category from\nthe training set such as a dog or a horse. The model is based on two main ideas.\nFig. 1 illustrates the model. First, images are mapped into a semantic space of words that is learned\nby a neural network model [15]. Word vectors capture distributional similarities from a large, unsu-\npervised text corpus. By learning an image mapping into this space, the word vectors get implicitly\ngrounded by the visual modality, allowing us to give prototypical instances for various words. Sec-\nond, because classi\ufb01ers prefer to assign test images into classes for which they have seen training\nexamples, the model incorporates novelty detection which determines whether a new image is on the\nmanifold of known categories. If the image is of a known category, a standard classi\ufb01er can be used.\nOtherwise, images are assigned to a class based on the likelihood of being an unseen category. We\nexplore two strategies for novelty detection, both of which are based on ideas from outlier detection\nmethods. The \ufb01rst strategy prefers high accuracy for unseen classes, the second for seen classes.\nUnlike previous work on zero-shot learning which can only predict intermediate features or differ-\nentiate between various zero-shot classes [21, 27], our joint model can achieve both state of the art\naccuracy on known classes as well as reasonable performance on unseen classes. 
Furthermore, compared\nto related work on knowledge transfer [21, 28] we do not require manually defined semantic\nor visual attributes for the zero-shot classes, allowing us to use state-of-the-art unsupervised and\nunaligned image features instead along with unsupervised and unaligned language corpora.\n\nFigure 1: Overview of our cross-modal zero-shot model. We first map each new testing image into\na lower dimensional semantic word vector space. Then, we determine whether it is on the manifold\nof seen images. If the image is \u2018novel\u2019, meaning not on the manifold, we classify it with the help of\nunsupervised semantic word vectors. In this example, the unseen classes are truck and cat.\n\n2 Related Work\nWe briefly outline connections and differences to five related lines of research. Due to space\nconstraints, we cannot do justice to the complete literature.\nZero-Shot Learning. The work most similar to ours is that by Palatucci et al. [27]. They map fMRI\nscans of people thinking about certain words into a space of manually designed features and then\nclassify using these features. They are able to predict semantic features even for words for which\nthey have not seen scans and experiment with differentiating between several zero-shot classes.\nHowever, they do not classify new test instances into both seen and unseen classes. We extend their\napproach to allow for this setup using novelty detection. Lampert et al. [21] construct a set of binary\nattributes for the image classes that convey various visual characteristics, such as \u201cfurry\u201d and \u201cpaws\u201d\nfor bears and \u201cwings\u201d and \u201cflies\u201d for birds. Later, in section 6.4, we compare our method to their\nmethod of performing Direct Attribute Prediction (DAP).\nOne-Shot Learning. One-shot learning [19, 20] seeks to learn a visual object class by using very few\ntraining examples. 
This is usually achieved by sharing feature representations [2] or model\nparameters [12], or via similar context [14]. A recent related work on one-shot learning is that of\nSalakhutdinov et al. [29]. Similar to their work, our model is based on using deep learning\ntechniques to learn low-level image features followed by a probabilistic model to transfer knowledge,\nwith the added advantage of needing no training data due to the cross-modal knowledge transfer\nfrom natural language.\nKnowledge and Visual Attribute Transfer. Lampert et al. and Farhadi et al. [21, 10] were two\nof the first to use well-designed visual attributes of unseen classes to classify them. This differs\nfrom our setting since we only have distributional features of words learned from unsupervised,\nnon-parallel corpora and can classify between categories that have thousands or zero training images. Qi\net al. [28] learn when to transfer knowledge from one category to another for each instance.\nDomain Adaptation. Domain adaptation is useful in situations in which there is a lot of training\ndata in one domain but little to none in another. For instance, in sentiment analysis one could train a\nclassifier for movie reviews and then adapt from that domain to book reviews [4, 13]. While related,\nthis line of work is different since there is data for each class but the features may differ between\ndomains.\nMultimodal Embeddings. Multimodal embeddings relate information from multiple sources such\nas sound and video [25] or images and text. Socher et al. [31] project words and image regions into a\ncommon space using kernelized canonical correlation analysis to obtain state of the art performance\nin annotation and segmentation. Similar to our work, they use large unsupervised text corpora to\nlearn semantic word representations. 
Their model does, however, require a small amount of training data\nfor each class. Some work has been done on multimodal distributional methods [11, 23].\nMost recently, Bruni et al. [5] worked on perceptually grounding word meaning and showed that\njoint models are better able to predict the color of concrete objects.\n3 Word and Image Representations\nWe begin the description of the full framework with the feature representations of words and images.\nDistributional approaches are very common for capturing semantic similarity between words. In\nthese approaches, words are represented as vectors of distributional characteristics \u2013 most often their\nco-occurrences with words in context [26, 9, 1, 32]. These representations have proven very effective\nin natural language processing tasks such as sense disambiguation [30], thesaurus extraction [24, 8]\nand cognitive modeling [22].\nUnless otherwise mentioned, all word vectors are initialized with pre-trained d = 50-dimensional\nword vectors from the unsupervised model of Huang et al. [15]. Using free Wikipedia text, their\nmodel learns word vectors by predicting how likely it is for each word to occur in its context. Their\nmodel uses both local context in the window around each word and global document context, thus\ncapturing distributional syntactic and semantic information. For further details and evaluations of\nthese embeddings, see [3, 7].\nWe use the unsupervised method of Coates et al. [6] to extract I image features from raw pixels.\nEach image is henceforth represented by a vector x \u2208 R^I.\n4 Projecting Images into Semantic Word Spaces\nIn order to learn semantic relationships and class membership of images we project the image feature\nvectors into the d-dimensional, semantic word space F. During training and testing, we consider\na set of classes Y. 
Some of the classes y in this set will have available training data, others will\nbe zero-shot classes without any training data. We define the former as the seen classes Ys and the\nlatter as the unseen classes Yu. Let W = Ws \u222a Wu be the set of word vectors in R^d for both seen\nand unseen visual classes, respectively.\nAll training images x(i) \u2208 Xy of a seen class y \u2208 Ys are mapped to the word vector wy corresponding\nto the class name. To train this mapping, we train a neural network to minimize the following\nobjective function:\n\nJ(\u0398) = \u2211_{y \u2208 Ys} \u2211_{x(i) \u2208 Xy} || wy \u2212 \u03b8(2) f(\u03b8(1) x(i)) ||^2,    (1)\n\nwhere \u03b8(1) \u2208 R^(h\u00d7I), \u03b8(2) \u2208 R^(d\u00d7h) and the standard nonlinearity f = tanh. We define \u0398 =\n(\u03b8(1), \u03b8(2)). A two-layer neural network is shown to outperform a single linear mapping in the\nexperiments section below. The cost function is minimized with standard backpropagation and L-BFGS.\nBy projecting images into the word vector space, we implicitly extend the semantics with a visual\ngrounding, allowing us to query the space, for instance for prototypical visual instances of a word.\nFig. 2 shows a visualization of the 50-dimensional semantic space with word vectors and images\nof both seen and unseen classes. The unseen classes are cat and truck. The mapping from 50 to 2\ndimensions was done with t-SNE [33]. We can observe that most classes are tightly clustered around\ntheir corresponding word vector while the zero-shot classes (cat and truck for this mapping) do not\nhave close-by vectors. However, the images of the two zero-shot classes are close to semantically\nsimilar classes (such as in the case of cat, which is close to dog and horse but is far away from car\nor ship). 
This observation motivated the idea of first detecting images of unseen classes and then\nclassifying them to the zero-shot word vectors.\n\nFigure 2: T-SNE visualization of the semantic word space. Word vector locations are highlighted\nand mapped image locations are shown both for images for which this mapping has been trained and\nunseen images. The unseen classes are cat and truck.\n\n5 Zero-Shot Learning Model\nIn this section we first give an overview of our model and then describe each of its components.\nIn general, we want to predict p(y|x), the conditional probability for both seen and unseen classes\ny \u2208 Ys \u222a Yu given an image from the test set x \u2208 Xt. To achieve this we will employ the semantic\nvectors f \u2208 Ft to which these images have been mapped.\nBecause standard classifiers will never predict a class that has no training examples, we introduce\na binary novelty random variable V \u2208 {s, u} which indicates whether an image is in a seen or unseen class.\nLet Xs be the set of all feature vectors for training images of seen classes and Fs their\ncorresponding semantic vectors. We similarly define Fy to be the semantic vectors of class y. We\npredict a class y for a new input image x and its mapped semantic vector f via:\n\np(y|x, Xs, Fs, W, \u03b8) = \u2211_{V \u2208 {s,u}} P(y|V, x, Xs, Fs, W, \u03b8) P(V|x, Xs, Fs, W, \u03b8).\n\nMarginalizing out the novelty variable V allows us to first distinguish between seen and unseen\nclasses. Each type of image can then be classified differently. The seen image classifier can be a\nstate of the art softmax classifier while the unseen classifier can be a simple Gaussian discriminator.\n5.1 Strategies for Novelty Detection\nWe now consider two strategies for predicting whether an image is of a seen or unseen class. The\nterm P(V = u|x, Xs, Fs, W, \u03b8) is the probability of an image being in an unseen class. 
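As a rough illustration of the marginalization over V, the following Python sketch mixes the outputs of a seen-class classifier and an unseen-class classifier using a novelty probability. The function name combine_predictions and all numbers are hypothetical placeholders, not values from the experiments.

```python
def combine_predictions(p_unseen, p_y_given_seen, p_y_given_unseen):
    # p(y|x) = P(y|V=s, x) * P(V=s|x) + P(y|V=u, x) * P(V=u|x)
    p_seen = 1.0 - p_unseen
    combined = {}
    for label, p in p_y_given_seen.items():
        combined[label] = combined.get(label, 0.0) + p * p_seen
    for label, p in p_y_given_unseen.items():
        combined[label] = combined.get(label, 0.0) + p * p_unseen
    return combined


# Hypothetical outputs for one test image.
posterior = combine_predictions(
    p_unseen=0.25,
    p_y_given_seen={'dog': 0.75, 'horse': 0.25},
    p_y_given_unseen={'cat': 0.5, 'truck': 0.5},
)
best = max(posterior, key=posterior.get)
```

With a hard novelty detector, p_unseen is simply 0 or 1 and the sum reduces to picking one of the two classifiers.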
An image\nfrom an unseen class will not be very close to the existing training images but will still be roughly\nin the same semantic region. For instance, cat images are closest to dogs even though they are not\nas close to the dog word vector as most dog images are. Hence, at test time, we can use outlier\ndetection methods to determine whether an image is in a seen or unseen class.\nWe compare two strategies for outlier detection. Both are computed on the manifold of training\nimages that were mapped to the semantic word space. The first method is relatively liberal in its\nassessment of novelty. It uses simple thresholds on the marginals assigned to each image under\nisometric, class-specific Gaussians. The mapped points of seen classes are used to obtain this marginal.\nFor each seen class y \u2208 Ys, we compute P(x|Xy, wy, Fy, \u03b8) = P(f|Fy, wy) = N(f|wy, \u03a3y). The\nGaussian of each class is parameterized by the corresponding semantic word vector wy for its mean\nand a covariance matrix \u03a3y that is estimated from all the mapped training points with that label. We\nrestrict the Gaussians to be isometric to prevent overfitting. For a new image x, the outlier detector\nthen becomes the indicator function that is 1 if the marginal probability is below a certain threshold\nTy for all the classes:\n\nP(V = u|f, Xs, W, \u03b8) := 1{\u2200y \u2208 Ys : P(f|Fy, wy) < Ty}\n\nWe provide an experimental analysis for various thresholds T below. The thresholds are selected\nso that at least some fraction of the vectors from training images lie above threshold, that is, are\nclassified as a seen class. Intuitively, smaller thresholds result in fewer images being labeled as\nunseen. 
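The thresholding rule can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: it assumes spherical (isometric) Gaussians with one variance per class, works with log densities for numerical convenience, and uses made-up 2-dimensional vectors and thresholds in place of the 50-dimensional word vectors.

```python
import math


def log_isotropic_gaussian(f, mean, variance):
    # Log density of N(f | mean, variance * I) in d dimensions.
    d = len(f)
    sq_dist = sum((a - b) ** 2 for a, b in zip(f, mean))
    return -0.5 * (d * math.log(2.0 * math.pi * variance) + sq_dist / variance)


def is_unseen(f, class_models, log_thresholds):
    # Indicator from the text: true only if the density under every
    # seen class falls below that class's threshold T_y.
    return all(
        log_isotropic_gaussian(f, mean, var) < log_thresholds[y]
        for y, (mean, var) in class_models.items()
    )


# Hypothetical class models: label -> (word vector used as mean, variance).
class_models = {'dog': ([0.0, 0.0], 1.0), 'car': ([10.0, 0.0], 1.0)}
log_thresholds = {'dog': -6.0, 'car': -6.0}

near_dog = is_unseen([0.5, 0.0], class_models, log_thresholds)
far_from_all = is_unseen([5.0, 5.0], class_models, log_thresholds)
```

Lowering a log threshold makes it harder for a point to fall below it, so fewer images are flagged as unseen.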
The main drawback of this method is that it does not give a real probability for an outlier.\nAn alternative would be to use the method of [17] to obtain an actual outlier probability in an\nunsupervised way. Then, we can obtain the conditional class probability using a weighted combination\nof classifiers for both seen and unseen classes (described below). Fig. 2 shows that many unseen\nimages are not technically outliers of the complete data manifold. Hence this method is very\nconservative in its assignment of novelty and therefore preserves high accuracy for seen classes.\nWe need to slightly modify the original approach since we distinguish between training and test\nsets. We do not want to use the set of all test images since they would then not be considered\noutliers anymore. The modified version has the same two parameters: k = 20, the number of\nnearest neighbors that are considered to determine whether a point is an outlier and \u03bb = 3, which\ncan be roughly seen as a multiplier on the standard deviation. The larger it is, the more a point has\nto deviate from the mean in order to be considered an outlier.\nFor each point f \u2208 Ft, we define a context set C(f) \u2286 Fs of k nearest neighbors in the training set\nof seen categories. We can compute the probabilistic set distance pdist of each point f to the points\nin C(f):\n\npdist\u03bb(f, C(f)) = \u03bb \u221a( \u2211_{q \u2208 C(f)} d(f, q)^2 / |C(f)| ),\n\nwhere d(f, q) defines some distance function in the word space. We use Euclidean distances. Next\nwe define the local outlier factor:\n\nlof\u03bb(f) = pdist\u03bb(f, C(f)) / E_{q\u223cC(f)}[pdist\u03bb(f, C(q))] \u2212 1.\n\nLarge lof values indicate increasing outlierness. 
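A minimal Python sketch of pdist and lof follows. It is written under stated assumptions: the denominator is read as the average of pdist(q, C(q)) over the neighbors q in C(f), as in the LoOP method of [17]; the context set of a training point excludes the point itself; and the toy data, k = 2 and the helper names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def context_set(point, train, k):
    # k nearest training points; a training point is not its own neighbor.
    others = [q for q in train if q is not point]
    return sorted(others, key=lambda q: euclidean(point, q))[:k]


def pdist(point, context, lam):
    # lam times the root mean squared distance to the context set.
    mean_sq = sum(euclidean(point, q) ** 2 for q in context) / len(context)
    return lam * math.sqrt(mean_sq)


def lof(point, train, k, lam):
    ctx = context_set(point, train, k)
    # Assumed reading: average pdist of each neighbor to its own context set.
    expected = sum(pdist(q, context_set(q, train, k), lam) for q in ctx) / len(ctx)
    return pdist(point, ctx, lam) / expected - 1.0


# Hypothetical mapped training points of seen classes (2-d for readability).
train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
inlier_score = lof([0.5, 0.5], train, k=2, lam=3.0)
outlier_score = lof([5.0, 5.0], train, k=2, lam=3.0)
```

A point inside the training cloud gets a score at or below zero, while a distant point gets a clearly positive score.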
In order to obtain a probability, we next define a\nnormalization factor Z that can be seen as a kind of standard deviation of lof values in the training\nset of seen classes:\n\nZ\u03bb(Fs) = \u03bb \u221a( E_{q\u223cFs}[(lof(q))^2] ).\n\nNow, we can define the Local Outlier Probability:\n\nLoOP(f) = max{0, erf( lof\u03bb(f) / Z\u03bb(Fs) )},    (2)\n\nwhere erf is the Gauss Error function. This probability can now be used to weigh the seen and unseen\nclassifiers by the appropriate amount given our belief about the outlierness of a new test image.\n5.2 Classification\nIn the case where V = s, i.e., the point is considered to be of a known class, we can use any\nprobabilistic classifier for obtaining P(y|V = s, x, Xs). We use a softmax classifier on the original\nI-dimensional features. For the zero-shot case where V = u we assume an isometric Gaussian\ndistribution around each of the novel class word vectors and assign classes based on their likelihood.\n6 Experiments\nFor most of our experiments we utilize the CIFAR-10 dataset [18]. The dataset has 10 classes, each\nwith 5,000 32 \u00d7 32 \u00d7 3 RGB images. We use the unsupervised feature extraction method of Coates\nand Ng [6] to obtain a 12,800-dimensional feature vector for each image. For word vectors, we use\na set of 50-dimensional word vectors from the Huang dataset [15] that correspond to each CIFAR\ncategory. During training, we omit two of the 10 classes and reserve them for zero-shot analysis.\nThe remaining categories are used for training.\nIn this section we first analyze the classification performance for seen classes and unseen classes\nseparately. Then, we combine images from the two types of classes, and discuss the trade-offs\ninvolved in our two unseen class detection strategies. 
Next, the overall performance of the entire\nclassification pipeline is summarized and compared to another popular approach by Lampert et al.\n[21]. Finally, we run a few additional experiments to assess quality and robustness of our model.\n6.1 Seen and Unseen Classes Separately\nFirst, we evaluate the classification accuracy when presented only with images from classes that\nhave been used in training. We train a softmax classifier to label one of 8 classes from CIFAR-10\n(2 are reserved for zero-shot learning). In this case, we achieve an accuracy of 82.5% on the set of\nclasses excluding cat and truck, which closely matches the SVM-based classification results in the\noriginal Coates and Ng paper [6] that used all 10 classes.\n\nFigure 4: Comparison of accuracies for images from previously seen and unseen categories when\nunseen images are detected under the (a) Gaussian threshold model, (b) LoOP model. The average\naccuracy on all images is shown in (c) for both models. We also show a line corresponding to the\nsingle accuracy achieved in the Bayesian pipeline. In these examples, the zero-shot categories are\n\u201ccat\u201d and \u201ctruck\u201d.\n\nWe now focus on classification between only two zero-shot classes. In this case, the classification is\nbased on isometric Gaussians which amounts to simply comparing distances between word vectors\nof unseen classes and an image mapped into semantic space. In this case, the performance is good\nif there is at least one seen class similar to the zero-shot class. For instance, when cat and dog are\ntaken out from training, the resulting zero-shot classification does not work well because none of the\nother 8 categories is similar enough to both images to learn a good semantic distinction. 
On the other\nhand, if cat and truck are taken out, then the cat vectors can be mapped to the word space thanks to\nsimilarities to dogs and trucks can be distinguished thanks to car, yielding better performance.\nFig. 3 shows the accuracy achieved in distinguishing images belonging to various combinations\nof zero-shot classes. We observe, as expected, that the maximum accuracy is achieved\nwhen choosing semantically distinct categories.\nFor instance, frog-truck and cat-truck do very\nwell. The worst accuracy is obtained when cat\nand dog are chosen instead. From the figure we\nsee that for certain combinations of zero-shot\nclasses, we can achieve accuracies up to 90%.\n\nFigure 3: Visualization of classification accuracy\nachieved for unseen images, for different choices\nof zero-shot classes selected before training.\n[Plot: x-axis: pair of zero-shot classes used (cat-dog, plane-auto, auto-deer, deer-ship, cat-truck); y-axis: zero-shot accuracy.]\n\n6.2 Influence of Novelty Detectors on Average Accuracy\nOur next area of investigation is to determine\nthe average performance of the classifier for\nthe overall dataset that includes both seen and\nunseen images. We compare the performance\nwhen each image is passed through either of the two novelty detectors which decide with a certain\nprobability (in the second scenario) whether an image belongs to a class that was used in training.\nDepending on this choice, the image is either passed through the softmax classifier for seen category\nimages, or assigned to the class of the nearest semantic word vector for unseen category images.\nFig. 4 shows the accuracies for test images for different choices made by the two scenarios for\nnovelty detection. The test set includes an equal number of images from each category, with 8\ncategories having been seen before, and 2 being new. We plot the accuracies of the two types\nof images separately for comparison. Firstly, at the left extreme of the curve, the Gaussian unseen\nimage detector treats all of the images as unseen, and the LoOP model takes the probability threshold\nfor an image being unseen to be 0. 
At this point, with all unseen images in the test set being treated\nas such, we achieve the highest accuracies, at 90% for this zero-shot pair. Similarly, at the other\nextreme of the curve, all images are classified as belonging to a seen category, and hence the softmax\nclassifier for seen images gives the best possible accuracy for these images.\n[Figure 4 plot: panels (a) Gaussian model, x-axis: fraction of points classified as unseen; (b) LoOP model, x-axis: outlier probability threshold; (c) Comparison, x-axis: fraction unseen/outlier threshold; y-axis: accuracy; curves: seen classes, unseen classes; annotated values 0.58667 and 0.6557.]\nBetween the extremes, the curves for unseen image accuracies and seen image accuracies fall and\nrise at different rates. Since the Gaussian model is liberal in designating an image as belonging to an\nunseen category, it treats more of the images as unseen, and hence we continue to get high unseen\nclass accuracies along the curve. The LoOP model, which tries to detect whether an image could\nbe regarded as an outlier for each class, does not assign very high outlier probabilities to zero-shot\nimages due to a large number of them being spread inside the manifold of seen images (see Fig. 2\nfor a 2-dimensional visualization of the originally 50-dimensional space). Thus, it continues to treat\nthe majority of images as seen, leading to high seen class accuracies. 
Hence, the LoOP model can\nbe used in scenarios where one does not want to degrade the high performance on classes from the\ntraining set but still wants to allow for the possibility of unseen classes.\nWe also see from Fig. 4 (c) that since most images in the test set belong to previously seen categories,\nthe LoOP model, which is conservative in assigning the unseen label, gives better overall accuracies\nthan the Gaussian model. In general, we can choose an acceptable threshold for seen class accuracy\nand achieve a corresponding unseen class accuracy. For example, at 70% seen class accuracy in the\nGaussian model, unseen classes can be classified with accuracies of between 15% and 30%, depending\non the class. Random chance is 10%.\n6.3 Combining predictions for seen and unseen classes\nThe final step in our experiments is to perform the full Bayesian pipeline as defined by Equation 2.\nWe obtain a prior probability of an image being an outlier. The LoOP model outputs a probability\nfor the image instance being an outlier, which we use directly. For the Gaussian threshold model, we\ntune a cutoff fraction for log probabilities beyond which images are classified as outliers. We assign\nprobabilities 0 and 1 to either side of this threshold. We show the horizontal lines corresponding to\nthe overall accuracy for the Bayesian pipeline in Figure 4.\n6.4 Comparison to attribute-based classification\nTo establish a context for comparing our model performance, we also run the attribute-based\nclassification approach outlined by Lampert et al. [21]. We construct an attribute set of 25 attributes\nhighlighting different aspects of the CIFAR-10 dataset, with certain aspects dealing with animal-based\nattributes, and others dealing with vehicle-based attributes. We train each binary attribute classifier\nseparately, and use the trained classifiers to construct attribute labels for unseen classes. 
Finally,\nwe use MAP prediction to determine the final output class. The table below shows a summary of\nresults. Our overall accuracies for both models outperform the attribute-based model.\n\nBayesian pipeline (Gaussian)      74.25%\nBayesian pipeline (LoOP)          65.31%\nAttribute-based (Lampert et al.)  45.25%\n\nIn general, an advantage of our approach is the ability to adapt to a domain quickly, which is difficult\nin the case of the attribute-based model, since appropriate attribute types need to be carefully picked.\n6.5 Novelty detection in original feature space\nThe analysis of novelty detectors in 6.2 involves calculation\nin the word space. As a comparison, we perform the\nsame experiments with the Gaussian model in the original\nfeature space. In the mapped space, we observe that\nof the 100 images assigned the highest probability of being\nan outlier, 12% of those images are false positives. On\nthe other hand, in the original feature space, the false positive\nrate increases to 78%. This is intuitively explained\nby the fact that the mapping function gathers extra semantic\ninformation from the word vectors it is trained on, and\nimages are able to cluster better around these assumed\nGaussian centroids. In the original space, there is no semantic\ninformation, and the Gaussian centroids need to\nbe inferred from among the images themselves, which are\nnot truly representative of the center of the image space\nfor their classes.\n6.6 Extension to CIFAR-100 and Analysis of Deep Semantic Mapping\nSo far, our tests were on the CIFAR-10 dataset. 
We\nnow describe results on the more challenging CIFAR-100\ndataset [18], which consists of 100 classes, with 500 32 \u00d7 32 \u00d7 3 RGB images in each class. We\nremove 4 categories for which no vector representations were available in our vocabulary. We then\ncombine it with the CIFAR-10 dataset to get a set of 106 classes. Six zero-shot classes were chosen:\n\u2018forest\u2019, \u2018lobster\u2019, \u2018orange\u2019, \u2018boy\u2019, \u2018truck\u2019, and \u2018cat\u2019. As before, we train a neural network to map the\nvectors into semantic space. With this setup, we get a peak non-zero-shot accuracy of 52.7%, which\nis close to the baseline on 100 classes [16]. When all images are labeled as zero shot, the peak\naccuracy for the 6 unseen classes is 52.7%, where chance would be at 16.6%.\nBecause of the large semantic space corresponding to 100 classes, the proximity of an image to\nits appropriate class vector is dependent on the quality of the mapping into semantic space. We\nhypothesize that in this scenario a two-layer neural network as described in Sec. 4 will perform\nbetter than a single layer or linear mapping. Fig. 5 confirms this hypothesis. The zero-shot accuracy\nis 10% higher with a 2 layer neural net compared to a single layer with 42.2%.\n\nFigure 5: Comparison of accuracies for images from previously seen and unseen categories for the\nmodified CIFAR-100 dataset, after training the semantic mapping with a one-layer network and\ntwo-layer network. The deeper mapping function performs better.\n[Plot: x-axis: fraction of points classified as seen; y-axis: accuracy; curves: 1-layer NN and 2-layer NN, seen and unseen accuracies.]\n\n6.7 Zero-Shot Classes with Distractor Words\n\nWe would like zero-shot images to be classified\ncorrectly when there are a large number\nof unseen categories to choose from. 
To evaluate\nsuch a setting with many possible but incorrect\nunseen classes we create a set of distractor\nwords. We compare two scenarios. In\nthe first, we add random nouns to the semantic\nspace. In the second, much harder, setting we\nadd the k nearest neighbors of a word vector.\nWe then evaluate classification accuracy with\neach new set. For the zero-shot classes cat and\ntruck, the nearest-neighbor distractors include\nrabbit, kitten and mouse, among others.\nThe accuracy does not change much if random\ndistractor nouns are added. This shows that the\nsemantic space is spanned well and our zero-shot\nlearning model is quite robust. Fig. 6\nshows the classification accuracies for the second scenario. Here, accuracy drops as an increasing\nnumber of semantically related nearest neighbors are added to the distractor set. This is to be\nexpected because there are not enough related categories to accurately distinguish very similar\ncategories. After a certain number, the effect of a new distractor word is small. This is consistent with\nour expectation that a certain number of closely-related semantic neighbors would distract the\nclassifier; however, beyond that limited set, other categories would be further away in semantic space\nand would not affect classification accuracy.\n\nFigure 6: Visualization of the zero-shot classification\naccuracy when distractor words from the\nnearest neighbor set of a given category are also\npresent.\n\n7 Conclusion\nWe introduced a novel model for jointly doing standard and zero-shot classification based on deep\nlearned word and image representations. 
The two key ideas are that (i) semantic word vector\nrepresentations can help to transfer knowledge between modalities even when these representations\nare learned in an unsupervised way and (ii) our Bayesian framework that first differentiates novel\nunseen classes from points on the semantic manifold of trained classes can help to combine both\nzero-shot and seen classification into one framework. If the task were only to differentiate between\nvarious zero-shot classes, we could obtain accuracies of up to 90% with a fully unsupervised model.\n\nAcknowledgments\nRichard is partly supported by a Microsoft Research PhD fellowship. The authors gratefully acknowledge\nthe support of the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of\nText (DEFT) Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-13-2-0040,\nthe DARPA Deep Learning program under contract number FA8650-10-C-7020 and NSF IIS-1159679. Any\nopinions, findings, and conclusions or recommendations expressed in this material are those of the authors and\ndo not necessarily reflect the view of DARPA, AFRL, or the US government.\n\n[Figure 6 plot: accuracy vs. number of distractor words; curves for neighbors of cat and neighbors of truck.]\n\nReferences\n[1] M. Baroni and A. Lenci. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673\u2013721, 2010.\n[2] E. Bart and S. Ullman. Cross-generalization: learning novel classes from a single example by feature replacement. In CVPR, 2005.\n[3] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3, March 2003.\n[4] J. Blitzer, M. Dredze, and F. Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In ACL, 2007.\n[5] E. Bruni, G. Boleda, M. Baroni, and N. Tran. 
Distributional semantics in technicolor. In ACL, 2012.\n[6] A. Coates and A. Ng. The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization. In ICML, 2011.\n[7] R. Collobert and J. Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML, 2008.\n[8] J. Curran. From Distributional to Semantic Similarity. PhD thesis, University of Edinburgh, 2004.\n[9] K. Erk and S. Pad\u00f3. A structured vector space model for word meaning in context. In EMNLP, 2008.\n[10] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.\n[11] Y. Feng and M. Lapata. Visual information in semantic representation. In HLT-NAACL, 2010.\n[12] M. Fink. Object classification from a single example utilizing class relevance pseudo-metrics. In NIPS, 2004.\n[13] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for Large-Scale sentiment classification: A deep learning approach. In ICML, 2011.\n[14] D. Hoiem, A.A. Efros, and M. Hebert. Geometric context from a single image. In ICCV, 2005.\n[15] E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. Improving Word Representations via Global Context and Multiple Word Prototypes. In ACL, 2012.\n[16] Y. Jia, C. Huang, and T. Darrell. Beyond spatial pyramids: Receptive field learning for pooled image features. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3370\u20133377, June 2012.\n[17] H. Kriegel, P. Kr\u00f6ger, E. Schubert, and A. Zimek. LoOP: local Outlier Probabilities. In Proceedings of the 18th ACM conference on Information and knowledge management, CIKM \u201909, 2009.\n[18] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master\u2019s thesis, Computer Science Department, University of Toronto, 2009.\n[19] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. 
TPAMI, 28, 2006.\n[20] B. M. Lake, R. Salakhutdinov, J. Gross, and J. B. Tenenbaum. One shot learning of simple visual concepts. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society, 2011.\n[21] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer. In CVPR, 2009.\n[22] T. K. Landauer and S. T. Dumais. A solution to Plato\u2019s problem: the Latent Semantic Analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2):211\u2013240, 1997.\n[23] C.W. Leong and R. Mihalcea. Going beyond text: A hybrid image-text approach for measuring word relatedness. In IJCNLP, 2011.\n[24] D. Lin. Automatic retrieval and clustering of similar words. In Proceedings of COLING-ACL, pages 768\u2013774, 1998.\n[25] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A.Y. Ng. Multimodal deep learning. In ICML, 2011.\n[26] S. Pad\u00f3 and M. Lapata. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161\u2013199, 2007.\n[27] M. Palatucci, D. Pomerleau, G. Hinton, and T. Mitchell. Zero-shot learning with semantic output codes. In NIPS, 2009.\n[28] G.-J. Qi, C. Aggarwal, Y. Rui, Q. Tian, S. Chang, and T. Huang. Towards cross-category knowledge propagation for learning visual concepts. In CVPR, 2011.\n[29] R. Salakhutdinov, J. Tenenbaum, and A. Torralba. Learning to learn with compound hierarchical-deep models. In NIPS, 2012.\n[30] H. Sch\u00fctze. Automatic word sense discrimination. Computational Linguistics, 24:97\u2013124, 1998.\n[31] R. Socher and L. Fei-Fei. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In CVPR, 2010.\n[32] P. D. Turney and P. Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141\u2013188, 2010.\n[33] L. 
van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.\n", "award": [], "sourceid": 507, "authors": [{"given_name": "Richard", "family_name": "Socher", "institution": "Stanford University"}, {"given_name": "Milind", "family_name": "Ganjoo", "institution": "Stanford University"}, {"given_name": "Christopher", "family_name": "Manning", "institution": "Stanford University"}, {"given_name": "Andrew", "family_name": "Ng", "institution": "Stanford University"}]}