{"title": "Introspective Classification with Convolutional Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 823, "page_last": 833, "abstract": "We propose introspective convolutional networks (ICN) that emphasize the importance of having convolutional neural networks empowered with generative capabilities. We employ a reclassification-by-synthesis algorithm to perform training using a formulation stemming from Bayes' theorem. Our ICN iteratively: (1) synthesizes pseudo-negative samples; and (2) enhances itself by improving the classification. The single CNN classifier learned is at the same time generative --- being able to directly synthesize new samples within its own discriminative model. We conduct experiments on benchmark datasets including MNIST, CIFAR-10, and SVHN using state-of-the-art CNN architectures, and observe improved classification results.", "full_text": "Introspective Classification with Convolutional Nets

Long Jin, UC San Diego, longjin@ucsd.edu
Justin Lazarow, UC San Diego, jlazarow@ucsd.edu
Zhuowen Tu, UC San Diego, ztu@ucsd.edu

Abstract

We propose introspective convolutional networks (ICN) that emphasize the importance of having convolutional neural networks empowered with generative capabilities. We employ a reclassification-by-synthesis algorithm to perform training using a formulation stemming from Bayes' theorem. Our ICN iteratively: (1) synthesizes pseudo-negative samples; and (2) enhances itself by improving the classification. The single CNN classifier learned is at the same time generative — being able to directly synthesize new samples within its own discriminative model. 
We conduct experiments on benchmark datasets including MNIST, CIFAR-10, and SVHN using state-of-the-art CNN architectures, and observe improved classification results.

1 Introduction

Great success has been achieved in obtaining powerful discriminative classifiers via supervised training, such as decision trees [34], support vector machines [42], neural networks [23], boosting [7], and random forests [2]. However, recent studies reveal that even modern classifiers like deep convolutional neural networks [20] still make mistakes that look absurd to humans [11]. A common way to improve the classification performance is to use more data, in particular "hard examples", to train the classifier. Different types of approaches have been proposed in the past, including bootstrapping [31], active learning [37], semi-supervised learning [51], and data augmentation [20]. However, the approaches above utilize data samples that are either already present in the given training set or additionally created by humans or separate algorithms.
In this paper, we focus on improving convolutional neural networks by endowing them with synthesis capabilities that make them internally generative. In the past, attempts have been made to build connections between generative models and discriminative classifiers [8, 27, 41, 15]. In [44], a self-supervised boosting algorithm was proposed to train a boosting classifier by sequentially learning weak classifiers using the given data and self-generated negative samples; the generative via discriminative learning work in [40] generalizes the concept that unsupervised generative modeling can be accomplished by learning a sequence of discriminative classifiers via self-generated pseudo-negatives. 
Inspired by [44, 40], in which self-generated samples are utilized, as well as by recent success in deep learning [20, 9], we propose an introspective convolutional network (ICN) classifier and study how its internal generative aspect can benefit CNN's discriminative classification task. There is a recent line of work, generative adversarial networks (GAN) [10], that uses a discriminator to help an external generator; this differs from our objective here. We aim at building a single CNN model that is simultaneously discriminative and generative.
The introspective convolutional networks (ICN) introduced here have a number of properties. (1) We introduce introspection to convolutional neural networks and show its significance in supervised classification. (2) A reclassification-by-synthesis algorithm is devised to train ICN by iteratively augmenting the negative samples and updating the classifier. (3) A stochastic gradient descent sampling process is adopted to perform efficient synthesis for ICN. (4) We propose a supervised formulation to directly train a multi-class ICN classifier. We show consistent improvement over state-of-the-art CNN classifiers (ResNet [12]) on benchmark datasets in the experiments.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Related work

Our ICN method is directly related to the generative via discriminative learning framework [40]. It also has a connection to the self-supervised learning method [44], which focuses on density estimation by combining weak classifiers. Previous algorithms connecting generative modeling with discriminative classification [8, 27, 41, 15] fall into the category of hybrid models that are direct combinations of the two. Some existing works on introspective learning [22, 3, 38] have a different scope from the problem being tackled here. 
Other generative modeling schemes such as MiniMax entropy [50], inducing features [6], auto-encoders [1], and recent CNN-based generative modeling approaches [48, 47] are not designed for discriminative classification, and they do not have a single model that is both generative and discriminative. Below we discuss the two methods most related to ICN, namely generative via discriminative learning (GDL) [40] and generative adversarial networks (GAN) [10].

Relationship with generative via discriminative learning (GDL) [40]

ICN is largely inspired by GDL and follows a similar pipeline to the one developed in [40]. However, ICN also improves substantially over GDL, as summarized below.
• CNN vs. Boosting. ICN builds on top of convolutional neural networks (CNN) by explicitly revealing the introspectiveness of CNN, whereas GDL adopts the boosting algorithm [7].
• Supervised classification vs. unsupervised modeling. ICN focuses on the supervised classification task with competitive results on benchmark datasets, whereas GDL was originally applied to generative modeling and its power for the classification task itself was not addressed.
• SGD sampling vs. Gibbs sampling. ICN carries out efficient SGD sampling for synthesis through backpropagation, which is much more efficient than the Gibbs sampling strategy used in GDL.
• Single CNN vs. cascade of classifiers. ICN maintains a single CNN classifier, whereas GDL consists of a sequence of boosting classifiers.
• Automatic feature learning vs. manually specified features. ICN has greater representational power due to the end-to-end training of CNN, whereas GDL relies on manually designed features.

Comparison with Generative Adversarial Networks (GANs) [10]

Recent efforts in adversarial learning [10] are also very interesting and worth comparing with.
• Introspective vs. adversarial. 
ICN emphasizes being introspective by synthesizing samples from its own classifier, while GAN focuses on being adversarial — using a distinct discriminator to guide the generator.
• Supervised classification vs. unsupervised modeling. The main focus of ICN is to develop a classifier with introspection to improve the supervised classification task, whereas GAN is mostly for building high-quality generative models under unsupervised learning.
• Single model vs. two separate models. ICN retains a CNN discriminator that is itself a generator, whereas GAN maintains two models, a generator and a discriminator, with the discriminator in GAN trained to classify between "real" (given) and "fake" (generated by the generator) samples.
• Reclassification-by-synthesis vs. minimax. ICN engages an iterative procedure, reclassification-by-synthesis, stemming from Bayes' theorem, whereas GAN optimizes a minimax objective function. Training an ICN classifier is the same as training a standard CNN.
• Multi-class formulation. In a GAN-family work [36], a semi-supervised learning task is devised by adding an additional "not-real" class to the standard k classes in multi-class classification; this results in a setting different from standard multi-class classification, with additional model parameters. ICN instead aims directly at the supervised multi-class classification task, maintaining the same parameter setting within the softmax function without additional model parameters.

Later developments alongside GAN [35, 36, 49, 3] share some aspects with GAN and also do not achieve the same goal as ICN. Since the discriminator in GAN is not meant to perform the generic two-class/multi-class classification task, some special settings for semi-supervised learning [10, 35, 49, 3, 36] were created. 
ICN instead has a single model that is both generative and discriminative, and thus an improvement to ICN's generator directly ameliorates its discriminator. Other work like [11] was motivated by the observation that adding small perturbations to an image leads to classification errors that are absurd to humans; their approach, however, augments positive samples from existing input, whereas ICN is able to synthesize new samples from scratch. A recent work proposed in [21] is in the same family as ICN, but [21] focuses on unsupervised image modeling using a cascade of CNNs.

3 Method

The pipeline of ICN is shown in Figure 1; ICN offers an immediate improvement over GDL [40] in the several aspects described in the previous section. One particular gain of ICN is its representational power and efficient sampling process through backpropagation as a variational sampling strategy.

3.1 Formulation

We start the discussion by introducing the basic formulation, borrowing notation from [40]. Let x be a data sample (vector) and y ∈ {−1, +1} be its label, indicating either a negative or a positive sample (in multi-class classification y ∈ {1, ..., K}). We study binary classification first. A discriminative classifier computes p(y|x), the probability of x being positive or negative, with p(y = −1|x) + p(y = +1|x) = 1. A generative model instead models p(y, x) = p(x|y)p(y), which captures the underlying generation process of x for class y. In binary classification, positive samples are of primary interest. 
Under Bayes' rule:

p(x|y = +1) = [p(y = +1|x) p(y = −1)] / [p(y = −1|x) p(y = +1)] · p(x|y = −1),   (1)

which can be further simplified when assuming equal priors p(y = +1) = p(y = −1):

p(x|y = +1) = [p(y = +1|x) / (1 − p(y = +1|x))] · p(x|y = −1).   (2)

Figure 1: Schematic illustration of our reclassification-by-synthesis algorithm for ICN training. The top-left figure shows the input training samples, where the circles in red are positive samples and the crosses in blue are the negatives. The bottom figures are the samples progressively self-generated by the classifier in the synthesis steps, and the top figures show the decision boundaries (in purple) progressively updated in the reclassification steps. Pseudo-negatives (purple crosses) are gradually generated and help tighten the decision boundaries. (The panels are labeled "Reclassification Step: training on the given training data + generated pseudo-negatives" and "Synthesis Step: synthesize pseudo-negative samples".)

We make two interesting and important observations from Eqn. (2): 1) p(x|y = +1) is dependent on the faithfulness of p(x|y = −1), and 2) a classifier C reporting p(y = +1|x) can be made simultaneously generative and discriminative. However, there is a requirement: having an informative distribution for the negatives p(x|y = −1), such that samples drawn x ∼ p(x|y = −1) have good coverage of the entire space x ∈ R^m, especially for samples that are close to the positives x ∼ p(x|y = +1), to allow the classifier to faithfully learn p(y = +1|x). There seems to exist a dilemma. 
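A minimal numerical sketch of Eqn. (2), assuming toy one-dimensional Gaussian class densities and the classifier they induce (illustrative assumptions only, not part of our model):

```python
import math

# Toy 1-D illustration of Eqn. (2): two Gaussian class densities (assumed
# for illustration) and the classifier they induce under equal priors.
def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def p_pos(x):  # p(x | y = +1)
    return gaussian_pdf(x, 1.0, 1.0)

def p_neg(x):  # p(x | y = -1)
    return gaussian_pdf(x, -1.0, 1.0)

def q_pos_given_x(x):  # p(y = +1 | x) under equal priors
    return p_pos(x) / (p_pos(x) + p_neg(x))

def p_pos_via_eqn2(x):
    # p(x|y=+1) = [p(y=+1|x) / (1 - p(y=+1|x))] * p(x|y=-1)
    q = q_pos_given_x(x)
    return q / (1.0 - q) * p_neg(x)

# The reconstruction matches the true positive density.
for x in [-2.0, 0.0, 0.5, 3.0]:
    assert abs(p_pos_via_eqn2(x) - p_pos(x)) < 1e-9
```

The identity holds for any choice of densities; the Gaussians above are only a concrete instance.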
In supervised learning, we are given only a limited amount of training data, and a classifier C focuses only on the decision boundary separating the given samples, so its classification of unseen data may not be accurate. This can be seen in the top-left plot of Figure 1. This motivates us to implement the synthesis part within learning — to make a learned discriminative classifier generate samples that pass its own classification, and to see how different these generated samples are from the given positive samples. This allows us to attain a single model with two aspects at the same time: a generative model for the positive samples and an improved classifier for the classification.
Suppose we are given a training set S = {(x_i, y_i), i = 1..n} with x ∈ R^m and y ∈ {−1, +1}. One can directly train a discriminative classifier C, e.g. a convolutional neural network [23], to learn p(y = +1|x), which is always an approximation due to various reasons including insufficient training samples, generalization error, and classifier limitations. Previous attempts to improve classification by data augmentation were mostly done by adding more positive samples [20, 11]; we instead argue the importance of adding more negative samples to improve the classification performance. The dilemma is that S = {(x_i, y_i), i = 1..n} is limited to the given data. For clarity, we now use p^-(x) to represent p(x|y = −1). Our goal is to augment the negative training set by generating confusing pseudo-negatives to improve the classification (note that in the end, pseudo-negative samples drawn x ∼ p_t^-(x) will become hard to distinguish from the given positive samples; cross-validation can be used to determine when using more pseudo-negatives no longer reduces the validation error). We call the samples drawn from x ∼ p_t^-(x) pseudo-negatives (defined in [40]). 
We expand S = {(x_i, y_i), i = 1..n} to S_e^t = S ∪ S_pn^t, where S_pn^0 = ∅ and, for t ≥ 1, S_pn^t = {(x_i, −1), i = n + 1, ..., n + tl}. S_pn^t includes all the pseudo-negative samples self-generated from our model up to time t, and l indicates the number of pseudo-negatives generated at each round. We define a reference distribution p_r^-(x) = U(x), where U(x) is a Gaussian distribution (e.g. N(0.0, 0.3^2) independently per dimension). We carry out learning with t = 0...T to iteratively obtain q_t(y = +1|x) and q_t(y = −1|x) by updating classifier C^t on S_e^t = S ∪ S_pn^t. The initial classifier C^0 on S_e^0 = S reports the discriminative probability q_0(y = +1|x). The reason for using q is that it is an approximation to the true p due to the limited samples drawn in R^m. At each time t, we then compute

p_t^-(x) = (1/Z_t) · [q_t(y = +1|x) / q_t(y = −1|x)] · p_r^-(x),   (3)

where Z_t = ∫ [q_t(y = +1|x) / q_t(y = −1|x)] p_r^-(x) dx. We draw new samples x_i ∼ p_t^-(x) to expand the pseudo-negative set:

S_pn^{t+1} = S_pn^t ∪ {(x_i, −1), i = n + tl + 1, ..., n + (t + 1)l}.   (4)

We name the specific training algorithm for our introspective convolutional network (ICN) classifier reclassification-by-synthesis; it is described in Algorithm 1. We adopt a convolutional neural network (CNN) classifier to build an end-to-end learning framework with an efficient sampling process (to be discussed in the next section).

3.2 Reclassification-by-synthesis

We present our reclassification-by-synthesis algorithm for ICN in this section. A schematic illustration is shown in Figure 1. A single CNN classifier is trained progressively, serving simultaneously as a discriminator and a generator. 
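The draw from Eqn. (3) and the bookkeeping of Eqn. (4) can be sketched with a simple sampling-importance-resampling scheme in place of the backpropagation-based sampler described later; the one-dimensional classifier q below is an illustrative stand-in for the CNN, not our actual model:

```python
import math
import random

random.seed(0)

def draw_pseudo_negatives(q_pos, l, sigma=0.3, pool=2000):
    """Sampling-importance-resampling sketch of Eqn. (3): draw a pool from
    the reference p_r^-(x) = N(0, sigma^2), weight each point by
    q(y=+1|x) / q(y=-1|x), then resample l points. The partition function
    Z_t cancels when the weights are normalized."""
    pool_xs = [random.gauss(0.0, sigma) for _ in range(pool)]
    weights = [q_pos(x) / max(1.0 - q_pos(x), 1e-12) for x in pool_xs]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(pool_xs, weights=probs, k=l)

# Illustrative classifier q_t(y=+1|x): a sigmoid preferring larger x.
q = lambda x: 1.0 / (1.0 + math.exp(-4.0 * x))

S_pn = []  # S^0_pn = empty; grows by l labeled samples per round (Eqn. (4))
for t in range(3):
    S_pn += [(x, -1) for x in draw_pseudo_negatives(q, l=5)]
assert len(S_pn) == 15
```

With this toy q, the resampled points skew toward the region the classifier scores as positive, which is exactly where confusing pseudo-negatives are wanted.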
With the pseudo-negatives being gradually generated, the classification boundary gets tightened and hence yields an improvement in the classifier's performance. The reclassification-by-synthesis method is described in Algorithm 1. The algorithm consists of two key steps: (1) a reclassification-step and (2) a synthesis-step, discussed in detail below.

3.2.1 Reclassification-step

The reclassification-step can be viewed as training a normal classifier on the training set S_e^t = S ∪ S_pn^t, where S = {(x_i, y_i), i = 1..n}, S_pn^0 = ∅, and S_pn^t = {(x_i, −1), i = n + 1, ..., n + tl} for t ≥ 1. We use a CNN as our base classifier. When training a classifier C^t on S_e^t, we denote the parameters to be learned in C^t by a high-dimensional vector W_t = (w_t^(0), w_t^(1)), which might consist of millions of parameters. w_t^(1) denotes the weights of the top layer combining the features φ(x; w_t^(0)), and w_t^(0) carries all the internal representations. Without loss of generality, we assume a sigmoid function for the discriminative probability

q_t(y|x; W_t) = 1 / (1 + exp{−y w_t^(1) · φ(x; w_t^(0))}),

where φ(x; w_t^(0)) defines the feature extraction function for x. Both w_t^(1) and w_t^(0) can be learned by the standard stochastic gradient descent algorithm via backpropagation, minimizing a cross-entropy loss with an additional term on the pseudo-negatives:

L(W_t) = − Σ_{(x_i, y_i) ∈ S} ln q_t(y_i|x_i; W_t) − Σ_{(x_i, −1) ∈ S_pn^t} ln q_t(−1|x_i; W_t).   (5)

Algorithm 1 Outline of the reclassification-by-synthesis algorithm for discriminative classifier training.

Input: a set of training data S = {(x_i, y_i), i = 1..n} with x ∈ R^m and y ∈ {−1, +1}.
Initialization: obtain a reference distribution p_r^-(x) = U(x), where U(x) is a zero-mean Gaussian distribution, and train an initial CNN binary classifier C^0 on S, giving q_0(y = +1|x). S_pn^0 = ∅.
For t = 0..T:
1. Update the model: p_t^-(x) = (1/Z_t) · [q_t(y = +1|x) / q_t(y = −1|x)] · p_r^-(x).
2. Synthesis-step: sample l pseudo-negative samples x_i ∼ p_t^-(x), i = n + tl + 1, ..., n + (t + 1)l, from the current model p_t^-(x) using an SGD sampling procedure.
3. Augment the pseudo-negative set: S_pn^{t+1} = S_pn^t ∪ {(x_i, −1), i = n + tl + 1, ..., n + (t + 1)l}.
4. Reclassification-step: update the CNN classifier to C^{t+1} on S_e^{t+1} = S ∪ S_pn^{t+1}, resulting in q_{t+1}(y = +1|x).
5. t ← t + 1 and go back to step 1 until convergence (e.g. no improvement on the validation set).
End

3.2.2 Synthesis-step

In the reclassification step, we obtain q_t(y|x; W_t), which is then used to update p_t^-(x) according to Eqn. (3):

p_t^-(x) = (1/Z_t) · [q_t(y = +1|x; W_t) / q_t(y = −1|x; W_t)] · p_r^-(x).   (6)

In the synthesis-step, our goal is to draw fair samples from p_t^-(x) (fair samples refer to typical samples of the target distribution, obtained by a sampling process after convergence). 
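A minimal sketch of the sigmoid probability and the reclassification-step objective in Eqn. (5), with a fixed linear feature map standing in for φ(x; w_t^(0)) (an illustrative assumption):

```python
import math

def q_t(y, x, w1, phi):
    """Sigmoid discriminative probability q_t(y | x; W_t); phi is an
    illustrative stand-in for the CNN feature extractor."""
    return 1.0 / (1.0 + math.exp(-y * sum(a * b for a, b in zip(w1, phi(x)))))

def reclassification_loss(S, S_pn, w1, phi):
    # Eqn. (5): cross-entropy on the given data S plus a term pushing
    # every pseudo-negative in S_pn toward the label -1.
    loss = -sum(math.log(q_t(y, x, w1, phi)) for x, y in S)
    loss -= sum(math.log(q_t(-1, x, w1, phi)) for x, _ in S_pn)
    return loss
```

Minimizing this over w1 (and, in the real model, over the feature parameters as well) is an ordinary CNN training step; only the extra pseudo-negative term distinguishes it from standard cross-entropy.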
In [40], various Markov chain Monte Carlo techniques [28], including Gibbs sampling and Iterated Conditional Modes (ICM), were adopted; these are often slow. Motivated by the DeepDream code [32] and the Neural Artistic Style work [9], we instead update a random sample x drawn from p_r^-(x) by increasing q_t(y = +1|x; W_t) / q_t(y = −1|x; W_t) using backpropagation. Note that the partition function (normalization) Z_t is a constant that does not depend on the sample x. Let

g_t(x) = q_t(y = +1|x; W_t) / q_t(y = −1|x; W_t) = exp{w_t^(1) · φ(x; w_t^(0))},   (7)

and take its ln, which is nicely turned into the logit of q_t(y = +1|x; W_t):

ln g_t(x) = w_t^(1) · φ(x; w_t^(0)).   (8)

Starting from x drawn from p_r^-(x), we directly increase w_t^(1) · φ(x; w_t^(0)) using stochastic gradient ascent on x via backpropagation, which allows us to obtain fair samples subject to Eqn. (6). Gaussian noise can be added to Eqn. (8) along the lines of stochastic gradient Langevin dynamics [43]:

Δx = (ε/2) ∇(w_t^(1) · φ(x; w_t^(0))) + η,

where η ∼ N(0, ε) is Gaussian noise and ε is the step size, which is annealed during the sampling process.
Sampling strategies. In our experiments, we carry out several strategies using the stochastic gradient descent algorithm (SGD) and SGD Langevin, including: i) early stopping of the sampling process once x becomes positive (in line with contrastive divergence [4], where a short Markov chain is simulated); ii) stopping once the confidence of x being positive is large; and iii) sampling for a fixed, large number of steps. 
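The Langevin-style update above can be sketched on a toy one-dimensional model whose score w_t^(1) · φ(x) = 4x − x² has a closed-form gradient (an assumed stand-in for the backpropagated CNN gradient):

```python
import math
import random

def synthesize(grad_score, x0, steps=200, eps=0.05):
    """SGD Langevin sketch of the synthesis step: ascend the score
    ln g_t(x) = w_t^(1) . phi(x; w_t^(0)) via
        delta-x = (eps/2) * grad(score) + eta,  eta ~ N(0, eps),
    so the injected noise has standard deviation sqrt(eps). grad_score
    stands in for the gradient obtained by backpropagation."""
    x = x0
    for _ in range(steps):
        x += 0.5 * eps * grad_score(x) + random.gauss(0.0, math.sqrt(eps))
    return x

# Toy score 4x - x^2 (illustrative): gradient 4 - 2x. Samples concentrate
# near x = 2, the mode of the unnormalized density exp{4x - x^2}.
random.seed(0)
draws = [synthesize(lambda x: 4.0 - 2.0 * x, x0=0.0) for _ in range(50)]
```

For this quadratic score the target is a Gaussian centered at 2, so the average of the draws should land near 2; in the real model the score is the CNN logit and the gradient comes from backpropagation.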
Table 2 shows the results of these different options; no major differences in classification performance are observed.
Building connections between SGD and MCMC is an active area in machine learning [43, 5, 30]. In [43], combining SGD with additional Gaussian noise under an annealed step size results in a simulation of Langevin dynamics MCMC. A recent work [30] further shows the similarity between constant SGD and MCMC, along with an analysis of SGD using momentum updates. Our progressively learned discriminative classifier can be viewed as carving out the feature space on φ(x), which essentially becomes an equivalence class for the positives; the volume of the equivalence class satisfying the condition is exponentially large, as analyzed in [46]. The probability landscape of the positives (equivalence class) keeps our SGD sampling process from being biased towards a small, limited set of modes. Figure 2 illustrates the large variation of the sampled/synthesized examples.

3.3 Analysis

The convergence of p_t^-(x) → p^+(x) as t → ∞ can be derived (see the supplementary material), inspired by the proof from [40]: KL[p^+(x)||p_{t+1}^-(x)] ≤ KL[p^+(x)||p_t^-(x)], where KL denotes the Kullback-Leibler divergence and p(x|y = +1) ≡ p^+(x), under the assumption that the classifier at t + 1 improves over that at t.
Remark. Here we pay particular attention to the negative samples, which live in a space that is often much larger than the positive sample space. For the negative training samples, we have y_i = −1 and x_i ∼ Q^-(x), where Q^-(x) is a distribution on the given negative examples in the original training set. Our reclassification-by-synthesis algorithm (Algorithm 1) essentially constructs a mixture model p̃(x) ≡ (1/T) Σ_{t=0}^{T−1} p_t^-(x) by sequentially generating pseudo-negative samples to augment our training set. 
Our new distribution for the augmented negative sample set thus becomes Q_new^-(x) ≡ [n/(n + Tl)] Q^-(x) + [Tl/(n + Tl)] p̃(x), where p̃(x) encodes pseudo-negative samples that are confusing and similar to (but are not) the positives. In the end, adding pseudo-negatives might degrade the classification result, since they become more and more similar to the positives. Cross-validation can be used to decide when adding more pseudo-negatives stops helping the classification task. How to better use the pseudo-negative samples that are increasingly faithful to the positives is an interesting topic worth further exploring. Our overall algorithm is thus capable of enhancing classification by self-generating confusing samples to improve the CNN's robustness.

3.4 Multi-class classification

One-vs-all. The discussion above covers the binary classification case. When dealing with multi-class classification problems, such as MNIST and CIFAR-10, we need to adapt our proposed reclassification-by-synthesis scheme to the multi-class case. This can be done directly using a one-vs-all strategy: train a binary classifier C_i using the i-th class as the positive class and the remaining classes combined as the negative class, resulting in a total of K binary classifiers. The training procedure then becomes identical to the binary classification case. 
If we have K classes, the algorithm trains K individual binary classifiers with parameters ⟨(w_t^(0)1, w_t^(1)1), ..., (w_t^(0)K, w_t^(1)K)⟩. The prediction function is simply

f(x) = argmax_k exp{w_t^(1)k · φ(x; w_t^(0)k)}.

The advantage of the one-vs-all strategy is that the algorithm can be made nearly identical to the binary case, at the price of training K different neural networks.
Softmax function. It is also desirable to build a single CNN classifier that performs multi-class classification directly. Here we propose a formulation to train an end-to-end multi-class classifier directly. Since we are directly dealing with K classes, the pseudo-negative data set is slightly different, and we introduce negatives for each individual class with S_pn^0 = ∅ and:

S_pn^t = {(x_i, −k), k = 1, ..., K, i = n + (t − 1) × K × l + 1, ..., n + t × K × l}.

Suppose we are given a training set S = {(x_i, y_i), i = 1..n} with x ∈ R^m and y ∈ {1, ..., K}. We want to train a single CNN classifier with

W_t = ⟨w_t^(0), w_t^(1)1, ..., w_t^(1)K⟩,

where w_t^(0) denotes the internal features and parameters of the single CNN, and w_t^(1)k denotes the top-layer weights for the k-th class. We therefore minimize an integrated objective function

L(W_t) = −(1 − α) Σ_{i=1}^{n} ln [exp{w_t^(1)y_i · φ(x_i; w_t^(0))} / Σ_{k=1}^{K} exp{w_t^(1)k · φ(x_i; w_t^(0))}] + α Σ_{i=n+1}^{n+t×K×l} ln(1 + exp{w_t^(1)|y_i| · φ(x_i; w_t^(0))}).   (9)

The first term in Eqn. (9) encourages a softmax loss on the original training set S. The second term in Eqn. 
(9) encourages a good prediction on the individual pseudo-negative class generated for the k-th class (indexed by |y_i| for w_t^(1)|y_i|; e.g., pseudo-negative samples belonging to the k-th class have |y_i| = |−k| = k). α is a hyperparameter balancing the two terms. Note that we only need to build a single CNN sharing w_t^(0) across all K classes. In particular, we are not introducing additional model parameters, and we perform a direct K-class classification where the parameter setting is identical to a standard CNN multi-class classification task; in comparison, an additional "not-real" class is created in [36], and the classification task there [36] thus becomes a (K + 1)-class classification.

4 Experiments

Figure 2: Synthesized pseudo-negatives for the MNIST dataset by our ICN classifier. The top row shows some training examples. As t increases, our classifier gradually synthesizes pseudo-negative samples that become increasingly faithful to the training samples.

We conduct experiments on three standard benchmark datasets: MNIST, CIFAR-10, and SVHN. We use MNIST as a running example to illustrate our proposed framework using a shallow CNN; we then show competitive results using a state-of-the-art CNN classifier, ResNet [12], on MNIST, CIFAR-10, and SVHN. In our experiments, for the reclassification step, we use the SGD optimizer with a mini-batch size of 64 (MNIST) or 128 (CIFAR-10 and SVHN) and momentum equal to 0.9; for the synthesis step, we use the Adam optimizer [17] with momentum term β1 equal to 0.5. All results are obtained by averaging multiple rounds.
Training and test time. In general, the training time for ICN is around double that of the baseline CNNs in our experiments: 1.8 times for the MNIST dataset, 2.1 times for CIFAR-10, and 1.7 times for SVHN. The added overhead in training is mostly determined by the number of generated pseudo-negative samples. 
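The integrated multi-class objective in Eqn. (9) can be sketched as follows, with a fixed feature map standing in for the shared CNN features φ(x; w_t^(0)) (an illustrative assumption):

```python
import math

def softmax_log_prob(scores, k):
    # ln [exp(s_k) / sum_j exp(s_j)], computed stably
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[k] - log_z

def icn_multiclass_loss(S, S_pn, W1, phi, alpha=0.1):
    """Sketch of Eqn. (9). W1 is a list of top-layer weight vectors
    w^(1)k, one per class; phi stands in for the shared features.
    Labels in S are in {1..K}; a pseudo-negative generated for class k
    carries the label -k."""
    scores = lambda x: [sum(a * b for a, b in zip(w, phi(x))) for w in W1]
    term1 = -sum(softmax_log_prob(scores(x), y - 1) for x, y in S)
    # ln(1 + exp{w^(1)|y| . phi(x)}) pushes the class-|y| score down
    term2 = sum(math.log1p(math.exp(scores(x)[abs(y) - 1])) for x, y in S_pn)
    return (1 - alpha) * term1 + alpha * term2
```

With alpha = 0 this reduces to the standard softmax cross-entropy; the second term is the only change, and it adds no model parameters.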
For the test time, ICN introduces no additional overhead to the baseline CNNs.

4.1 MNIST

We use the standard MNIST [24] dataset, which consists of 55,000 training, 5,000 validation, and 10,000 test samples. We adopt a simple network containing 4 convolutional layers, each having a 5 × 5 filter size, with 64, 128, 256, and 512 channels, respectively. These convolutional layers have stride 2, and no pooling layers are used. LeakyReLU activations [29] are used after each convolutional layer. The last convolutional layer is flattened and fed into a sigmoid output (in the one-vs-all case).
In the reclassification step, we run SGD (for 5 epochs) on the current training data S_e^t, including previously generated pseudo-negatives. Our initial learning rate is 0.025 and is decreased by a factor of 10 at t = 25. In the synthesis step, we use the backpropagation sampling process discussed in Section 3.2.2. In Table 2, we compare different sampling strategies. Each time, we synthesize a fixed number (200 in our experiments) of pseudo-negative samples.
We show some synthesized pseudo-negatives from the MNIST dataset in Figure 2. The samples in the top row are from the original training dataset. ICN gradually synthesizes pseudo-negatives that are increasingly faithful to the original data. Pseudo-negative samples are used continuously while improving the classification result.

Table 1: Test errors on the MNIST dataset. We compare our ICN method with the baseline CNN, Deep Belief Network (DBN) [14], and CNN w/ Label Smoothing (LS) [39]. 
Moreover, the two-step experiments combining CNN + GDL [40] and CNN + DCGAN [35] are also reported; see the descriptions in the text for more details.

Method | One-vs-all (%) | Softmax (%)
CNN (baseline) | 0.87 | 0.77
DBN [14] | - | 1.11
CNN w/ LS [39] | - | 0.69
CNN + GDL | 0.85 | -
CNN + DCGAN | 0.84 | -
ICN-noise (ours) | 0.89 | 0.77
ICN (ours) | 0.78 | 0.72

Comparison of different sampling strategies. We perform SGD and SGD Langevin (with injected Gaussian noise) and try several stopping options for the backpropagation-based sampling. Option 1: early stopping once the generated samples are classified as positive; option 2: stopping at a high confidence for samples being positive; option 3: stopping after a large number of steps. Table 2 shows the results; we do not observe significant differences among these choices.

Table 2: Comparison of different sampling strategies in the synthesis step in ICN.

Sampling Strategy | One-vs-all (%) | Softmax (%)
SGD (option 1) | 0.81 | 0.72
SGD Langevin (option 1) | 0.80 | 0.72
SGD (option 2) | 0.78 | 0.72
SGD Langevin (option 2) | 0.78 | 0.74
SGD (option 3) | 0.81 | 0.75
SGD Langevin (option 3) | 0.80 | 0.73

Ablation study. We experiment with random noise as synthesized pseudo-negatives in an ablation study. From Table 1, we observe that our ICN outperforms the CNN baseline and the ICN-noise method in both the one-vs-all and softmax cases.

Figure 3: MNIST test error against the number of training examples (the std. dev. of the test error is also displayed). The effect of ICN is clearer when there are fewer training examples.

Effects of varying training sizes. To better understand the effectiveness of our ICN method, we carry out an experiment varying the number of training examples. We use training sets of different sizes: 500, 2000, 10000, and 55000 examples. 
The results are reported in Figure 3. ICN is shown to be particularly effective when the training set is relatively small, since ICN has the capability to synthesize pseudo-negatives by itself to aid training.

Comparison with GDL and GAN. GDL [40] focuses on unsupervised learning; GAN [10] and DCGAN [35] show results for unsupervised learning and semi-supervised classification. To apply GDL and GAN to the supervised classification setting, we design a two-step experiment. For GDL, we ran the GDL code [40] and obtained pseudo-negative samples for each individual digit; the pseudo-negatives are then used as augmented negative samples to train individual one-vs-all CNN classifiers (using a CNN architecture identical to ICN's for a fair comparison), which are combined into a multi-class classifier in the end. To compare with DCGAN [35], we follow the same procedure: each generator trained by DCGAN [35] using the TensorFlow implementation [16] was used to generate positive samples, which are then added to the negative set to train the individual one-vs-all CNN classifiers (also using a CNN architecture identical to ICN's), which are combined to create the overall multi-class classifier. CNN+GDL achieves a test error of 0.85% and CNN+DCGAN achieves a test error of 0.84% on the MNIST dataset, whereas ICN reports an error of 0.78% using the same CNN architecture. As the supervised learning task was not directly specified in DCGAN [35], some care is needed to design the optimal setting for utilizing the generated samples from DCGAN in the two-step approach (we made attempts to optimize the results). GDL [40] can be turned into a discriminative classifier by utilizing the given negative samples first, but it adopts boosting [7] with manually designed features, which may not produce results as competitive as a CNN classifier.
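Both two-step baselines and ICN's one-vs-all variant combine per-digit classifiers into a single multi-class prediction. A minimal sketch of one natural combination rule (argmax over per-class confidences; the exact rule is an assumption here, and the scores are made up for illustration):

```python
def combine_one_vs_all(scores_per_class):
    """Combine K one-vs-all confidence scores into one multi-class prediction
    by taking the argmax over the per-class scores."""
    best_class, best_score = None, float("-inf")
    for label, score in scores_per_class.items():
        if score > best_score:
            best_class, best_score = label, score
    return best_class

# Hypothetical sigmoid confidences of the ten per-digit classifiers on one image.
scores = {0: 0.02, 1: 0.01, 2: 0.10, 3: 0.91, 4: 0.05,
          5: 0.30, 6: 0.02, 7: 0.08, 8: 0.40, 9: 0.03}
print(combine_one_vs_all(scores))  # → 3
```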
Nevertheless, the advantage of ICN as an integrated, end-to-end, single-model supervised learning framework can be observed.

To compare with a generative-model-based deep learning approach, we report the classification result of DBN [14] in Table 1. DBN achieves a test error of 1.11% using the softmax function. We also compare with Label Smoothing (LS), which is used in [39] as a regularization technique that encourages the model to be less confident. In LS, for a training example with a ground-truth label, the label distribution is replaced with a mixture of the original ground-truth distribution and a fixed distribution. LS achieves a test error of 0.69% in the softmax case.

In addition, we also adopt ResNet-32 [13] (using the softmax function) as another baseline CNN model, which achieves a test error of 0.50% on the MNIST dataset. Our ResNet-32 based ICN achieves an improved result of 0.47%.

Robustness to external adversarial examples. To show the improved robustness of ICN in dealing with confusing and challenging examples, we compare the baseline CNN with our ICN classifier on adversarial examples generated using the "fast gradient sign" method from [11]. This method (with ε = 0.25) can cause a maxout network to misclassify 89.4% of adversarial examples generated from the MNIST test set [11]. In our experiment, we set ε = 0.125. Starting with the 10,000 MNIST test examples, we first determine those that are correctly classified by the baseline CNN, in order to generate adversarial examples from them. We find that 5,111 generated adversarial examples successfully fool the baseline CNN; however, only 3,134 of these examples can fool our ICN classifier, which is a 38.7% reduction in error against adversarial examples.
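The "fast gradient sign" construction from [11] used above perturbs an input by ε in the direction of the sign of the loss gradient. A minimal sketch on a toy logistic model (the weights and inputs here are illustrative, not the networks from the experiment):

```python
import math

# Toy logistic model standing in for the CNN; the weights are illustrative.
W = [2.0, -1.0, 0.5]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(x, y):
    """Logistic loss for a label y in {-1, +1}."""
    z = sum(wi * xi for wi, xi in zip(W, x))
    return math.log(1.0 + math.exp(-y * z))

def sign(v):
    return (v > 0) - (v < 0)

def fast_gradient_sign(x, y, eps=0.125):
    """Return x_adv = x + eps * sign(grad_x loss(x, y))."""
    z = sum(wi * xi for wi, xi in zip(W, x))
    # grad_x loss = -y * (1 - sigmoid(y * z)) * w
    coef = -y * (1.0 - sigmoid(y * z))
    return [xi + eps * sign(coef * wi) for xi, wi in zip(x, W)]

x = [0.2, -0.1, 0.4]
x_adv = fast_gradient_sign(x, y=1)
print(loss(x_adv, 1) > loss(x, 1))  # → True: the perturbation increases the loss
```

For a real network, the input gradient would come from backpropagation rather than the closed-form expression above.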
Note that the improvement is achieved without using any additional training data, and without knowing a priori how these adversarial examples are generated by the specific "fast gradient sign" method [11]. Conversely, of the 2,679 adversarial examples generated against the ICN classifier that fool ICN using the same method, 2,079 can still fool the baseline CNN classifier. This two-way experiment shows the improved robustness of ICN over the baseline CNN.

Table 3: Test errors on the CIFAR-10 dataset. In both one-vs-all and softmax cases, ICN shows improvement over the baseline ResNet model. The result of Convolutional DBN is from [19].

4.2 CIFAR-10

The CIFAR-10 dataset [18] consists of 60,000 color images of size 32 × 32, split into 50,000 training images and 10,000 test images. We adopt ResNet [13] as our baseline model [45]. For data augmentation, we follow the standard procedure in [26, 25, 13]: we zero-pad 4 pixels on each side of the images, and perform cropping and random flipping. The results are reported in Table 3. In both one-vs-all and softmax cases, ICN outperforms the baseline ResNet classifiers. Our proposed ICN method is orthogonal to many existing approaches that use various improvements to the network structure to enhance CNN performance. We also compare ICN with the Convolutional DBN [19], ResNet-32 w/ Label Smoothing (LS) [39], and ResNet-32 + DCGAN [35] methods as described in the MNIST experiments.
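The label-distribution mixture used by the LS baselines (described in the MNIST experiments) admits a short sketch; the mixing weight alpha and the choice of a uniform fixed distribution are illustrative assumptions consistent with [39]:

```python
def smooth_labels(true_class, num_classes, alpha=0.1):
    """Mix the one-hot ground-truth distribution with a fixed uniform
    distribution, as in label smoothing [39]; alpha is an assumed value."""
    uniform = 1.0 / num_classes
    return [(1.0 - alpha) * (1.0 if k == true_class else 0.0) + alpha * uniform
            for k in range(num_classes)]

dist = smooth_labels(true_class=3, num_classes=10)
print(round(dist[3], 2), round(dist[0], 2))  # → 0.91 0.01
```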
LS is shown to improve the baseline but is worse than our ICN method in most cases, except on the MNIST dataset.

w/o Data Augmentation
Method                 One-vs-all (%)   Softmax (%)
Convolutional DBN      -                21.1
ResNet-32 (baseline)   13.44            12.38
ResNet-32 w/ LS        -                12.65
ResNet-32 + DCGAN      12.99            -
ICN-noise (ours)       13.28            11.94
ICN (ours)             12.94            11.46

w/ Data Augmentation
Method                 One-vs-all (%)   Softmax (%)
ResNet-32 (baseline)   6.70             7.06
ResNet-32 w/ LS        -                6.89
ResNet-32 + DCGAN      6.75             -
ICN-noise (ours)       6.58             6.90
ICN (ours)             6.52             6.70

4.3 SVHN

We use the standard SVHN [33] dataset. We combine the training data with the extra data to form our training set and use the test data as the test set. No data augmentation is applied. The results are reported in Table 4. ICN is shown to achieve competitive results.

Table 4: Test errors on the SVHN dataset.

Method                 Softmax (%)
ResNet-32 (baseline)   2.01
ResNet-32 w/ LS        1.96
ResNet-32 + DCGAN      1.98
ICN-noise (ours)       1.99
ICN (ours)             1.95

5 Conclusion

In this paper, we have proposed the introspective convolutional networks (ICN) algorithm, which performs internal introspection. We observe performance gains within supervised learning using state-of-the-art CNN architectures on standard machine learning benchmarks.

Acknowledgement. This work is supported by NSF IIS-1618477, NSF IIS-1717431, and a Northrop Grumman Contextual Robotics grant. We thank Saining Xie, Weijian Xu, Fan Fan, Kwonjoon Lee, Shuai Tang, and Sanjoy Dasgupta for helpful discussions.

References

[1] P. Baldi. Autoencoders, unsupervised learning, and deep architectures. In ICML Workshop on Unsupervised and Transfer Learning, pages 37–49, 2012.
[2] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[3] A. Brock, T. Lim, J. Ritchie, and N. Weston. Neural photo editing with introspective adversarial networks. In ICLR, 2017.
[4] M. A. Carreira-Perpinan and G. Hinton. On contrastive divergence learning.
In AISTATS, volume 10, pages 33–40, 2005.
[5] T. Chen, E. B. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In ICML, 2014.
[6] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.
[7] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[8] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, Springer, Berlin, 2001.
[9] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[11] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[14] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[15] T. Jebara. Machine Learning: Discriminative and Generative, volume 755. Springer Science & Business Media, 2012.
[16] T. Kim. DCGAN-tensorflow. https://github.com/carpedm20/DCGAN-tensorflow.
[17] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[18] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, CS Dept., U. Toronto, 2009.
[19] A. Krizhevsky and G. Hinton.
Convolutional deep belief networks on CIFAR-10. Unpublished manuscript, 40, 2010.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[21] J. Lazarow, L. Jin, and Z. Tu. Introspective neural networks for generative modeling. In ICCV, 2017.
[22] D. B. Leake. Introspective learning and reasoning. In Encyclopedia of the Sciences of Learning, pages 1638–1640. Springer, 2012.
[23] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
[24] Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 1998.
[25] C.-Y. Lee, P. W. Gallagher, and Z. Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In AISTATS, 2016.
[26] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, 2015.
[27] P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In ICML, 2008.
[28] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media, 2008.
[29] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
[30] S. Mandt, M. D. Hoffman, and D. M. Blei. Stochastic gradient descent as approximate Bayesian inference. arXiv preprint arXiv:1704.04289, 2017.
[31] C. Z. Mooney, R. D. Duval, and R. Duvall. Bootstrapping: A Nonparametric Approach to Statistical Inference. Number 94-95. Sage, 1993.
[32] A. Mordvintsev, C. Olah, and M. Tyka. DeepDream - a code example for visualizing neural networks. Google Research, 2015.
[33] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng.
Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[34] J. R. Quinlan. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4:77–90, 1996.
[35] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[36] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
[37] B. Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11, 2010.
[38] A. Sinha, M. Sarkar, A. Mukherjee, and B. Krishnamurthy. Introspection: Accelerating neural network training by learning weight evolution. arXiv preprint arXiv:1704.04959, 2017.
[39] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[40] Z. Tu. Learning generative models via discriminative approaches. In CVPR, 2007.
[41] Z. Tu, K. L. Narr, P. Dollár, I. Dinov, P. M. Thompson, and A. W. Toga. Brain anatomical structure segmentation by hybrid discriminative/generative models. IEEE Transactions on Medical Imaging, 27(4):495–508, 2008.
[42] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., 1995.
[43] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.
[44] M. Welling, R. S. Zemel, and G. E. Hinton. Self supervised boosting. In NIPS, 2002.
[45] Y. Wu. Tensorpack toolbox. https://github.com/ppwwyyxx/tensorpack/tree/master/examples/ResNet.
[46] Y. N. Wu, S. C. Zhu, and X. Liu. Equivalence of Julesz ensembles and FRAME models. International Journal of Computer Vision, 38(3), 2000.
[47] J. Xie, Y. Lu, S.-C. Zhu, and Y. N. Wu.
Cooperative training of descriptor and generator networks. arXiv preprint arXiv:1609.09408, 2016.
[48] J. Xie, Y. Lu, S.-C. Zhu, and Y. N. Wu. A theory of generative ConvNet. In ICML, 2016.
[49] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.
[50] S. C. Zhu, Y. N. Wu, and D. Mumford. Minimax entropy principle and its application to texture modeling. Neural Computation, 9(8):1627–1660, 1997.
[51] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Science, University of Wisconsin-Madison, 2005.