{"title": "Semi-supervised Learning with GANs: Manifold Invariance with Improved Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 5534, "page_last": 5544, "abstract": "Semi-supervised learning methods using Generative adversarial networks (GANs) have shown promising empirical success recently. Most of these methods use a shared discriminator/classifier which discriminates real examples from fake while also predicting the class label. Motivated by the ability of the GANs generator to capture the data manifold well, we propose to estimate the tangent space to the data manifold using GANs and employ it to inject invariances into the classifier. In the process, we propose enhancements over existing methods for learning the inverse mapping (i.e., the encoder) which greatly improves in terms of semantic similarity of the reconstructed sample with the input sample. We observe considerable empirical gains in semi-supervised learning over baselines, particularly in the cases when the number of labeled examples is low. We also provide insights into how fake examples influence the semi-supervised learning procedure.", "full_text": "Semi-supervised Learning with GANs: Manifold\n\nInvariance with Improved Inference\n\nAbhishek Kumar\u2217\nIBM Research AI\n\nYorktown Heights, NY\nabhishk@us.ibm.com\n\nPrasanna Sattigeri\u2217\nIBM Research AI\n\nYorktown Heights, NY\npsattig@us.ibm.com\n\nP. Thomas Fletcher\nUniversity of Utah\nSalt Lake City, UT\n\nfletcher@sci.utah.edu\n\nAbstract\n\nSemi-supervised learning methods using Generative adversarial networks (GANs)\nhave shown promising empirical success recently. Most of these methods use a\nshared discriminator/classi\ufb01er which discriminates real examples from fake while\nalso predicting the class label. 
Motivated by the ability of the GANs generator to\ncapture the data manifold well, we propose to estimate the tangent space to the data\nmanifold using GANs and employ it to inject invariances into the classi\ufb01er. In the\nprocess, we propose enhancements over existing methods for learning the inverse\nmapping (i.e., the encoder) which greatly improves in terms of semantic similarity\nof the reconstructed sample with the input sample. We observe considerable\nempirical gains in semi-supervised learning over baselines, particularly in the cases\nwhen the number of labeled examples is low. We also provide insights into how\nfake examples in\ufb02uence the semi-supervised learning procedure.\n\n1\n\nIntroduction\n\nDeep generative models (both implicit [11, 23] as well as prescribed [16]) have become widely\npopular for generative modeling of data. Generative adversarial networks (GANs) [11] in particular\nhave shown remarkable success in generating very realistic images in several cases [30, 4]. The\ngenerator in a GAN can be seen as learning a nonlinear parametric mapping g : Z \u2192 X to the data\nmanifold. In most applications of interest (e.g., modeling images), we have dim(Z) (cid:28) dim(X). A\ndistribution pz over the space Z (e.g., uniform), combined with this mapping, induces a distribution\npg over the space X and a sample from this distribution can be obtained by ancestral sampling, i.e.,\nz \u223c pz, x = g(z). GANs use adversarial training where the discriminator approximates (lower\nbounds) a divergence measure (e.g., an f-divergence) between pg and the real data distribution px by\nsolving an optimization problem, and the generator tries to minimize this [28, 11]. It can also be seen\nfrom another perspective where the discriminator tries to tell apart real examples x \u223c px from fake\nexamples xg \u223c pg by minimizing an appropriate loss function[10, Ch. 
14.2.4] [21], and the generator\ntries to generate samples that maximize that loss [39, 11].\nOne of the primary motivations for studying deep generative models is for semi-supervised learning.\nIndeed, several recent works have shown promising empirical results on semi-supervised learning\nwith both implicit as well as prescribed generative models [17, 32, 34, 9, 20, 29, 35]. Most state-of-\nthe-art semi-supervised learning methods using GANs [34, 9, 29] use the discriminator of the GAN\nas the classi\ufb01er which now outputs k + 1 probabilities (k probabilities for the k real classes and one\nprobability for the fake class).\nWhen the generator of a trained GAN produces very realistic images, it can be argued to capture\nthe data manifold well whose properties can be used for semi-supervised learning. In particular, the\n\n\u2217Contributed equally.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\ftangent spaces of the manifold can inform us about the desirable invariances one may wish to inject\nin a classi\ufb01er [36, 33]. In this work we make following contributions:\n\u2022 We propose to use the tangents from the generator\u2019s mapping to automatically infer the desired\ninvariances and further improve on semi-supervised learning. This can be contrasted with methods\nthat assume the knowledge of these invariances (e.g., rotation, translation, horizontal \ufb02ipping, etc.)\n[36, 18, 25, 31].\n\u2022 Estimating tangents for a real sample x requires us to learn an encoder h that maps from data to\nlatent space (inference), i.e., h : X \u2192 Z. 
We propose enhancements over existing methods for\nlearning the encoder [8, 9] which improve the semantic match between x and g(h(x)) and counter\nthe problem of class-switching.\n\u2022 Further, we provide insights into the workings of GAN based semi-supervised learning methods\n\n[34] on how fake examples affect the learning.\n\n2 Semi-supervised learning using GANs\n\nMost of the existing methods for semi-supervised learning using GANs modify the regular GAN\ndiscriminator to have k outputs corresponding to k real classes [38], and in some cases a (k + 1)\u2019th\noutput that corresponds to fake samples from the generator [34, 29, 9]. The generator is mainly\nused as a source of additional data (fake samples) which the discriminator tries to classify under the\n(k + 1)th label. We propose to use the generator to obtain the tangents to the image manifold and use\nthese to inject invariances into the classi\ufb01er [36].\n\n2.1 Estimating the tangent space of data manifold\n\nEarlier work has used contractive autoencoders (CAE) to estimate the local tangent space at each\npoint [33]. CAEs optimize the regular autoencoder loss (reconstruction error) augmented with an\nadditional (cid:96)2-norm penalty on the Jacobian of the encoder mapping. Rifai et al. [33] intuitively reason\nthat the encoder of the CAE trained in this fashion is sensitive only to the tangent directions and use\nthe dominant singular vectors of the Jacobian of the encoder as the tangents. This, however, involves\nextra computational overhead of doing an SVD for every training sample which we will avoid in\nour GAN based approach. 
GANs have also been established to generate better quality samples than\nprescribed models (e.g., reconstruction loss based approaches) like VAEs [16] and hence can be\nargued to learn a more accurate parameterization of the image manifold.\nThe trained generator of the GAN serves as a parametric mapping from a low dimensional space Z to\na manifold M embedded in the higher dimensional space X, g : Z \u2192 X, where Z is an open subset\nin Rd and X is an open subset in RD under the standard topologies on Rd and RD, respectively\n(d (cid:28) D). This map is not surjective and the range of g is restricted to M.2 We assume g is a smooth,\ninjective mapping, so that M is an embedded manifold. The Jacobian of a function f : Rd \u2192 RD\nat z \u2208 Rd, Jzf, is the matrix of partial derivatives (of shape D \u00d7 d). The Jacobian of g at z \u2208 Z,\nJzg, provides a mapping from the tangent space at z \u2208 Z into the tangent space at x = g(z) \u2208 X,\ni.e., Jzg : TzZ \u2192 TxX. It should be noted that TzZ is isomorphic to Rd and TxX is isomorphic to\nRD. However, this mapping is not surjective and the range of Jzg is restricted to the tangent space of\nthe manifold M at x = g(z), denoted as TxM (for all z \u2208 Z). As GANs are capable of generating\nrealistic samples (particularly for natural images), one can argue that M approximates the true data\nmanifold well and hence the tangents to M obtained using Jzg are close to the tangents to the true\ndata manifold. 
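As a concrete illustration of how a generator's Jacobian yields tangent directions, the following toy sketch uses a one-dimensional latent space mapped onto the unit circle (a stand-in for g; the actual generator in the paper is a learned network). The Jacobian is computed by finite differences, and the single column of J_zg is checked to be tangent to the circle at x = g(z):

```python
import numpy as np

# Toy "generator" mapping a 1-D latent space onto the unit circle in R^2,
# standing in for g: Z -> X with d = 1, D = 2. The column of the Jacobian
# J_z g spans the tangent space of the manifold at x = g(z).
def g(z):
    return np.array([np.cos(z[0]), np.sin(z[0])])

def numerical_jacobian(f, z, eps=1e-6):
    """Forward-difference Jacobian of f at z, shape (D, d)."""
    z = np.asarray(z, dtype=float)
    f0 = f(z)
    cols = []
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        cols.append((f(z + dz) - f0) / eps)
    return np.stack(cols, axis=1)

z = np.array([0.7])
x = g(z)
J = numerical_jacobian(g, z)                # shape (2, 1)
tangent = J[:, 0] / np.linalg.norm(J[:, 0])

# For the circle, the tangent at x must be orthogonal to x itself.
assert abs(np.dot(tangent, x)) < 1e-4
```

For a learned generator the same computation would use automatic differentiation instead of finite differences, but the geometry is identical: the columns of J_zg span T_xM.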
The problem of learning a smooth manifold from \ufb01nite samples has been studied in\nthe literature[5, 2, 27, 6, 40, 19, 14, 3], and it is an interesting problem in its own right to study the\nmanifold approximation error of GANs, which minimize a chosen divergence measure between the\ndata distribution and the fake distribution [28, 23] using \ufb01nite samples, however this is outside the\nscope of the current work.\nFor a given data sample x \u2208 X, we need to \ufb01nd its corresponding latent representation z before we\ncan use Jzg to get the tangents to the manifold M at x. For our current discussion we assume the\navailability of a so-called encoder h : X \u2192 Z, such that h(g(z)) = z \u2200 z \u2208 Z. By de\ufb01nition, the\n\n2We write g as a map from Z to X to avoid the unnecessary (in our context) burden of manifold terminologies\nand still being technically correct. This also enables us to get the Jacobian of g as a regular matrix in RD\u00d7d,\ninstead of working with the differential if g was taken as a map from Z to M.\n\n2\n\n\fJacobian of the generator at z, Jzg, can be used to get the tangent directions to the manifold at a point\nx = g(z) \u2208 M. The following lemma speci\ufb01es the conditions for existence of the encoder h and\nshows that such an encoder can also be used to get tangent directions. Later we will come back to the\nissues involved in training such an encoder.\nLemma 2.1. If the Jacobian of g at z \u2208 Z, Jzg, is full rank then g is locally invertible in the open\nneighborhood g(S) (S being an open neighborhood of z), and there exists a smooth h : g(S) \u2192 S\nsuch that h(g(y)) = y,\u2200 y \u2208 S. In this case, the Jacobian of h at x = g(z), Jxh, spans the tangent\nspace of M at x.\n\nProof. 
We refer the reader to standard textbooks on multivariate calculus and differentiable manifolds for the first statement of the lemma (e.g., [37]). The second statement can be deduced by looking at the Jacobian of the composition h ∘ g. We have Jz(h ∘ g) = Jg(z)h Jzg = Jxh Jzg = I_{d×d}, since h(g(z)) = z. This implies that the row span of Jxh coincides with the column span of Jzg. As the columns of Jzg span the tangent space Tg(z)M, so do the rows of Jxh.

2.1.1 Training the inverse mapping (the encoder)

To estimate the tangents for a given real data point x ∈ X, we need its corresponding latent representation z = h(x) ∈ Z, such that g(h(x)) = x in an ideal scenario. However, in practice g will only learn an approximation to the true data manifold, and the mapping g ∘ h will act like a projection of x (which will almost always be off the manifold M) onto the manifold M, yielding some approximation error. This projection may not be orthogonal, i.e., to the nearest point on M. Nevertheless, it is desirable that x and g(h(x)) are semantically close, and at the very least, that the class label is preserved by the mapping g ∘ h. We studied the following three approaches for training the inverse map h with regard to this desideratum:

• Decoupled training. This is similar to an approach outlined by Donahue et al. [8] where the generator is trained first and fixed thereafter, and the encoder is trained by optimizing a suitable reconstruction loss in the Z space, L(z, h(g(z))) (e.g., cross entropy, ℓ2). This approach does not yield good results: we observe that most of the time g(h(x)) is not semantically similar to the given real sample x, with the class label often changing. One of the reasons, as noted by Donahue et al. [8], is that the encoder never sees real samples during training. 
To address this, we also\nexperimented with the combined objective minh Lz(z, h(g(z))) + Lh(x, g(h(x))), however this\ntoo did not yield any signi\ufb01cant improvements in our early explorations.\n\u2022 BiGAN. Donahue et al. [8] propose to jointly train the encoder and generator using adversarial\ntraining, where the pair (z, g(z)) is considered a fake example (z \u223c pz) and the pair (h(x), x) is\nconsidered a real example by the discriminator. A similar approach is proposed by Dumoulin\net al. [9], where h(x) gives the parameters of the posterior p(z|x) and a stochastic sample from\nthe posterior paired with x is taken as a real example. We use BiGAN [8] in this work, with one\nmodi\ufb01cation: we use feature matching loss [34] (computed using features from an intermediate\nlayer (cid:96) of the discriminator f), i.e., (cid:107)Exf(cid:96)(h(x), x) \u2212 Ezf(cid:96)(z, g(z))(cid:107)2\n2, to optimize the generator\nand encoder, which we found to greatly help with the convergence 3. We observe better results in\nterms of semantic match between x and g(h(x)) than in the decoupled training approach, however,\nwe still observe a considerable fraction of instances where the class of g(h(x)) is changed (let us\nrefer to this as class-switching).\n\u2022 Augmented-BiGAN. To address the still-persistent problem of class-switching of the recon-\nstructed samples g(h(x)), we propose to construct a third pair (h(x), g(h(x)) which is also con-\nsidered by the discriminator as a fake example in addition to (z, g(z)). Our Augmented-BiGAN\nobjective is given as\nEx\u223cpx log f (h(x), x) +\n(1)\nwhere f (\u00b7,\u00b7) is the probability of the pair being a real example, as assigned by the discriminator\nf. We optimize the discriminator using the above objective (1). 
The generator and encoder are\nagain optimized using feature matching [34] loss on an intermediate layer (cid:96) of the discriminator,\ni.e., Lgh = (cid:107)Exf(cid:96)(h(x), x) \u2212 Ezf(cid:96)(z, g(z))(cid:107)2\n2, to help with the convergence. Minimizing Lgh\n3Note that other recently proposed methods for training GANs based on Integral Probability Metrics\n\nEx\u223cpx log(1 \u2212 f (h(x), g(h(x))),\n\n1\n2\n\nEz\u223cpz log(1 \u2212 f (z, g(z))) +\n\n1\n2\n\n[1, 13, 26, 24] could also improve the convergence and stability during training.\n\n3\n\n\fwill make x and g(h(x)) similar (through the lens of f(cid:96)) as in the case of BiGAN, however\nthe discriminator tries to make the features at layer f(cid:96) more dif\ufb01cult to achieve this by directly\noptimizing the third term in the objective (1). This results in improved semantic similarity between\nx and g(h(x)).\n\nWe empirically evaluate these approaches with regard to similarity between x and g(h(x)) both\nquantitatively and qualitatively, observing that Augmented-BiGAN works signi\ufb01cantly better than\nBiGAN. We note that ALI [9] also has the problems of semantic mismatch and class switching for\nreconstructed samples as reported by the authors, and a stochastic version of the proposed third term\nin the objective (1) can potentially help there as well, investigation of which is left for future work.\n\n2.1.2 Estimating the dominant tangent space\n\nOnce we have a trained encoder h such that g(h(x)) is a good approximation to x and h(g(z)) is a\ngood approximation to z, we can use either Jh(x)g or Jxh to get an estimate of the tangent space.\nSpeci\ufb01cally, the columns of Jh(x)g and the rows of Jxh are the directions that approximately span\nthe tangent space to the data manifold at x. 
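A minimal numerical sketch of how the three terms of the Augmented-BiGAN discriminator objective in Eq. (1) are assembled is given below. All networks here (f, g, h) are toy stand-ins with hypothetical shapes, not the paper's architectures; f directly returns the probability of a (z, x) pair being real:

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x = 2, 4                      # illustrative latent/data dimensions
W_g = rng.normal(size=(d_z, d_x))    # toy generator weights (stand-in)
W_h = rng.normal(size=(d_x, d_z))    # toy encoder weights (stand-in)

def f(z, x):
    # Toy discriminator on a (latent, data) pair; squashed into (0.01, 0.99)
    # so the logs below stay finite.
    s = np.tanh(np.sum(z) + np.sum(x))
    return 0.5 * (s + 1.0) * 0.98 + 0.01

def g(z):
    return np.tanh(z @ W_g)          # toy generator  Z -> X

def h(x):
    return np.tanh(x @ W_h)          # toy encoder    X -> Z

def augmented_bigan_disc_loss(x_batch, z_batch):
    # E_x log f(h(x), x): (h(x), x) is treated as a real pair.
    real = np.mean([np.log(f(h(x), x)) for x in x_batch])
    # (1/2) E_z log(1 - f(z, g(z))): (z, g(z)) is a fake pair.
    fake1 = np.mean([np.log(1.0 - f(z, g(z))) for z in z_batch])
    # (1/2) E_x log(1 - f(h(x), g(h(x)))): the reconstruction pair is also fake.
    fake2 = np.mean([np.log(1.0 - f(h(x), g(h(x)))) for x in x_batch])
    # The discriminator *maximizes* this quantity during training.
    return real + 0.5 * fake1 + 0.5 * fake2

x_batch = rng.normal(size=(8, d_x))
z_batch = rng.normal(size=(8, d_z))
loss = augmented_bigan_disc_loss(x_batch, z_batch)
assert np.isfinite(loss)
```

In the actual model the generator and encoder are trained against the feature-matching loss described in the text rather than by directly descending this objective.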
Almost all deep learning packages implement reverse\nmode differentiation (to do backpropagation) which is computationally cheaper than forward mode\ndifferentiation for computing the Jacobian when the output dimension of the function is low (and\nvice versa when the output dimension is high). Hence we use Jxh in all our experiments to get the\ntangents.\nAs there are approximation errors at several places (M \u223c data-manifold, g(h(x)) \u223c x, h(g(z)) \u223c z),\nit is preferable to only consider dominant tangent directions in the row span of Jxh. These can be\nobtained using the SVD on the matrix Jxh and taking the right singular vectors corresponding to\ntop singular values, as done in [33] where h is trained using a contractive auto-encoder. However,\nthis process is expensive as the SVD needs to be done independently for every data sample. We\nadopt an alternative approach to get dominant tangent direction: we take the pre-trained model with\nencoder-generator-discriminator (h-g-f) triple and insert two extra functions p : Rd \u2192 Rdp and \u00afp :\nRdp \u2192 Rd (with dp < d) which are learned by optimizing minp, \u00afp Ex[(cid:107)g(h(x))\u2212 g(\u00afp(p(h(x))))(cid:107)1 +\n(cid:107)f X\u22121(g(h(x))) \u2212 f X\u22121(g(\u00afp(p(h(x)))))(cid:107)] while g, h and f are kept \ufb01xed from the pre-trained model.\nNote that our discriminator f has two pipelines f Z and f X for the latent z \u2208 Z and the data x \u2208 X,\nrespectively, which share parameters in the last few layers (following [8]), and we use the last layer of\nf X in this loss. This enables us to learn a nonlinear (low-dimensional) approximation in the Z space\nsuch that g(\u00afp(p(h(x)))) is close to g(h(x)). 
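For reference, the per-sample SVD route to dominant tangents (the approach of [33] that the learned low-dimensional maps p and p̄ are designed to avoid) can be sketched as follows, with a toy encoder standing in for h and finite differences standing in for backpropagation:

```python
import numpy as np

# SVD baseline for dominant tangents: take the Jacobian of a toy encoder
# h: R^D -> R^d and keep the right singular vectors with the largest
# singular values. This is the per-sample SVD that the learned (p, p-bar)
# approximation avoids. The encoder here is an illustrative stand-in.
rng = np.random.default_rng(1)
D, d, d_p = 8, 4, 2
A = rng.normal(size=(D, d))

def h(x):
    return np.tanh(x @ A)            # toy encoder, X -> Z

def numerical_jacobian(f, x, eps=1e-6):
    x = np.asarray(x, dtype=float)
    f0 = f(x)
    return np.stack([(f(x + eps * e) - f0) / eps
                     for e in np.eye(x.size)], axis=1)   # shape (d, D)

x = rng.normal(size=D)
J = numerical_jacobian(h, x)         # rows approximately span the tangent space
U, S, Vt = np.linalg.svd(J, full_matrices=False)
dominant_tangents = Vt[:d_p]         # top right singular vectors, directions in X
```

The rows of `dominant_tangents` are orthonormal directions in data space; doing this for every training sample is exactly the per-example SVD cost that motivates the learned alternative.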
We use the Jacobian of p ∘ h, Jx(p ∘ h), as an estimate of the dp dominant tangent directions (dp = 10 in all our experiments)4.

2.2 Injecting invariances into the classifier using tangents

We use the tangent propagation approach (TangentProp) [36] to make the classifier invariant to the estimated tangent directions from the previous section. Apart from the regular classification loss on labeled examples, it uses a regularizer of the form

Σ_{i=1}^n Σ_{v ∈ Txi} ‖(Jxi c) v‖_2^2,

where Jxi c ∈ R^{k×D} is the Jacobian of the classifier function c at x = xi (with k the number of classes) and Txi is the set of tangent directions we want the classifier to be invariant to. This term penalizes the linearized variations of the classifier output along the tangent directions. Simard et al. [36] get the tangent directions using slight rotations and translations of the images, whereas we use the GAN to estimate the tangents to the data manifold.

We can go one step further and make the classifier invariant to small perturbations in all directions emanating from a point x. This leads to the regularizer

sup_{v: ‖v‖_p ≤ ε} ‖(Jx c) v‖_j^j ≤ Σ_{i=1}^k sup_{v: ‖v‖_p ≤ ε} |(Jx c)_{i:} v|^j = ε^j Σ_{i=1}^k ‖(Jx c)_{i:}‖_q^j,    (2)

where ‖·‖_q is the dual norm of ‖·‖_p (i.e., 1/p + 1/q = 1) and ‖·‖_j^j denotes the jth power of the ℓj-norm. This reduces to the squared Frobenius norm of the Jacobian matrix Jx c for p = j = 2. The penalty in Eq. (2) is closely related to the recent work on virtual adversarial training (VAT) [22], which uses the regularizer (cf. Eq. (1), (2) in [22])

sup_{v: ‖v‖_2 ≤ ε} KL[c(x) ‖ c(x + v)],    (3)

where c(x) are the classifier outputs (class probabilities). VAT [22] approximately estimates the v* that yields the sup using the gradient of KL[c(x) ‖ c(x + v)], calling (x + v*) a virtual adversarial example (due to its resemblance to adversarial training [12]), and uses KL[c(x) ‖ c(x + v*)] as the regularizer in the classifier objective. If we replace the KL-divergence in Eq. (3) with the total-variation distance and optimize its first-order approximation, it becomes equivalent to the regularizer in Eq. (2) for j = 1 and p = 2.

In practice, it is computationally expensive to optimize these Jacobian-based regularizers. Hence in all our experiments we use a stochastic finite-difference approximation for all Jacobian-based regularizers. For TangentProp, we use ‖c(xi + v) − c(xi)‖_2^2 with v randomly sampled (i.i.d.) from the set of tangents Txi every time example xi is visited by the SGD. For the Jacobian-norm regularizer of Eq. (2), we use ‖c(x + δ) − c(x)‖_2^2 with δ ~ N(0, σ²I) (i.i.d.) every time an example x is visited by the SGD, which approximates an upper bound on Eq. (2) in expectation (up to scaling) for j = 2 and p = 2.

4 Training the GAN with z ∈ Z ⊂ R^{dp} results in a bad approximation of the data manifold. Hence we first learn the GAN with Z ⊂ R^d and then approximate the smooth manifold M parameterized by the generator using p and p̄ to get the dominant dp tangent directions to M.

2.3 GAN discriminator as the classifier for semi-supervised learning: effect of fake examples

Recent works have used GANs for semi-supervised learning where the discriminator also serves as a classifier [34, 9, 29]. 
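The stochastic finite-difference approximations used for the Jacobian regularizers at the end of Sec. 2.2 can be sketched as follows; the classifier, shapes, and step sizes here are illustrative stand-ins, not the paper's networks or hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy softmax classifier c: R^D -> simplex over k classes (stand-in).
D, k = 6, 3
W = rng.normal(size=(D, k))

def c(x):
    logits = x @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()

def tangent_prop_penalty(x, tangents, eps=1e-2):
    """Finite-difference TangentProp term: ||c(x + eps*v) - c(x)||^2 for one
    v sampled uniformly from the tangent set each time x is visited."""
    v = tangents[rng.integers(len(tangents))]
    v = v / np.linalg.norm(v)
    return np.sum((c(x + eps * v) - c(x)) ** 2)

def jacobian_norm_penalty(x, sigma=1e-2):
    """Finite-difference Jacobian-norm term: ||c(x + delta) - c(x)||^2 with
    delta ~ N(0, sigma^2 I) drawn fresh each visit."""
    delta = sigma * rng.normal(size=x.shape)
    return np.sum((c(x + delta) - c(x)) ** 2)

x = rng.normal(size=D)
tangents = rng.normal(size=(10, D))   # stand-in for the estimated tangent set
r1 = tangent_prop_penalty(x, tangents)
r2 = jacobian_norm_penalty(x)
assert r1 >= 0.0 and r2 >= 0.0
```

Both penalties vanish when the classifier output is locally constant along the sampled direction, which is exactly the invariance the exact Jacobian terms would enforce.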
For a semi-supervised learning problem with k classes, the discriminator has k + 1 outputs, with the (k + 1)-th output corresponding to the fake examples originating from the generator of the GAN. The loss for the discriminator f is given as [34]

Lf = Lf_sup + Lf_unsup, where
Lf_sup = −E_{(x,y)~pd(x,y)} log pf(y|x, y ≤ k) and
Lf_unsup = −E_{x~pg(x)} log pf(y = k + 1|x) − E_{x~pd(x)} log(1 − pf(y = k + 1|x)).    (4)

The term pf(y = k + 1|x) is the probability of x being a fake example and (1 − pf(y = k + 1|x)) is the probability of x being a real example (as assigned by the model). The loss component Lf_unsup is the same as the regular GAN discriminator loss, with the only modification that the probabilities for real vs. fake are compiled from the (k + 1) outputs. Salimans et al. [34] proposed training the generator using feature matching, where the generator minimizes the mean discrepancy between the features for real and fake examples obtained from an intermediate layer ℓ of the discriminator f, i.e., Lg = ‖Ex fℓ(x) − Ez fℓ(g(z))‖_2^2. Using the feature matching loss for the generator was empirically shown to result in much better accuracy for semi-supervised learning compared to other training methods, including minibatch discrimination and the regular GAN generator loss [34].

Here we attempt to develop an intuitive understanding of how fake examples influence the learning of the classifier and why the feature matching loss may work much better for semi-supervised learning compared to the regular GAN. We will use the terms classifier and discriminator interchangeably based on the context; however, they are really the same network as mentioned earlier. Following [34], we assume the (k + 1)-th logit is fixed to 0, as subtracting a term v(x) from all logits does not change the softmax probabilities. Rewriting the unlabeled loss of Eq. 
(4) in terms of logits li(x), i = 1, 2, . . . , k, we have

Lf_unsup = E_{xg~pg} log(1 + Σ_{i=1}^k e^{li(xg)}) − E_{x~pd} [log(Σ_{i=1}^k e^{li(x)}) − log(1 + Σ_{i=1}^k e^{li(x)})].    (5)

Taking the derivative w.r.t. the discriminator's parameters θ, followed by some basic algebra, we get

∇θ Lf_unsup = E_{xg~pg} Σ_{i=1}^k pf(y = i|xg) ∇li(xg) − E_{x~pd} [Σ_{i=1}^k pf(y = i|x, y ≤ k) ∇li(x) − Σ_{i=1}^k pf(y = i|x) ∇li(x)]
            = E_{xg~pg} Σ_{i=1}^k ai(xg) ∇li(xg) − E_{x~pd} Σ_{i=1}^k bi(x) ∇li(x),    (6)

where ai(xg) := pf(y = i|xg) and bi(x) := pf(y = i|x, y ≤ k) pf(y = k + 1|x).

Minimizing Lf_unsup will move the parameters θ so as to decrease li(xg) and increase li(x) (i = 1, . . . , k). The rate of increase in li(x) is also modulated by pf(y = k + 1|x). This results in warping of the functions li(x) around each real example x, with more warping around examples about which the current model f is more confident that they belong to class i: li(·) becomes locally concave around those real examples x if the xg are loosely scattered around x. Let us consider the following three cases:

Weak fake examples. When the fake examples coming from the generator are very weak (i.e., very easy for the current discriminator to distinguish from real examples), we will have pf(y = k + 1|xg) ≈ 1, pf(y = i|xg) ≈ 0 for 1 ≤ i ≤ k, and pf(y = k + 1|x) ≈ 0. Hence there is no gradient flow from Eq. (6), rendering unlabeled data almost useless for semi-supervised learning.

Strong fake examples. 
When the fake examples are very strong (i.e., dif\ufb01cult for the current\ndiscriminator to distinguish from real ones), we have pf (k + 1|xg) \u2248 0.5 + \u00011, pf (y = imax|xg) \u2248\n0.5\u2212 \u00012 for some imax \u2208 {1, . . . , k} and pf (y = k + 1|x) \u2248 0.5\u2212 \u00013 (with \u00012 > \u00011 \u2265 0 and \u00013 \u2265 0).\nNote that bi(x) in this case would be smaller than ai(x) since it is a product of two probabilities. If\ntwo examples x and xg are close to each other with imax = arg maxi li(x) = arg maxi li(xg) (e.g.,\nx is a cat image and xg is a highly realistic generated image of a cat), the optimization will push\nlimax(x) up by some amount and will pull limax(xg) down by a larger amount. We further want to\nconsider two cases here: (i) Classi\ufb01er with enough capacity: If the classi\ufb01er has enough capacity,\nthis will make the curvature of limax(\u00b7) around x really high (with limax(\u00b7) locally concave around x)\nsince x and xg are very close. This results in over-\ufb01tting around the unlabeled examples and for a test\nexample xt closer to xg (which is quite likely to happen since xg itself was very realistic sample), the\nmodel will more likely misclassify xt. (ii) Controlled-capacity classi\ufb01er: Suppose the capacity of\nthe classi\ufb01er is controlled with adequate regularization. In that case the curvature of the function\nlimax(\u00b7) around x cannot increase beyond a point. However, this results in limax(x) being pulled down\nby the optimization process since ai(xg) > bi(x). This is more pronounced for examples x on which\nthe classi\ufb01er is not so con\ufb01dent (i.e., pf (y = imax|x, y \u2264 k) is low, although still assigning highest\nprobability to class imax) since the gap between ai(xg) and bi(x) becomes higher. 
For these examples,\nthe entropy of the distribution {p(y = i|x, y \u2264 k)}k\ni=1 may actually increase as the training proceeds\nwhich can hurt the test performance.\nModerate fake examples. When the fake examples from the generator are neither too weak nor too\nstrong for the current discriminator (i.e., xg is a somewhat distorted version of x), the unsupervised\ngradient will push limax(x) up while pulling limax(xg) down, giving rise to a moderate curvature of\nli(\u00b7) around real examples x since xg and x are suf\ufb01ciently far apart (consider multiple distorted cat\nimages scattered around a real cat image at moderate distances). This results in a smooth decision\nfunction around real unlabeled examples. Again, the curvatures of li(\u00b7) around x for classes i which\nthe current classi\ufb01er does not trust for the example x are not affected much. Further, pf (y = k + 1|x)\nwill be less than the case when fake examples are very strong. Similarly pf (y = imax|xg) (where\nimax = arg max1\u2264i\u2264k li(xg)) will be less than the case of strong fake examples. Hence the norm\nof the gradient in Eq. (6) is lower and the contribution of unlabeled data in the overall gradient of\nLf (Eq. (4) is lower than the case of strong fake examples. This intuitively seems bene\ufb01cial as the\nclassi\ufb01er gets ample opportunity to learn on supervised loss and get con\ufb01dent on the right class for\nunlabeled examples, and then boost this con\ufb01dence slowly using the gradient of Eq. 
(6) as the training proceeds.

We experimented with the regular GAN loss (i.e., Lg = E_{x~pg} log(pf(y = k + 1|x))) and the feature matching loss for the generator [34], plotting several of the quantities of interest discussed above for the MNIST (with 100 labeled examples) and SVHN (with 1000 labeled examples) datasets in Fig. 1. The generator trained with the feature matching loss corresponds to the case of moderate fake examples discussed above (as it generates blurry and distorted samples, as mentioned in [34]). The generator trained with the regular GAN loss corresponds to the case of strong fake examples discussed above. We plot E_{xg} a_{imax}(xg) for imax = arg max_{1≤i≤k} li(xg) and E_{xg}[(1/(k − 1)) Σ_{1≤i≠imax≤k} ai(xg)] separately to look into the behavior of the imax logit. Similarly, we plot E_x bt(x) separately, where t is the true label for the unlabeled example x (we assume knowledge of the true label only for plotting these quantities and not while training the semi-supervised GAN). Other quantities in the plots are self-explanatory. As expected, the unlabeled loss Lf_unsup for the regular GAN becomes quite high early on, implying that fake examples are strong. The gap between a_{imax}(xg) and bt(x) is also higher for the regular GAN, pointing towards the case of strong fake examples with a controlled-capacity classifier as discussed above. Indeed, we see that the average of the entropies of the distributions pf(y|x, y ≤ k) (i.e., E_x H(pf(y|x, y ≤ k))) is much lower for feature-matching GAN compared to regular GAN (seven times lower for SVHN, ten times lower for MNIST).

Figure 1: Plots of entropy, Lf_unsup (Eq. (4)), ai(xg), bi(x) and other probabilities (Eq. (6)) for the regular GAN generator loss and the feature-matching GAN generator loss.

Test errors for MNIST for regular GAN and FM-GAN were 2.49% (500 epochs) and 0.86% (300 epochs), respectively. 
Test errors for SVHN were 13.36% (regular-GAN at 738 epochs) and 5.89% (FM-GAN at 883 epochs), respectively5. It should also be emphasized that semi-supervised learning heavily depends on the generator dynamically adapting fake examples to the current discriminator; we observed that freezing the training of the generator at any point results in the discriminator being able to classify them easily (i.e., pf(y = k + 1|xg) ≈ 1), thus stopping the contribution of unlabeled examples to the learning.

Our final loss for semi-supervised learning. We use the feature matching GAN with the semi-supervised loss of Eq. (4) as our classifier objective and incorporate the invariances from Sec. 2.2 in it. Our final objective for the GAN discriminator is

Lf = Lf_sup + Lf_unsup + λ1 E_{x~pd(x)} Σ_{v∈Tx} ‖(Jx f) v‖_2^2 + λ2 E_{x~pd(x)} ‖Jx f‖_F^2.    (7)

The third term in the objective makes the classifier decision function change slowly along tangent directions around a real example x. As mentioned in Sec. 2.2, we use a stochastic finite-difference approximation for both Jacobian terms due to computational reasons.

3 Experiments

Implementation Details. The architectures of the encoder, generator and discriminator closely follow the network structures in ALI [9]. We remove the stochastic layer from the ALI encoder (i.e., h(x) is deterministic). For estimating the dominant tangents, we employ a fully connected two-layer network with a tanh non-linearity in the hidden layer to represent p̄ ∘ p. The output of p is taken from the hidden layer. Batch normalization was replaced by weight normalization in all the modules to make the output h(x) (similarly g(z)) dependent only on the given input x (similarly z) and not on the whole minibatch. This is necessary to make the Jacobians Jxh and Jzg independent of the other examples in the minibatch. 
We replaced all ReLU nonlinearities in the encoder and the generator with Exponential Linear Units (ELU) [7] to ensure smoothness of the functions g and h. We follow [34] completely for optimization (using the ADAM optimizer [15] with the same learning rates as in [34]). Generators (and encoders, where applicable) in all the models are trained with the feature matching loss.

⁵We also experimented with the minibatch-discrimination (MD) GAN [34], but the minibatch features are not suited for classification, as the prediction for an example x is adversely affected by the features of all other examples (note that this is different from batch normalization). Indeed, we noticed that the training error for MD-GAN is 10x that of the regular GAN and FM-GAN. MD-GAN gave a similar test error to regular-GAN.

Figure 2: Comparing BiGAN with Augmented BiGAN based on the classification error on the reconstructed test images. Left column: CIFAR10; right column: SVHN. In the images, the top row corresponds to the original images, followed by the BiGAN reconstructions in the middle row and the Augmented BiGAN reconstructions in the bottom row. More images can be found in the appendix.

Figure 3: Visualizing tangents. Top: CIFAR10; bottom: SVHN. Odd rows: tangents using our method for estimating the dominant tangent space. Even rows: tangents using SVD on J_{h(x)}g and Jx h. First column: original image. Second column: reconstructed image using g ∘ h. Third column: reconstructed image using g ∘ p̄ ∘ p ∘ h. Columns 4-13: tangents using the encoder. Columns 14-23: tangents using the generator.

Semantic Similarity. The image samples x and their reconstructions g(h(x)) for BiGAN and Augmented-BiGAN can be seen in Fig. 2.
To quantitatively measure the semantic similarity of the reconstructions to the original images, we learn a supervised classifier using the full training set and obtain the classification accuracy on the reconstructions of the test images. The architectures of the classifiers for CIFAR10 and SVHN are similar to the corresponding GAN discriminator architectures we use. The lower error rates with our Augmented-BiGAN suggest that it leads to reconstructions with reduced class-switching.
Tangent approximations. Tangents for CIFAR10 and SVHN are shown in Fig. 3. We show a visual comparison of tangents from J_x(p ∘ h), from J_{p(h(x))}(g ∘ p̄), and from Jx h and J_{h(x)}g followed by the SVD to obtain the dominant tangents. It can be seen that the proposed method for obtaining dominant tangent directions gives tangents similar to those from the SVD. The tangents from the generator (columns 14-23) look different (more colorful) from the tangents from the encoder (columns 4-13), though they do trace the boundaries of the objects in the image (just like the tangents from the encoder). We also empirically evaluate our method for dominant tangent subspace estimation against the SVD estimation by computing the geodesic distances and principal angles between the two estimates. These results are shown in Table 2.
Semi-supervised learning results. Table 1 shows the results for SVHN and CIFAR10 with varying numbers of labeled examples. For all experiments with the tangent regularizer, for both CIFAR10 and SVHN, we use 10 tangents. The hyperparameters λ1 and λ2 in Eq. (7) are set to 1.
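For intuition, the tangent directions used by the regularizer span the dominant column space of the generator's Jacobian at x = g(z). A toy numpy sketch under our own simplifications (finite-difference Jacobian and a plain SVD; the paper instead trains the network p ∘ p̄ to avoid computing SVDs per example):

```python
import numpy as np

def dominant_tangents(g, z, n_tangents, eps=1e-4):
    """Estimate an orthonormal basis for the dominant tangent space of the
    data manifold at x = g(z): build J_z g column-by-column with finite
    differences, then keep the top left-singular vectors."""
    x0 = g(z)
    J = np.stack([(g(z + eps * e) - x0) / eps for e in np.eye(z.size)],
                 axis=1)                       # shape (dim_x, dim_z)
    U, _, _ = np.linalg.svd(J, full_matrices=False)
    return U[:, :n_tangents]                   # orthonormal tangent basis
```

With dim(Z) small (as in the GANs here), the finite-difference loop costs only dim(Z) + 1 generator evaluations per point.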
We obtain significant improvements over baselines, particularly for SVHN and more so for the case of 500 labeled examples.

Model                            | SVHN Nl=500  | SVHN Nl=1000 | CIFAR-10 Nl=1000 | CIFAR-10 Nl=4000
VAE (M1+M2) [17]                 | –            | 36.02 ± 0.10 | –                | –
SWWAE with dropout [41]          | –            | 23.56        | –                | –
VAT [22]                         | –            | 24.63        | –                | –
Skip DGM [20]                    | –            | 16.61 ± 0.24 | –                | –
Ladder network [32]              | –            | –            | –                | 20.40
ALI [9]                          | –            | 7.41 ± 0.65  | 19.98 ± 0.89     | 17.99 ± 1.62
FM-GAN [34]                      | 18.44 ± 4.8  | 8.11 ± 1.3   | 21.83 ± 2.01     | 18.63 ± 2.32
Temporal ensembling [18]         | 5.12 ± 0.13  | 4.42 ± 0.16  | –                | 12.16 ± 0.24
FM-GAN + Jacob.-reg (Eq. (2))    | 10.28 ± 1.8  | 4.74 ± 1.2   | 20.87 ± 1.7      | 16.84 ± 1.5
FM-GAN + Tangents                | 5.88 ± 1.5   | 5.26 ± 1.1   | 20.23 ± 1.3      | 16.96 ± 1.4
FM-GAN + Jacob.-reg + Tangents   | 4.87 ± 1.6   | 4.39 ± 1.2   | 19.52 ± 1.5      | 16.20 ± 1.6

Table 1: Test error with semi-supervised learning on SVHN and CIFAR-10 (Nl is the number of labeled examples). All results for the proposed methods (last 3 rows) are obtained by training the model for 600 epochs for SVHN and 900 epochs for CIFAR10, and are averaged over 5 runs.

                       | d(S1, S2) | θ1 | θ2 | θ3 | θ4 | θ5 | θ6 | θ7 | θ8 | θ9 | θ10
Rand-Rand              | 4.5       | 14 | 83 | 85 | 86 | 87 | 87 | 88 | 88 | 88 | 89
SVD-Approx. (CIFAR)    | 2.6       | 2  | 15 | 21 | 26 | 34 | 40 | 50 | 61 | 73 | 85
SVD-Approx. (SVHN)     | 2.3       | 1  | 7  | 12 | 16 | 22 | 30 | 41 | 51 | 67 | 82

Table 2: Dominant tangent subspace approximation quality: columns show the geodesic distance and the 10 principal angles between the two subspaces. The top row shows results for two randomly sampled 10-dimensional subspaces in 3072-dimensional space; the middle and bottom rows show results for the dominant subspace obtained using the SVD of Jx h versus the dominant subspace obtained using our method, for CIFAR-10 and SVHN, respectively. All numbers are averages over 10 randomly sampled test examples.

We do not get as good results on CIFAR10, which may be due to the fact that our encoder for CIFAR10 is still not able to approximate the inverse of the generator well (as is evident from the sub-optimal reconstructions we get for CIFAR10), and hence the tangents we get are not good enough. We think that obtaining better estimates of the tangents for CIFAR10 has the potential to further improve the results. The ALI [9] accuracy for CIFAR (Nl = 1000) is also close to our results; however, the ALI results were obtained by running the optimization for 6475 epochs with a slower learning rate, as mentioned in [9]. Temporal ensembling [18] uses explicit data augmentation, assuming knowledge of the class-preserving transformations on the input, while our method estimates these transformations from the data manifold in the form of tangent vectors. It outperforms our method by a significant margin on CIFAR-10, which could be due to the fact that it uses horizontal-flipping-based augmentation for CIFAR-10, a non-smooth transformation that cannot be learned through the tangents.
The use of temporal ensembling in conjunction with our method has the potential to further improve the semi-supervised learning results.
4 Discussion
Our empirical results show that using the tangents of the data manifold (as estimated by the generator of the GAN) to inject invariances into the classifier improves the performance on semi-supervised learning tasks. In particular, we observe impressive accuracy gains on SVHN (more so for the case of 500 labeled examples), for which the tangents obtained are of good quality.
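The principal angles reported in Table 2 can be computed with the standard Björck-Golub procedure: orthonormalize each basis and take the SVD of their cross-product. A brief numpy sketch (our own illustration; the paper does not specify how its angles were computed):

```python
import numpy as np

def principal_angles_deg(U, V):
    """Principal angles (in degrees) between the column spaces of U and V:
    orthonormalize both bases with QR; the singular values of Qu^T Qv
    are the cosines of the principal angles (Bjorck-Golub)."""
    Qu, _ = np.linalg.qr(U)
    Qv, _ = np.linalg.qr(V)
    cosines = np.linalg.svd(Qu.T @ Qv, compute_uv=False)
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
```

Identical subspaces give all angles 0, orthogonal subspaces give 90, matching the Rand-Rand baseline in Table 2 where most angles of random high-dimensional subspaces are near 90 degrees.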
We also observe improvements on CIFAR10, though not as impressive as on SVHN. We think that improving the quality of the tangents for CIFAR10 has the potential to further improve the results there, which is a direction for future exploration. We also shed light on the effect of fake examples in the common framework used for semi-supervised learning with GANs, where the discriminator predicts the real class labels along with the fake label. Explicitly controlling the difficulty level of the fake examples (i.e., pf(y = k + 1|xg), and hence indirectly pf(y = k + 1|x) in Eq. (6)) to do more effective semi-supervised learning is another direction for future work. One possible way to do this is to have a distortion model for the real examples (i.e., replace the generator with a distorter that takes the real examples as input) whose strength is controlled for more effective semi-supervised learning.

References

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[2] Alexander V. Bernstein and Alexander P. Kuleshov. Tangent bundle manifold learning via Grassmann & Stiefel eigenmaps. arXiv preprint arXiv:1212.6031, 2012.

[3] A. V. Bernstein and A. P. Kuleshov. Data-based manifold reconstruction via tangent bundle manifold learning. In ICML-2014, Topological Methods for Machine Learning Workshop, Beijing, volume 25, pages 1-6, 2014.

[4] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.

[5] Guillermo Canas, Tomaso Poggio, and Lorenzo Rosasco. Learning manifolds with k-means and k-flats. In Advances in Neural Information Processing Systems, pages 2465-2473, 2012.

[6] Guangliang Chen, Anna V. Little, Mauro Maggioni, and Lorenzo Rosasco. Some recent advances in multiscale geometric analysis of point clouds. In Wavelets and Multiscale Analysis, pages 199-225.
Springer, 2011.

[7] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

[8] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

[9] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

[10] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics. Springer, Berlin, 2001.

[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

[12] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[13] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.

[14] Kui Jia, Lin Sun, Shenghua Gao, Zhan Song, and Bertram E. Shi. Laplacian auto-encoders: An explicit learning of nonlinear data manifold. Neurocomputing, 160:250-260, 2015.

[15] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[16] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[17] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581-3589, 2014.

[18] Samuli Laine and Timo Aila.
Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.

[19] G. Lerman and T. Zhang. Probabilistic recovery of multiple subspaces in point clouds by geometric ℓp minimization. Preprint, 2010.

[20] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.

[21] Aditya Menon and Cheng Soon Ong. Linking losses for density ratio and class-probability estimation. In International Conference on Machine Learning, pages 304-313, 2016.

[22] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677, 2015.

[23] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

[24] Youssef Mroueh and Tom Sercu. Fisher GAN. In NIPS, 2017.

[25] Youssef Mroueh, Stephen Voinea, and Tomaso A. Poggio. Learning with group invariant features: A kernel perspective. In Advances in Neural Information Processing Systems, pages 1558-1566, 2015.

[26] Youssef Mroueh, Tom Sercu, and Vaibhava Goel. McGan: Mean and covariance feature matching GAN. In ICML, 2017.

[27] Partha Niyogi, Stephen Smale, and Shmuel Weinberger. A topological view of unsupervised learning from noisy data. SIAM Journal on Computing, 40(3):646-663, 2011.

[28] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271-279, 2016.

[29] Augustus Odena. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.

[30] Alec Radford, Luke Metz, and Soumith Chintala.
Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[31] Anant Raj, Abhishek Kumar, Youssef Mroueh, P. Thomas Fletcher, and Bernhard Schölkopf. Local group invariant representations via orbit embeddings. In AISTATS, 2017.

[32] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546-3554, 2015.

[33] Salah Rifai, Yann Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. The manifold tangent classifier. In NIPS, 2011.

[34] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2016.

[35] Cicero Nogueira dos Santos, Kahini Wadhawan, and Bowen Zhou. Learning loss functions for semi-supervised learning via discriminative adversarial networks. arXiv preprint arXiv:1707.02198, 2017.

[36] Patrice Y. Simard, Yann A. LeCun, John S. Denker, and Bernard Victorri. Transformation invariance in pattern recognition: tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, pages 239-274. Springer, 1998.

[37] M. Spivak. A Comprehensive Introduction to Differential Geometry, volume 1. Publish or Perish, 3rd edition, 1999.

[38] Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.

[39] Zhuowen Tu. Learning generative models via discriminative approaches. In Computer Vision and Pattern Recognition (CVPR '07), pages 1-8. IEEE, 2007.

[40] Rene Vidal, Yi Ma, and Shankar Sastry. Generalized principal component analysis (GPCA).
IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1945-1959, 2005.

[41] Junbo Zhao, Michael Mathieu, Ross Goroshin, and Yann LeCun. Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351, 2015.