{"title": "Unsupervised Transformation Learning via Convex Relaxations", "book": "Advances in Neural Information Processing Systems", "page_first": 6875, "page_last": 6883, "abstract": "Our goal is to extract meaningful transformations from raw images, such as varying the thickness of lines in handwriting or the lighting in a portrait. We propose an unsupervised approach to learn such transformations by attempting to reconstruct an image from a linear combination of transformations of its nearest neighbors. On handwritten digits and celebrity portraits, we show that even with linear transformations, our method generates visually high-quality modified images. Moreover, since our method is semiparametric and does not model the data distribution, the learned transformations extrapolate off the training data and can be applied to new types of images.", "full_text": "Unsupervised Transformation Learning\n\nvia Convex Relaxations\n\nTatsunori B. Hashimoto\n\nJohn C. Duchi\n\nPercy Liang\n\nStanford University\nStanford, CA 94305\n\n{thashim,jduchi,pliang}@cs.stanford.edu\n\nAbstract\n\nOur goal is to extract meaningful transformations from raw images, such as varying\nthe thickness of lines in handwriting or the lighting in a portrait. We propose an\nunsupervised approach to learn such transformations by attempting to reconstruct\nan image from a linear combination of transformations of its nearest neighbors. On\nhandwritten digits and celebrity portraits, we show that even with linear transfor-\nmations, our method generates visually high-quality modi\ufb01ed images. 
Moreover, since our method is semiparametric and does not model the data distribution, the learned transformations extrapolate off the training data and can be applied to new types of images.

1 Introduction

Transformations (e.g., rotating or varying the thickness of a handwritten digit) capture important invariances in data, which can be useful for dimensionality reduction [7], improving generative models through data augmentation [2], and removing nuisance variables in discriminative tasks [3]. However, current methods for learning transformations have two limitations. First, they rely on explicit transformation pairs—for example, given pairs of image patches undergoing rotation [12]. Second, improvements in transformation learning have focused on problems with known transformation classes, such as orthogonal or rotational groups [3, 4], while algorithms for general transformations require solving a difficult, nonconvex objective [12].
To tackle the above challenges, we propose a semiparametric approach for unsupervised transformation learning. Specifically, given data points x_1, . . . , x_n, we find K linear transformations A_1 . . . A_K such that the vector from each x_i to its nearest neighbor lies near the span of A_1 x_i . . . A_K x_i. The idea of using nearest neighbors for unsupervised learning has been explored in manifold learning [1, 7], but unlike these approaches and more recent work on representation learning [2, 13], we do not seek to model the full data distribution. Thus, even with relatively few parameters, the transformations we learn naturally extrapolate off the training distribution and can be applied to novel types of points (e.g., new types of images).
Our contribution is to express transformation matrices as a sum of rank-one matrices based on samples of the data. 
This new objective is convex, thus avoiding local minima (which we show to be a problem in practice), scales to real-world problems beyond the 10 × 10 image patches considered in past work, and allows us to derive disentangled transformations through a trace norm penalty.
Empirically, we show our method is fast and effective at recovering known disentangled transformations, improving on past baseline methods based on gradient descent and expectation maximization [11]. On the handwritten digits (MNIST) and celebrity faces (CelebA) datasets, our method finds interpretable and disentangled transformations—for handwritten digits, the thickness of lines and the size of loops in digits such as 0 and 9; and for celebrity faces, the degree of a smile.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Problem statement

Given a data point x ∈ R^d (e.g., an image) and strength scalar t ∈ R, a transformation is a smooth function f : R^d × R → R^d. For example, f(x, t) may be a rotated image. For a collection {f_k}_{k=1}^K of transformations, we consider entangled transformations, defined for a vector of strengths t ∈ R^K by f(x, t) := ∑_{k=1}^K f_k(x, t_k). We consider the problem of estimating a collection of transformations f* := ∑_{k=1}^K f*_k given random observations as follows: let pX be a distribution on points x and pT a distribution on transformation strength vectors t ∈ R^K, where the components t_k are independent under pT. Then for x̃_i iid∼ pX and t_i iid∼ pT, i = 1, . . . , n, we observe the transformations x_i = f*(x̃_i, t_i), while x̃_i and t_i are unobserved. Our goal is to estimate the K functions f*_1, . . . , f*_K.

2.1 Learning transformations based on matrix Lie groups

In this paper, we consider the subset of generic transformations defined via matrix Lie groups. 
These are natural as they map R^d → R^d and form a family of invertible transformations that we can parameterize by an exponential map. We begin by giving a simple example (rotation of points in two dimensions) and using this to establish the idea of the exponential map and its linear approximation. We then use these linear approximations for transformation learning.
A matrix Lie group is a set of invertible matrices closed under multiplication and inversion. In the example of rotation in two dimensions, the set of all rotations is parameterized by the angle θ, and any rotation by θ has representation

R_θ = [cos(θ)  −sin(θ); sin(θ)  cos(θ)].

The set of rotation matrices forms a Lie group, as R_θ R_{−θ} = I and the rotations are closed under composition.

Linear approximation. In our context, the important property of matrix Lie groups is that for transformations near the identity, they have local linear approximations (tangent spaces, the associated Lie algebra), and these local linearizations map back into the Lie group via the exponential map [9]. As a simple example, consider the rotation R_θ, which satisfies R_θ = I + θA + O(θ²), where

A = [0  −1; 1  0],

and R_θ = exp(θA) for all θ (here exp is the matrix exponential). The infinitesimal structure of Lie groups means that such relationships hold more generally through the exponential map: for any matrix Lie group G ⊂ R^{d×d}, there exists ε > 0 such that for all R ∈ G with ‖R − I‖ ≤ ε, there is an A ∈ R^{d×d} such that R = exp(A) = I + ∑_{m≥1} A^m/m!. In the case that G is a one-dimensional Lie group, we have more: for each R near I, there is a t ∈ R satisfying

R = exp(tA) = I + ∑_{m=1}^∞ t^m A^m / m!.

The matrix tA = log R in the exponential map is the derivative of our transformation (as A ≈ (R − I)/t for R − I small) and is analogous to locally linear neighborhoods in manifold learning [10]. The exponential map states that for transformations close to the identity, a linear approximation is accurate.
For any matrix A, we can also generate a collection of associated 1-dimensional manifolds as follows: letting x ∈ R^d, the set M_x = {exp(tA)x | t ∈ R} is a manifold containing x. Given two nearby points x_t = exp(tA)x and x_s = exp(sA)x, the local linearity of the exponential map shows that

x_t = exp((t − s)A)x_s = x_s + (t − s)A x_s + O((t − s)²) ≈ x_s + (t − s)A x_s.    (1)

Single transformation learning. The approximation (1) suggests a learning algorithm for finding a transformation from points on a one-dimensional manifold M: given points x_1, . . . , x_n sampled from M, pair each point x_i with its nearest neighbor x̄_i. Then we attempt to learn a transformation matrix A satisfying x̄_i ≈ x_i + t_i A x_i for some small t_i for each of these nearest neighbor pairs. As nearest neighbor distances ‖x_i − x̄_i‖ → 0 as n → ∞ [6], the linear approximation (1) eventually holds. For a one-dimensional manifold and transformation, we could then solve the problem

minimize_{{t_i}, A}  ∑_{i=1}^n ‖t_i A x_i − (x̄_i − x_i)‖².    (2)

If instead of using nearest neighbors, the pairs (x_i, x̄_i) were given directly as supervision, then this objective would be a form of first-order matrix Lie group learning [12].

Sampling and extrapolation. 
The learning problem (2) is semiparametric: our goal is to learn a transformation matrix A while treating the density of points x as a nonparametric nuisance variable. By focusing on modeling the differences between nearby (x, x̄) pairs, we avoid having to specify the density of x, which yields two advantages: first, the parametric nature of the model means that the transformations A are defined beyond the support of the training data; and second, by not modeling the full density of x, we can learn the transformation A even when the data comes from highly non-smooth distributions with arbitrary cluster structure.

3 Convex learning of transformations

The problem (2) makes sense only for one-dimensional manifolds without superposition of transformations, so we now extend the ideas (using the exponential map and its linear approximation) to a full matrix Lie group learning problem. We shall derive a natural objective function for this problem and provide a few theoretical results about it.

3.1 Problem setup

As real-world data contains multiple degrees of freedom, we learn several one-dimensional transformations, giving us the following multiple Lie group learning problem:
Definition 3.1. Given data x_1 . . . x_n ∈ R^d with x̄_i ∈ R^d as the nearest neighbor of x_i, the nonconvex transformation learning objective is

minimize_{t ∈ R^{n×K}, A_1,…,A_K ∈ R^{d×d}}  ∑_{i=1}^n ‖ ∑_{k=1}^K t_ik A_k x_i − (x̄_i − x_i) ‖².    (3)

This problem is nonconvex, and prior authors have commented on the difficulty of optimizing similar objectives [11, 14]. To avoid this difficulty, we will construct a convex relaxation. 
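For concreteness, objective (3) can be attacked directly by alternating least squares over the strengths and the matrix; the following is a minimal single-transformation sketch (the K = 1 case, objective (2)) on synthetic data. This is an illustration of the nonconvex objective under our own simplifying assumptions, not the optimizer used in the experiments:

```python
import numpy as np

def alternating_fit(X, Xnn, n_iters=100):
    """Alternating least squares on objective (2):
    min over {t_i}, A of sum_i || t_i * A @ x_i - (xnn_i - x_i) ||^2."""
    n, d = X.shape
    D = Xnn - X                        # nearest-neighbor displacements
    A = np.eye(d)                      # nonconvex problem: init matters
    t = np.ones(n)
    for _ in range(n_iters):
        # A-step: least squares in A with strengths t fixed
        TX = t[:, None] * X
        A = np.linalg.lstsq(TX, D, rcond=None)[0].T
        # t-step: closed-form 1-d least squares per example
        AX = X @ A.T                   # rows are A @ x_i
        t = (AX * D).sum(axis=1) / ((AX * AX).sum(axis=1) + 1e-12)
    return A, t
```

Note the scale ambiguity (cA, t/c) inherent in the bilinear form, and that runs like this are sensitive to initialization — precisely the failure mode the convex relaxation is designed to avoid.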
Define a matrix Z ∈ R^{n×d²}, where row Z_i is an unrolling of the transformation that approximately takes x_i to x̄_i. Then Eq. (3) can be written as

min_{rank(Z)=K}  ∑_{i=1}^n ‖mat(Z_i) x_i − (x̄_i − x_i)‖²,    (4)

where mat : R^{d²} → R^{d×d} is the matricization operator. Note the rank of Z is at most K, the number of transformations. We then relax the rank constraint to a trace norm penalty as

min_Z  ∑_{i=1}^n ‖mat(Z_i) x_i − (x̄_i − x_i)‖² + λ‖Z‖_*.    (5)

However, the matrix Z ∈ R^{n×d²} is too large to handle for real-world problems. Therefore, we propose approximating the objective function by modeling the transformation matrices as weighted sums of observed transformation pairs. This idea of using sampled pairs is similar to a kernel method: we will show that the true transformation matrices A*_k can be written as a linear combination of rank-one matrices (x̄_i − x_i)x_i⊤.¹
As intuition, assume that we are given a single point x_i ∈ R^d and x̄_i = t_i A* x_i + x_i, where t_i ∈ R is unobserved. If we approximate A* via the rank-one approximation A = (x̄_i − x_i)x_i⊤, then ‖x_i‖₂⁻² A x_i + x_i = x̄_i. This shows that A captures the behavior of A* on the single point x_i. By sampling sufficiently many examples and appropriately weighting each example, we can construct an accurate approximation over all points.

¹Section 9 of the supplemental material introduces a kernelized version that extends this idea to general manifolds.

Let us subsample x_1, . . . , x_r (WLOG, these are the first r points). 
Given these samples, let us write a transformation A as a weighted sum of r rank-one matrices (x̄_j − x_j)x_j⊤ with weights α ∈ R^{n×r}. We then optimize these weights:

min_α  ∑_{i=1}^n ‖ ∑_{j=1}^r α_ij (x̄_j − x_j) x_j⊤ x_i − (x̄_i − x_i) ‖² + λ‖α‖_*.    (6)

Next we show that with high probability, the weighted sum of O(K²d) samples is close in operator norm to the true transformation matrix A* (Lemma 3.2 and Theorem 3.3).

3.2 Learning one transformation via subsampling

We begin by giving the intuition behind the sampling-based objective in the one-transformation case. The correctness of rank-one reconstruction is obvious for the special case where the number of samples r = d and for each i we define x_i = e_i, where e_i is the i-th canonical basis vector. In this case x̄_i = t_i A* e_i + e_i for some unknown t_i ∈ R. Thus we can easily reconstruct A* with a weighted combination of rank-one samples as A = ∑_i α_i (x̄_i − x_i) x_i⊤ = ∑_i A* e_i e_i⊤ with α_i = t_i⁻¹.
In the general case, we observe the effects of A* on a non-orthogonal set of vectors x_1 . . . x_r as x̄_i − x_i = t_i A* x_i. A similar argument follows by changing our basis to make t_i x_i the i-th canonical basis vector and reconstructing A* in this new basis. The change of basis matrix for this case is the whitening map Σ^{−1/2}, where Σ = ∑_{i=1}^r x_i x_i⊤ / r.
Our lemma below makes the intuition precise and shows that given r > d samples, there exist weights α ∈ R^r such that A* = ∑_i α_i (x̄_i − x_i) x_i⊤ Σ⁻¹, where Σ is the inner product matrix from above. This justifies our objective in Eq. 
(6), since we can whiten x to ensure Σ = I, and there exist weights α_ij which minimize the objective by reconstructing A*.
Lemma 3.2. Let x_1 . . . x_r be drawn i.i.d. from a density with full-rank covariance, and let neighboring points x̄_1 . . . x̄_r be defined by x̄_i = t_i A* x_i + x_i for some unknown t_i ≠ 0 and A* ∈ R^{d×d}. If r ≥ d, then there exist weights α ∈ R^r which recover the unknown A* as

A* = ∑_{i=1}^r α_i (x̄_i − x_i) x_i⊤ Σ⁻¹,

where α_i = 1/(r t_i) and Σ = ∑_{i=1}^r x_i x_i⊤ / r.

Proof. The identity x̄_i = t_i A* x_i + x_i implies t_i (Σ^{−1/2} A* Σ^{1/2}) Σ^{−1/2} x_i = Σ^{−1/2}(x̄_i − x_i). Summing both sides with weights α_i and multiplying by x_i⊤ (Σ^{−1/2})⊤ yields

∑_{i=1}^r α_i Σ^{−1/2}(x̄_i − x_i) x_i⊤ (Σ^{−1/2})⊤ = ∑_{i=1}^r α_i t_i (Σ^{−1/2} A* Σ^{1/2}) Σ^{−1/2} x_i x_i⊤ (Σ^{−1/2})⊤
= Σ^{−1/2} A* Σ^{1/2} ∑_{i=1}^r α_i t_i Σ^{−1/2} x_i x_i⊤ (Σ^{−1/2})⊤.

By construction of Σ^{−1/2} and α_i = 1/(t_i r), we have ∑_{i=1}^r α_i t_i Σ^{−1/2} x_i x_i⊤ (Σ^{−1/2})⊤ = I. Therefore, ∑_{i=1}^r α_i Σ^{−1/2}(x̄_i − x_i) x_i⊤ (Σ^{−1/2})⊤ = Σ^{−1/2} A* Σ^{1/2}. When x spans R^d, Σ^{−1/2} is both invertible and symmetric, giving the statement of the lemma.

3.3 Learning multiple transformations

In the case of multiple transformations, the definition of recovering any single transformation matrix A*_k is ambiguous: given transformations A*_1 and A*_2, the matrices A*_1 + A*_2 and A*_1 − A*_2 both locally generate the same family of transformations. 
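This ambiguity is easy to verify numerically: the pair (A*_1 + A*_2, A*_1 − A*_2) reproduces exactly the displacements generated by (A*_1, A*_2) once the strengths are remapped. A small self-contained check (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100
A1 = rng.standard_normal((d, d)) * 0.1
A2 = rng.standard_normal((d, d)) * 0.1
X = rng.standard_normal((n, d))
t = rng.standard_normal((n, 2)) * 0.1      # strengths for (A1, A2)

# displacements generated by the pair (A1, A2)
D = t[:, :1] * (X @ A1.T) + t[:, 1:] * (X @ A2.T)

# the pair (A1 + A2, A1 - A2) generates the same displacements
# with remapped strengths s1 = (t1 + t2)/2, s2 = (t1 - t2)/2
B1, B2 = A1 + A2, A1 - A2
s1 = (t[:, 0] + t[:, 1]) / 2
s2 = (t[:, 0] - t[:, 1]) / 2
D2 = s1[:, None] * (X @ B1.T) + s2[:, None] * (X @ B2.T)
assert np.allclose(D, D2)
```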
We will refer to the transformations A* ∈ R^{K×d×d} and strengths t ∈ R^{n×K} as disentangled if t⊤t/r = σ²I for a scalar σ² > 0. This criterion implies that the activation strengths are uncorrelated across the observed data. We will later show in section 3.4 that this definition of disentangling captures our intuition, has a closed-form estimate, and is closely connected to our optimization problem.
We show a result analogous to the one-transformation case (Lemma 3.2): given r > K² samples, we can find weights α ∈ R^{r×K} which reconstruct any of the K disentangled transformation matrices A*_k as A*_k ≈ A_k = ∑_{i=1}^r α_ik (x̄_i − x_i) x_i⊤. This implies that minimization over α leads to estimates of A*. In contrast to Lemma 3.2, the multiple transformation recovery guarantee is probabilistic and inexact. This is because each summand (x̄_i − x_i)x_i⊤ contains effects from all K transformations, and there is no weighting scheme which exactly isolates the effects of a single transformation A*_k. Instead, we utilize the randomness in t to estimate A*_k by approximately canceling the contributions from the K − 1 other transformations.
Theorem 3.3. Let x_1 . . . x_r ∈ R^d be i.i.d. isotropic random variables and for each k ∈ [K], let t_{1,k} . . . t_{r,k} ∈ R be i.i.d. draws from a symmetric random variable with t⊤t/r = σ²I ∈ R^{K×K}, |t_ik| < C_1, and ‖x_i‖₂ < C_2 with probability one.
Given x_1 . . . x_r, and neighbors x̄_1 . . . 
x̄_r defined as x̄_i = ∑_{k=1}^K t_ik A*_k x_i + x_i for some A*_k ∈ R^{d×d}, there exists α ∈ R^{r×K} such that for all k ∈ [K],

P( ‖A*_k − ∑_{i=1}^r α_ik (x̄_i − x_i) x_i⊤‖ > ε ) < Kd exp( −r ε² sup_k ‖A*_k‖⁻² / (2K² · 2C_1²C_2² (1 + K⁻¹ sup_k ‖A*_k‖⁻¹ ε)) ).

Proof. We give a proof sketch and defer the details to the supplement (Section 7). We claim that for any k, α_ik = t_ik/(σ²r) satisfies the theorem statement. Following the one-dimensional case, we can expand the outer product in terms of the transformations A* as

A_k = ∑_{i=1}^r α_ik (x̄_i − x_i) x_i⊤ = ∑_{k'=1}^K A*_{k'} ∑_{i=1}^r α_ik t_ik' x_i x_i⊤.

As before, we must now control the inner terms Z^k_{k'} = ∑_{i=1}^r α_ik t_ik' x_i x_i⊤. We want Z^k_{k'} to be close to the identity when k' = k and near zero when k' ≠ k. Our choice of α_ik = t_ik/(σ²r) does this: if k' ≠ k, then the α_ik t_ik' are zero mean with random sign, resulting in Rademacher concentration bounds near zero; and if k' = k, then Bernstein bounds show that Z^k_k ≈ I, since E[∑_i α_ik t_ik x_i x_i⊤] = I.

3.4 Disentangling transformations

Given K estimated transformations A_1 . . . A_K ∈ R^{d×d} and strengths t ∈ R^{n×K}, any invertible matrix W ∈ R^{K×K} can be used to find an equivalent family of transformations Â_i = ∑_k W_ik A_k and t̂_ik = ∑_j W⁻¹_kj t_ij.
Despite this unidentifiability, there is a choice of Â_1 . . . Â_K and t̂ which is equivalent to A_1 . . . 
A_K but disentangled, meaning that across the observed transformation pairs {(x_i, x̄_i)}_{i=1}^n, the strengths for any two transformations are uncorrelated: t̂⊤t̂/n = I. This is a necessary condition to capture the intuition that two disentangled transformations should have independent strength distributions. For example, given a set of images generated by changing lighting conditions and sharpness, we expect the sharpness of an image to be uncorrelated with the lighting condition.
Formally, we will define a set of Â such that t̂_·i and t̂_·j are uncorrelated over the observed data, and any pair of transformations Â_i x and Â_j x generate decorrelated outputs. In contrast to mutual information based approaches to finding disentangled representations, our approach only seeks to control second moments, but it enforces decorrelation both in the latent space (t̂_ik) and in the observed space (Â_i x).

Theorem 3.4. Given A_k ∈ R^{d×d} and t ∈ R^{n×K} with ∑_i t_ik = 0, define Z = U S V⊤ ∈ R^{n×d²} as the SVD of Z, where each row is Z_i = ∑_{k=1}^K t_ik vec(A_k). The transformations Â_k = S_{k,k} mat(V_k⊤) and strengths t̂_ik = U_ik fulfill the following properties:

• ∑_k t̂_ik Â_k x_i = ∑_k t_ik A_k x_i (correct behavior),
• t̂⊤t̂ = I (uncorrelated in latent space),
• E[⟨Â_i X, Â_j X⟩] = 0 for any i ≠ j and random variable X with E[XX⊤] = I (uncorrelated in observed space).

Proof. The first property follows since Z is rank-K by construction, and the rank-K SVD preserves ∑_k t_ik A_k exactly. The second property follows from the SVD, U⊤U = I. The last property follows from V V⊤ = I, implying tr(Â_i⊤ Â_j) = 0 for i ≠ j. 
By linearity of trace, E[⟨Â_i X, Â_j X⟩] = S_{i,i} S_{j,j} tr(mat(V_i) mat(V_j)⊤) = 0.

Interestingly, this SVD appears in both the convex relaxation and the subsampling algorithm (Eq. 6) as part of the proximal step for the trace norm optimization. Thus the rank sparsity induced by the trace norm naturally favors a small number of disentangled transformations.

4 Experiments

We evaluate the effectiveness of our sampling-based convex relaxation for learning transformations in two ways. In section 4.1, we check whether we can recover a known set of rotation / translation transformations applied to a downsampled celebrity face image dataset. Next, in section 4.2 we perform a qualitative evaluation of learning transformations over raw celebrity faces (CelebA) and MNIST digits, following recent evaluations of disentangling in adversarial networks [2].

4.1 Recovering known transformations

We validate our convex relaxation and sampling procedure by recovering synthetic data generated from known transformations, and compare against existing approaches for learning linear transformations. Our experiment consists of recovering synthetic transformations applied to 50-image subsets of a downsampled (18 × 18) version of CelebA. The resolution and dataset size restrictions were due to the runtime of the baseline methods.
We compare two versions of our matrix Lie group learning algorithm against two baselines. For our method, we implement and compare convex relaxation with sampling (Eq. 6) and convex relaxation with sampling followed by gradient descent. This second method ensures that we achieve exactly the desired number of transformations K, since trace norm regularization cannot guarantee a fixed rank constraint. The full convex relaxation (Eq. 
5) is not covered here, since it is too slow to run on even\nthe smallest of our experiments.\nAs baselines, we compare to gradient descent with restarts on the nonconvex objective (Eq. 3)\nand the EM algorithm from Miao and Rao [11] run for 20 iterations and augmented with the SVD\nbased disentangling method (Theorem 3.4). These two methods represent the two classes of existing\napproaches to estimating general linear transformations from pairwise data [11].\nOptimization for our methods and gradient descent use minibatch proximal gradient descent with\nAdagrad [8], where the proximal step for trace norm penalties use subsampling down to \ufb01ve thousand\npoints and randomized SVD. All learned transformations were disentangled using the SVD method\nunless otherwise noted (Theorem 3.4).\nFigures 1a and b show the results of recovering a single horizontal translation transformation with\nerror measured in operator norm. Convex relaxation plus gradient descent (Convex+Gradient)\nachieves the same low error across all sampled 50 image subsets. Without the gradient descent,\nconvex relaxation alone does not achieve low error, since the trace norm penalty does not produce\nexactly rank-one results. Gradient descent on the other hand gets stuck in local minima even with\nstepsize tuning and restarts as indicated by the wide variance in error across runs. All methods\noutperform EM while using substantially less time.\nNext, we test disentangling and multiple-transformation recovery for random rotations, horizontal\ntranslations, and vertical translations (Figure 1c). In this experiment, we apply the three types of\ntransformations to the downsampled CelebA images, and evaluate the outputs by measuring the\nminimum-cost matching for the operator norm error between learned transformation matrices and\n\n6\n\n\fthe ground truth. 
Minimizing this metric requires recovering the true transformations up to label\npermutation.\nWe \ufb01nd results consistent with the one-transform recovery case, where convex relaxation with gradient\ndescent outperforms the baselines. We additionally \ufb01nd SVD based disentangling to be critical to\nrecovering multiple transformations. We \ufb01nd that removing SVD from the nonconvex gradient\ndescent baseline leads to substantially worse results (Figure 1c).\n\n(a) Operator norm error for re-\ncovering a single translation trans-\nform\n\n(b) Sampled convex relaxations\nare faster than baselines\n\n(c) Multiple transformations can be\nrecovered using SVD based disen-\ntangling\n\nFigure 1: Sampled convex relaxation with gradient descent achieves lower error on recovering a\nsingle known transformation (panel a), runs faster than baselines (panel b) and recovers multiple\ndisentangled transformations accurately (panel c).\n\n4.2 Qualitative outputs\n\nWe now test convex relaxation with sampling on MNIST and celebrity faces. We show a subset of\nlearned transformations here and include the full set in the supplemental Jupyter notebook.\n\n(a) Thickness\n\n(b) Blur\n\n(c) Loop size\n\n(d) Angle\n\nFigure 2: Matrix transformations learned on MNIST (top rows) and extrapolating on Kannada\nhandwriting (bottom row). Center column is the original digit, \ufb02anking columns are generated by\napplying the transformation matrix.\n\nOn MNIST digits we trained a \ufb01ve-dimensional linear transformation model over a 20,000 example\nsubset of the data, which took 10 minutes. The components extracted by our approach represent\ncoherent stylistic features identi\ufb01ed by earlier work using neural networks [2] such as thickness,\nrotation as well as some new transformations loop size and blur. Examples of images generated from\nthese learned transformations are shown in \ufb01gure 2. 
The center column is the original image, and all other images are generated by repeatedly applying transformation matrices. We also found that the transformations could sometimes extrapolate to other handwritten symbols, such as Kannada handwriting [5] (last row, figure 2). Finally, we visualize the learned transformations by summing the estimated transformation strength for each transformation across the minimum spanning tree on the observed data (see supplement section 9 for details). This visualization demonstrates that the learned representation of the data captures the style of the digit, such as thickness and loop size, and ignores the digit identity. This is a highly desirable trait for the algorithm, as it means that we can extract continuous factors of variation such as digit thickness without explicitly specifying and removing cluster structure in the data (Figure 3).

Figure 3: Embedding of MNIST digits based on two transformations: thickness and loop size. The learned transformations extract continuous, stylistic features which apply across multiple clusters despite being given no cluster information.

(a) PCA  (b) InfoGAN

Figure 4: Baselines applied to the same MNIST data often entangle digit identity and style.

In contrast to our method, many baseline methods inadvertently capture digit identity as part of the learned transformation. For example, the first component of PCA simply adds a zero to every image (Figure 4), while the first component of InfoGAN has higher fidelity in exchange for training instability, which often results in mixing digit identity and multiple transformations (Figure 4).
Finally, we apply our method to the celebrity faces dataset and find that we are able to extract high-level transformations using only linear models. 
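The 'repeatedly applying' operation used in these visualizations amounts to discretizing the exponential map from section 2.1: stepping x ← (I + δA)x approximates x ← exp(tA)x. A small sketch using the known 2-d rotation generator in place of a learned transformation (illustrative only; function names are ours):

```python
import numpy as np

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])     # rotation generator from section 2.1

def apply_transform(x, A, t, n_steps=1000):
    """Approximate exp(t A) @ x by repeatedly applying (I + (t/n) A)."""
    step = np.eye(len(x)) + (t / n_steps) * A
    for _ in range(n_steps):
        x = step @ x
    return x

theta = 0.5
x = np.array([1.0, 0.0])
y = apply_transform(x, A, theta)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
# y approximates R @ x up to a discretization error that shrinks with n_steps
```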
We trained our model on a 1000-dimensional PCA projection of CelebA constructed from the original 116412 dimensions, with K = 20, and found both global scene transformations such as sharpness and contrast (Figure 5a) and higher-level transformations such as adding a smile (Figure 5b).

(a) Contrast / Sharpness  (b) Smiling / Skin tone

Figure 5: Learned transformations for celebrity faces capture both simple (sharpness) and high-level (smiling) transformations. For each panel, the center column is the original image, and columns to the left and right were generated by repeatedly applying the learned transformation.

5 Related Work and Discussion

Learning transformation matrices, also known as Lie group learning, has a long history, with the closest work to ours being Miao and Rao [11] and Rao and Ruderman [12]. These earlier methods use a Taylor approximation to learn a set of small (< 10 × 10) transformation matrices given pairs of image patches undergoing a small transformation. In contrast, our work does not require supervision in the form of transformation pairs, and it provides a scalable new convex objective function.
There have been improvements to Rao and Ruderman [12] focusing on removing the Taylor approximation in order to learn transformations from distant examples: Cohen and Welling [3, 4] learned commutative and 3d-rotation Lie groups under a strong assumption of uniform density over rotations. Sohl-Dickstein et al. [14] learn commutative transformations generated by normal matrices using eigendecompositions and supervision in the form of successive 17 × 17 image patches in a video. Our work differs because we seek to learn multiple, general transformation matrices from large, high-dimensional datasets. 
Because of this difference, our algorithm focuses on scalability and\navoiding local minima at the expense of utilizing a less accurate \ufb01rst-order Taylor approximation.\nThis approximation is reasonable, since we \ufb01t our model to nearest neighbor pairs which are by\nde\ufb01nition close to each other. Empirically, we \ufb01nd that these approximations result in a scalable\nalgorithm for unsupervised recovery of transformations.\nLearning to transform between neighbors on a nonlinear manifold has been explored in Doll\u00e1r\net al. [7] and Bengio and Monperrus [1]. Both works model a manifold by predicting the linear\nneighborhoods around points using nonlinear functions (radial basis functions in Doll\u00e1r et al. [7] and\na one-layer neural net in Bengio and Monperrus [1]). In contrast to these methods, which begin with\nthe goal of learning all manifolds, we focus on a class of linear transformations, and treat the general\nmanifold problem as a special kernelization. This has three bene\ufb01ts: \ufb01rst, we avoid the high model\ncomplexity necessary for general manifold learning. Second, extrapolation beyond the training data\noccurs explicitly from the linear parametric form of our model (e.g., from digits to Kannada). Finally,\nlinearity leads to a de\ufb01nition of disentangling based on correlations and a SVD based method for\nrecovering disentangled representations.\nIn summary, we have presented an unsupervised approach for learning disentangled representa-\ntions via linear Lie groups. We demonstrated that for image data, even a linear model is sur-\nprisingly effective at learning semantically meaningful transformations. Our results suggest that\nthese semi-parametric transformation models are promising for identifying semantically meaningful\nlow-dimensional continuous structures from high-dimensional real-world data.\n\nAcknowledgements.\n\nWe thank Arun Chaganty for helpful discussions and comments. 
This work was supported by\nNSF-CAREER award 1553086, DARPA (Grant N66001-14-2-4055), and the DAIS ITA program\n(W911NF-16-3-0001).\n\nReproducibility.\n\nCode, data, and experiments can be found on Codalab Worksheets (http://bit.ly/2Aj5tti).\n\n9\n\n\f", "award": [], "sourceid": 3449, "authors": [{"given_name": "Tatsunori", "family_name": "Hashimoto", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Percy", "family_name": "Liang", "institution": "Stanford University"}, {"given_name": "John", "family_name": "Duchi", "institution": "Stanford"}]}