{"title": "A Non-generative Framework and Convex Relaxations for Unsupervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3306, "page_last": 3314, "abstract": "We give a novel formal theoretical framework for unsupervised learning with two distinctive characteristics. First, it does not assume any generative model and based on a worst-case performance metric. Second, it is comparative, namely performance is measured with respect to a given hypothesis class. This allows to avoid known computational hardness results and improper algorithms based on convex relaxations. We show how several families of unsupervised learning models, which were previously only analyzed under probabilistic assumptions and are otherwise provably intractable, can be efficiently learned in our framework by convex optimization.", "full_text": "A Non-generative Framework and Convex\nRelaxations for Unsupervised Learning\n\nElad Hazan\n\nPrinceton University\n35 Olden Street 08540\n\nehazan@cs.princeton.edu.\n\nTengyu Ma\n\nPrinceton University\n\n35 Olden Street, NJ 08540\n\ntengyu@cs.princeton.edu.\n\nAbstract\n\nWe give a novel formal theoretical framework for unsupervised learning with two\ndistinctive characteristics. First, it does not assume any generative model and\nbased on a worst-case performance metric. Second, it is comparative, namely\nperformance is measured with respect to a given hypothesis class. This allows\nto avoid known computational hardness results and improper algorithms based\non convex relaxations. We show how several families of unsupervised learning\nmodels, which were previously only analyzed under probabilistic assumptions and\nare otherwise provably intractable, can be ef\ufb01ciently learned in our framework by\nconvex optimization.\n\n1\n\nIntroduction\n\nUnsupervised learning is the task of learning structure from unlabelled examples. 
Informally, the main goal of unsupervised learning is to extract structure from the data in a way that will enable efficient learning from future labelled examples, potentially for numerous independent tasks.

It is useful to recall the Probably Approximately Correct (PAC) learning theory for supervised learning [28], based on Vapnik's statistical learning theory [29]. In PAC learning, the learner can access labelled examples from an unknown distribution. On the basis of these examples, the learner constructs a hypothesis that generalizes to unseen data. A concept is said to be learnable with respect to a hypothesis class if there exists an (efficient) algorithm that outputs a generalizing hypothesis with high probability after observing polynomially many examples in terms of the input representation.

The great achievements of PAC learning that made it successful are its generality and algorithmic applicability: PAC learning does not restrict the input domain in any way, and thus allows very general learning, without generative or distributional assumptions on the world. Another important feature is the restriction to specific hypothesis classes, without which there are simple impossibility results such as the "no free lunch" theorem. This restriction allows comparative and improper learning of computationally-hard concepts.

The latter is a very important point which is often understated. Consider the example of sparse regression, which is a canonical problem in high-dimensional statistics. Fitting the best sparse vector to linear prediction is an NP-hard problem [20]. However, this does not prohibit improper learning, since we can use an ℓ1 convex relaxation for the sparse vectors (famously known as LASSO [26]).

Unsupervised learning, on the other hand, while widely applied and well-studied, has not seen such an inclusive theory.
The most common approaches, such as restricted Boltzmann machines, topic models, dictionary learning, principal component analysis and metric clustering, are based almost entirely on generative assumptions about the world. This is a strong restriction which makes it very hard to analyze such approaches in scenarios for which the assumptions do not hold. A more discriminative approach is based on compression, such as the Minimum Description Length criterion. This approach gives rise to provably intractable problems and doesn't allow improper learning.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Main results. We start by proposing a rigorous framework for unsupervised learning which allows data-dependent, comparative learning without generative assumptions about the world. It is general enough to encompass previous methods such as PCA, dictionary learning and topic models. Our main contributions are optimization-based relaxations and efficient algorithms that are shown to improperly learn previous models with high probability, specifically:

1. We consider the class of hypotheses known as dictionary learning. We give a more general hypothesis class which encompasses and generalizes it according to our definitions. We proceed to give novel polynomial-time algorithms for learning the broader class. These algorithms are based on new techniques in sum-of-squares convex relaxations. As far as we know, this is the first result for efficient improper learning of dictionaries without generative assumptions. Moreover, our result handles polynomially over-complete dictionaries, while previous works [4, 8] apply to at most constant-factor over-completeness.

2. We give efficient algorithms for learning a new hypothesis class which we call spectral autoencoders.
We show that this class generalizes, according to our definitions, the class of PCA (principal component analysis) and its kernel extensions.

Structure of this paper. In the following section we give a non-generative, distribution-dependent definition for unsupervised learning which mirrors that of PAC learning for supervised learning. We then proceed to an illustrative example and show how Principal Component Analysis can be formally learned in this setting. The same section also gives a much more general class of hypotheses for unsupervised learning, which we call polynomial spectral decoding, and shows how they can be efficiently learned in our framework using convex optimization. Finally, we get to our main contribution: a convex-optimization-based methodology for improperly learning a wide class of hypotheses, including dictionary learning.

1.1 Previous work

The vast majority of work on unsupervised learning, both theoretical and applied, focuses on generative models. These include topic models [11], dictionary learning [13], Deep Boltzmann Machines and deep belief networks [24] and many more. These models often entail non-convex optimization problems that are provably NP-hard to solve in the worst case.

A recent line of work in theoretical machine learning attempts to give efficient algorithms for these models with provable guarantees. Such algorithms were given for topic models [5], dictionary learning [6, 4], mixtures of Gaussians and hidden Markov models [15, 3] and more. However, these works retain, and at times even enhance, the probabilistic generative assumptions of the underlying model. Perhaps the most widely used unsupervised learning methods are clustering algorithms such as k-means, k-medians and principal component analysis (PCA), though these lack generalization guarantees. An axiomatic approach to clustering was initiated by Kleinberg [17] and pursued further in [9].
A discriminative generalization-based approach for clustering was undertaken in [7] within the model of similarity-based clustering.

Another approach, from the information theory literature, studies online lossless compression. The relationship between compression and machine learning goes back to the Minimum Description Length criterion [23]. More recent work in information theory gives online algorithms that attain optimal compression, mostly for finite alphabets [1, 21]. For infinite alphabets, which are the main object of study for unsupervised learning of signals such as images, there are known impossibility results [16]. This connection to compression was recently further advanced, mostly in the context of textual data [22].

In terms of lossy compression, Rate Distortion Theory (RDT) [10, 12] is intimately related to our definitions, as a framework for finding lossy compression with minimal distortion (which would correspond to reconstruction error in our terminology). Our learnability definition can be seen as an extension of RDT that allows improper learning and generalization error bounds. Another learning framework derived from lossy compression is the information bottleneck criterion [27], and its learning-theoretic extensions [25]. The latter framework assumes an additional feedback signal, and thus is not purely unsupervised.

The downside of the information-theoretic approaches is that worst-case competitive compression is provably computationally hard under cryptographic assumptions. In contrast, our compression-based approach is based on learning a restriction to a specific hypothesis class, much like PAC learning. This circumvents the impossibility results and allows for improper learning.

2 A formal framework for unsupervised learning

The basic constructs in an unsupervised learning setting are:

1. An instance domain X, such as images, text documents, etc., and a target space, or range, Y.
We usually think of X = ℝ^d and Y = ℝ^k with d ≫ k. (Alternatively, Y can be all sparse vectors in a larger space.)

2. An unknown, arbitrary distribution D on the domain X.

3. A hypothesis class of decoding and encoding pairs,

   H ⊆ { (h, g) ∈ {X → Y} × {Y → X} },

   where h is the encoding hypothesis and g is the decoding hypothesis.

4. A loss function ℓ : H × X → ℝ≥0 that measures the reconstruction error, ℓ((g, h), x). For example, a natural choice is the ℓ2-loss ℓ((g, h), x) = ‖g(h(x)) − x‖₂². The rationale here is to learn structure without significantly compromising supervised learning for arbitrary future tasks. Near-perfect reconstruction is sufficient, as formally proved in Appendix 6.1. Without generative assumptions, it can be seen that near-perfect reconstruction is also necessary.

For convenience of notation, we use f as a shorthand for (h, g) ∈ H, a member of the hypothesis class H. Denote the generalization ability of an unsupervised learning algorithm with respect to a distribution D as

   loss_D(f) = E_{x∼D}[ℓ(f, x)].

We can now define the main object of study: unsupervised learning with respect to a given hypothesis class. The definition is parameterized by two real numbers: the first is the encoding length (measured in bits) of the hypothesis class. The second is the bias, or additional error compared to the best hypothesis. Both parameters are necessary to allow improper learning.

Definition 2.1. We say that an instance (D, X) is (k, γ)-C-learnable with respect to hypothesis class H if there exists an algorithm that for every δ, ε > 0, after seeing m(ε, δ) = poly(1/ε, log(1/δ), d) examples, returns an encoding and decoding pair (h, g) (not necessarily from H) such that:

1. with probability at least 1 − δ, loss_D((h, g)) ≤ min_{(h,g)∈H} loss_D((h, g)) + ε + γ.

2.
h(x) has an explicit representation with length at most k bits.

For convenience we typically encode into real numbers instead of bits. Real encodings can often (though not in the worst case) be trivially transformed into binary ones with the loss of a logarithmic factor.

Following PAC learning theory, we can use uniform convergence to bound the generalization error of the empirical risk minimizer (ERM). Define the empirical loss for a given sample S ∼ D^m as

   loss_S(f) = (1/m) · Σ_{x∈S} ℓ(f, x).

Define the ERM hypothesis for a given sample S ∼ D^m as f̂_ERM = argmin_{f̂∈H} loss_S(f̂).

For a hypothesis class H, a loss function ℓ and a set of m samples S ∼ D^m, define the empirical Rademacher complexity of H with respect to ℓ and S as¹

   R_{S,ℓ}(H) = E_{σ∼{±1}^m}[ sup_{f∈H} (1/m) Σ_{x_i∈S} σ_i ℓ(f, x_i) ].

Let the Rademacher complexity of H with respect to distribution D and loss ℓ be R_m(H) = E_{S∼D^m}[R_{S,ℓ}(H)]. When it is clear from the context, we will omit the subscript ℓ.

We can now state and apply standard generalization error results. The proof of the following theorem is almost identical to [19, Theorem 3.1]. For completeness we provide a proof in Appendix 6.

Theorem 2.1. For any δ > 0, with probability 1 − δ, the generalization error of the ERM hypothesis is bounded by

   loss_D(f̂_ERM) ≤ min_{f∈H} loss_D(f) + 6R_m(H) + √(4 log(1/δ) / (2m)).

An immediate corollary of the theorem is that as long as the Rademacher complexity of a hypothesis class approaches zero as the number of examples goes to infinity, it can be C-learned by an inefficient algorithm that optimizes over the hypothesis class by enumeration and outputs a best hypothesis, with encoding length k and bias γ = 0. Not surprisingly, such optimization is often intractable and hence the main challenge is to design efficient algorithms.
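To make the empirical quantities above concrete, here is a small self-contained numerical sketch, entirely our own illustration and not from the paper: a finite hypothesis class of coordinate-subset projections, its empirical risk minimizer, and a Monte-Carlo estimate of the empirical Rademacher complexity from the displayed definition. All data, names and parameter choices are assumptions made for the demo.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
d, k, m = 6, 2, 200

# Toy data concentrated on coordinates {0, 1}, plus small noise.
S = np.zeros((m, d))
S[:, :2] = rng.normal(size=(m, 2))
S += 0.1 * rng.normal(size=(m, d))

# Finite hypothesis class: (h, g) keeps a subset of k coordinates.
classes = list(combinations(range(d), k))

def loss(subset, X):
    # l2 reconstruction loss: everything outside `subset` is lost.
    R = X.copy()
    R[:, list(subset)] = 0.0
    return (R ** 2).sum(axis=1)          # per-example loss

# Empirical risk minimizer over the finite class.
emp = {c: loss(c, S).mean() for c in classes}
erm = min(emp, key=emp.get)
print("ERM hypothesis:", erm, "empirical loss:", emp[erm])

# Monte-Carlo estimate of the empirical Rademacher complexity
# R_S(H) = E_sigma sup_f (1/m) sum_i sigma_i * l(f, x_i).
L = np.stack([loss(c, S) for c in classes])   # |H| x m loss matrix
vals = []
for _ in range(200):
    sigma = rng.choice([-1.0, 1.0], size=m)
    vals.append((L @ sigma).max() / m)
print("estimated R_S(H):", np.mean(vals))
```

Because the class is finite, the supremum is an exact maximum; for the structured classes in the paper this step is the computational crux.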
As we will see in later sections, we often need to trade the encoding length and bias slightly for computational efficiency.

Notation: For every vector z ∈ ℝ^{d₁} ⊗ ℝ^{d₂}, we can view it as a matrix of dimension d₁ × d₂, which is denoted as M(z). In this notation, M(u ⊗ v) = uv^⊤. Let vmax(·) : (ℝ^d)^{⊗2} → ℝ^d be the function that computes the top right-singular vector of a vector in (ℝ^d)^{⊗2} viewed as a matrix. That is, for z ∈ (ℝ^d)^{⊗2}, vmax(z) denotes the top right-singular vector of M(z). We also overload the notation vmax for generalized eigenvectors of higher-order tensors: for T ∈ (ℝ^d)^{⊗ℓ}, let vmax(T) = argmax_{‖x‖≤1} T(x, x, . . . , x), where T(·) denotes the multi-linear form defined by the tensor T.

3 Spectral autoencoders: unsupervised learning of algebraic manifolds

3.1 Algebraic manifolds

The goal of the spectral autoencoder hypothesis class we define henceforth is to learn the representation of data that lies on a low-dimensional algebraic variety/manifold. The linear variety, or linear manifold, defined by the roots of linear equations, is simply a linear subspace. If the data resides in a linear subspace, or close enough to it, then PCA is effective at learning its succinct representation.

One extension of linear manifolds is the set of roots of low-degree polynomial equations. Formally, let k, s be integers, let c₁, . . . , c_{d^s−k} ∈ ℝ^{d^s} be a set of vectors in d^s dimensions, and consider the algebraic variety

   M = { x ∈ ℝ^d : ∀i ∈ [d^s − k], ⟨c_i, x^{⊗s}⟩ = 0 }.

Observe that here each constraint ⟨c_i, x^{⊗s}⟩ is a degree-s polynomial over the variables x, and when s = 1 the variety M becomes a linear subspace. Let a₁, . . . , a_k ∈ ℝ^{d^s} be a basis of the subspace orthogonal to all of c₁, . . . , c_{d^s−k}, and let A ∈ ℝ^{k×d^s} contain the a_i as its rows. Then, given x ∈ M, the encoding

   y = Ax^{⊗s}

pins down all the unknown information regarding x.
In fact, for any x ∈ M we have A^⊤Ax^{⊗s} = x^{⊗s}, and therefore x is decodable from y. The argument can also be extended to the situation when the data point is close to M (according to a metric, as we discuss later). The goal of the rest of this section is to learn the encoding matrix A given data points residing close to M.

Warm up: PCA and kernel PCA. In this subsection we illustrate our framework for agnostic unsupervised learning by showing how PCA and kernel PCA can be efficiently learned within our model. The results of this subsection are not new, and are given only for illustrative purposes. The class of hypotheses corresponding to PCA operates on domain X = ℝ^d and range Y = ℝ^k for some k < d via linear operators. In kernel PCA, the encoding linear operator applies to the s-th tensor power x^{⊗s} of the data. That is, the encoding and decoding are parameterized by a linear operator A ∈ ℝ^{k×d^s},

   H^pca_{k,s} = { (h_A, g_A) : h_A(x) = Ax^{⊗s}, g_A(y) = A^†y },

where A^† denotes the pseudo-inverse of A. The natural loss function here is the Euclidean norm,

   ℓ((g, h), x) = ‖x^{⊗s} − g(h(x))‖² = ‖(I − A^†A)x^{⊗s}‖².

Theorem 3.1. For a fixed constant s ≥ 1, the class H^pca_{k,s} is efficiently C-learnable with encoding length k and bias γ = 0.

The proof of the theorem follows from two simple components: (a) finding the ERM among H^pca_{k,s} can be done efficiently by taking the SVD of the covariance matrix of the (lifted) data points; (b) the Rademacher complexity of the hypothesis class is bounded by O(d^s/m) for m examples. Thus by Theorem 2.1 the minimizer of the ERM generalizes.

¹Technically, this is the Rademacher complexity of the class of functions ℓ ∘ H. However, since ℓ is usually fixed for a certain problem, the definition emphasizes the dependency on H.
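Component (a) of the proof sketch, ERM for H^pca_{k,s} via an SVD of the lifted data, can be illustrated as follows for s = 2. The synthetic data and every parameter choice here are our own assumptions for the demo, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, m = 5, 3, 400

# Data near a low-dimensional subspace, so the s = 2 lift x (x) x
# is also (approximately) low-rank.
U = np.linalg.qr(rng.normal(size=(d, 2)))[0]
X = (rng.normal(size=(m, 2)) @ U.T) + 0.01 * rng.normal(size=(m, d))

# Lift each example to x^{(x)2}, flattened to dimension d^2.
Z = np.stack([np.outer(x, x).ravel() for x in X])   # m x d^2

# ERM for H^pca_{k,2}: top-k right singular vectors of the lifted data.
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
A = Vt[:k]                       # k x d^2, orthonormal rows

# Encode / decode; the pseudo-inverse of A is A.T for orthonormal rows.
Y = Z @ A.T                      # encodings of length k
Zhat = Y @ A                     # reconstructions in the lifted space
err = np.mean(np.sum((Z - Zhat) ** 2, axis=1))
print("average lifted reconstruction error:", err)
```

The encoding length is k real numbers per example, matching Theorem 3.1's statement, and the residual is only the noise-induced part of the lift.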
The full proof is deferred to Appendix A.

3.2 Spectral autoencoders

In this section we give a much broader class of hypotheses, encompassing PCA and kernel PCA, and show how to learn them efficiently. Throughout this section we assume that the data is normalized to Euclidean norm 1, and consider the following class of hypotheses which naturally generalizes PCA:

Definition 3.1 (Spectral autoencoder). We define the class H^sa_{k,s} as the set of all hypothesis pairs (h, g) with

   h(x) = Ax^{⊗s}, A ∈ ℝ^{k×d^s};  g(y) = vmax(By), B ∈ ℝ^{d^s×k}.   (3.1)

We note that this notion is more general than kernel PCA: suppose some (g, h) ∈ H^pca_{k,s} has reconstruction error ε, namely, A^†Ax^{⊗s} is ε-close to x^{⊗s} in Euclidean norm. Then by the eigenvector perturbation theorem, vmax(A^†Ax^{⊗s}) also reconstructs x with O(ε) error, and therefore there exists a spectral autoencoder hypothesis with O(ε) error as well. Vice versa, it is quite possible that for every A the reconstruction A^†Ax^{⊗s} is far away from x^{⊗s}, so that kernel PCA doesn't apply, but with spectral decoding we can still reconstruct x from vmax(A^†Ax^{⊗s}), since the top eigenvector of A^†Ax^{⊗s} is close to x.

Here the key matter that distinguishes us from kernel PCA is in what metric x needs to be close to the manifold so that it can be reconstructed. Using PCA, the requirement is that x is close to M (which is a subspace) in Euclidean distance, and using kernel PCA, x^{⊗2} needs to be close in Euclidean distance to the null space of the c_i's. However, Euclidean distances in the original space and the lifted space are typically meaningless for high-dimensional data, since any two data points are far from each other in Euclidean distance.
The advantage of using spectral autoencoders is that in the lifted space the geometry is measured by the spectral-norm distance, which can be much smaller than the Euclidean distance (with a potential gap of d^{1/2}). The key here is that though the dimension of the lifted space is d², the objects of our interest form the set of rank-1 tensors of the form x^{⊗2}. Therefore, the spectral-norm distance is a much more effective measure of closeness, since it exploits the underlying structure of the lifted data points.

We note that spectral autoencoders relate to vanishing component analysis [18]. When the data is close to an algebraic manifold, spectral autoencoders aim to find the (small number of) essential non-vanishing components in a noise-robust manner.

3.3 Learnability of polynomial spectral decoding

For simplicity we focus on the case s = 2. Ideally we would like to learn the best encoding-decoding scheme for any data distribution D, though there are technical difficulties in achieving such a general result. A natural attempt would be to optimize the loss function f(A, B) = ‖g(h(x)) − x‖² = ‖x − vmax(BAx^{⊗2})‖². Not surprisingly, the function f is not convex with respect to A, B, and in fact it can even be non-continuous (if not ill-defined)!

Here we make a further realizability assumption that the data distribution D admits a reasonable encoding and decoding pair with reasonable reconstruction error.

Definition 3.2. We say a data distribution D is (k, ε)-regularly spectrally decodable if there exist A ∈ ℝ^{k×d²} and B ∈ ℝ^{d²×k} with ‖BA‖_op ≤ τ such that for x ∼ D, with probability 1, the encoding y = Ax^{⊗2} satisfies

   M(By) = M(BAx^{⊗2}) = xx^⊤ + E,   (3.2)

where ‖E‖_op ≤ ε.
Here τ ≥ 1 is treated as a fixed constant globally.

To interpret the definition, we observe that if the data distribution D is (k, ε)-regularly spectrally decodable, then by equation (3.2) and Wedin's theorem (see e.g. [30]) on the robustness of eigenvectors to perturbation, M(By) has a top eigenvector² that is O(ε)-close to x itself. Therefore, Definition 3.2 is a sufficient condition for the spectral decoding algorithm vmax(By) to return x approximately, though it might not be necessary. Moreover, this condition partially addresses the non-continuity issue of using the objective f(A, B) = ‖x − vmax(BAx^{⊗2})‖², although f(A, B) remains (highly) non-convex. We resolve this issue by using a convex surrogate.

Our main result concerning the learnability of the aforementioned hypothesis class is:

Theorem 3.2. The hypothesis class H^sa_{k,2} is C-learnable with encoding length O(τ⁴k⁴/ε⁴) and bias ε with respect to (k, ε)-regular distributions, in polynomial time.

Our approach towards finding encoding and decoding matrices A, B is to optimize the objective

   minimize f(R) = E[ ‖Rx^{⊗2} − x^{⊗2}‖_op ]  s.t. ‖R‖_{S1} ≤ τk,   (3.3)

where ‖·‖_{S1} denotes the Schatten 1-norm. Suppose D is (k, ε)-regularly decodable, and suppose h_A and g_B are the corresponding encoding and decoding functions. Then R = BA has rank at most k and satisfies f(R) ≤ ε. On the other hand, suppose one obtains some R of rank k′ such that f(R) ≤ δ; then we can produce h_A and g_B with O(δ) reconstruction error simply by choosing A ∈ ℝ^{k′×d²} and B ∈ ℝ^{d²×k′} such that R = BA.

We use (non-smooth) Frank-Wolfe to solve objective (3.3), which in particular returns a low-rank solution. We defer the proof of Theorem 3.2 to Appendix A.1.
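The Frank-Wolfe strategy over the Schatten-1 ball can be sketched numerically. Objective (3.3) uses an expected spectral norm; for a runnable illustration we substitute a squared Frobenius surrogate, which keeps the two ingredients that matter here: the nuclear-norm constraint and the rank-one linear minimization oracle that makes Frank-Wolfe iterates low-rank. The toy distribution, step size and iteration count are our own assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, m, tau = 4, 2, 300, 2.0

# Regularly decodable toy data: x from a k-dim subspace, normalized
# to unit norm as in the paper, so z = vec(xx^T) is low-dimensional.
U = np.linalg.qr(rng.normal(size=(d, k)))[0]
X = rng.normal(size=(m, k)) @ U.T
X /= np.linalg.norm(X, axis=1, keepdims=True)
Z = np.stack([np.outer(x, x).ravel() for x in X])   # m x d^2

# Frank-Wolfe on f(R) = E ||R z - z||_F^2 over {R : ||R||_S1 <= tau*k}.
# The LMO over the nuclear-norm ball is -radius * u v^T from the top
# singular pair of the gradient, so every iterate is a low-rank mix.
R = np.zeros((d * d, d * d))
radius = tau * k
for t in range(1, 301):
    G = 2 * (Z @ R.T - Z).T @ Z / m                 # gradient of f at R
    u, s, vt = np.linalg.svd(G)
    V = -radius * np.outer(u[:, 0], vt[0])          # LMO vertex
    eta = 2.0 / (t + 2)                             # standard FW step size
    R = (1 - eta) * R + eta * V

loss = np.mean(np.sum((Z @ R.T - Z) ** 2, axis=1))
print("final surrogate loss:", loss)
```

By construction the iterate stays inside the constraint set, so its nuclear norm never exceeds τk, and the surrogate loss drops well below its starting value of 1.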
Under slightly stronger assumptions on the data distribution D, we can reduce the length of the code to O(k²/ε²) from O(k⁴/ε⁴). See details in Appendix B.

4 A family of optimization encodings and efficient dictionary learning

In this section we give efficient algorithms for learning a family of unsupervised learning models commonly known as "dictionary learning". In contrast to previous approaches, we do not construct an actual "dictionary", but rather improperly learn a comparable encoding via convex relaxations.

We consider a different family of codes which is motivated by matrix-based unsupervised learning models such as topic models, dictionary learning and PCA. This family is described by a matrix A ∈ ℝ^{d×r} which has low complexity according to a certain norm ‖·‖_α, that is, ‖A‖_α ≤ c_α. We can parametrize a family of hypotheses H according to these matrices, and define an encoding-decoding pair according to

   h_A(x) = argmin_{‖y‖ ≤ k} (1/d)|x − Ay|₁ ,  g_A(y) = Ay.

We choose the ℓ1 norm to measure the error mostly for convenience, though the choice is quite flexible. The different norms ‖·‖_α, ‖·‖ over A and y give rise to different learning models that have been considered before. For example, if these are Euclidean norms, then we get PCA. If ‖·‖_α is the max column ℓ2 or ℓ∞ norm and the norm over y is the ℓ0 norm, then this corresponds to dictionary learning (more details in the next section).

The optimal hypothesis in terms of reconstruction error is given by

   A* = argmin_{‖A‖_α ≤ c_α} E_{x∼D}[ (1/d)|x − g_A(h_A(x))|₁ ] = argmin_{‖A‖_α ≤ c_α} E_{x∼D}[ min_{y∈ℝ^r: ‖y‖ ≤ k} (1/d)|x − Ay|₁ ].

The loss function can be generalized to other norms, e.g., the squared ℓ2 loss, without any essential change in the analysis.
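The encoding map h_A above is itself a convex program: an ℓ1 regression under an ℓ1-type budget on y, which can be written as a linear program with the usual positive/negative split of y and per-coordinate slack variables. A minimal sketch follows; the dictionary and signal are synthetic and the whole block only illustrates the encoding step, not the paper's learning algorithm.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
d, r, k = 8, 12, 3.0

A = rng.uniform(-1, 1, size=(d, r))        # dictionary, entries bounded by 1
y_true = np.zeros(r)
y_true[:3] = [1.5, -1.0, 0.5]              # ||y_true||_1 = 3 <= k
x = A @ y_true                             # noiseless signal for the sketch

def encode(A, x, k):
    """h_A(x): minimize |x - Ay|_1 subject to ||y||_1 <= k, as an LP."""
    d, r = A.shape
    # Variables: [y_plus (r), y_minus (r), t (d)]; minimize sum(t).
    c = np.concatenate([np.zeros(2 * r), np.ones(d)])
    I = np.eye(d)
    A_ub = np.block([
        [-A,  A, -I],                                    #  (x - Ay) <= t
        [ A, -A, -I],                                    # -(x - Ay) <= t
        [np.ones((1, r)), np.ones((1, r)), np.zeros((1, d))],  # l1 budget
    ])
    b_ub = np.concatenate([-x, x, [k]])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * r + d))
    return res.x[:r] - res.x[r:2 * r]

y = encode(A, x, k)
print("per-coordinate reconstruction error:", np.abs(x - A @ y).sum() / d)
```

Since the budget admits y_true, the LP's optimal objective is zero here; the recovered y need not equal y_true, only reconstruct x, which is exactly the improper spirit of the framework.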
Notice that this optimization objective derived from reconstruction error is identical to the one used in the literature on dictionary learning. This can be seen as another justification for the definition of unsupervised learning as minimizing reconstruction error subject to compression constraints.

The optimization problem above is notoriously hard computationally, and a significant algorithmic and heuristic literature has attempted to give efficient algorithms under various distributional assumptions (see [6, 4, 2] and the references therein). Our approach below circumvents this computational hardness by convex relaxations that result in learning a different creature, albeit with comparable compression and reconstruction objectives.

4.1 Improper dictionary learning: overview

We assume the max column ℓ∞ norm of A is at most 1 and the ℓ1 norm of y is at most k. This is a more general setting than the random dictionaries (up to a re-scaling) that previous works [6, 4] studied.³ In this case, the magnitude of each entry of x is on the order of √k if y has k random ±1 entries. We think of our target error per entry as much smaller than 1.⁴ We consider hypothesis classes H^dict_k that are parametrized by the dictionary matrix A ∈ ℝ^{d×r},

   H^dict_k = { (h_A, g_A) : A ∈ ℝ^{d×r}, ‖A‖_{ℓ1→ℓ∞} ≤ 1 },
   where h_A(x) = argmin_{‖y‖₁ ≤ k} |x − Ay|₁ ,  g_A(y) = Ay.

Here we allow r to be larger than d, the case often called an over-complete dictionary. The choice of the loss can be replaced by the ℓ2 loss (or another Lipschitz loss) without any additional effort, though for simplicity we stick to the ℓ1 loss. Define A* to be the best dictionary under the model and ε* to be the optimal error,

   A* = argmin_{‖A‖_{ℓ1→ℓ∞} ≤ 1} E_{x∼D}[ min_{y∈ℝ^r: ‖y‖₁ ≤ k} |x − Ay|₁ ],
   ε* = E_{x∼D}[ (1/d)·|x − g_{A*}(h_{A*}(x))|₁ ].

Algorithm 1: group encoding/decoding for improper dictionary learning
Inputs: N data points X ∈ ℝ^{d×N} ∼ D^N.
Convex set Q. Sampling probability ρ.

1. Group encoding: Compute

      Z = argmin_{C∈Q} |X − C|₁ ,   (4.1)

   and let Y = h(X) = P_Ω(Z), where P_Ω(B) is a random sampling of B in which each entry is picked with probability ρ.

2. Group decoding: Compute g(Y) = argmin_{C∈Q} |P_Ω(C) − Y|₁ .   (4.2)

Theorem 4.1. For any δ > 0 and p ≥ 1, the hypothesis class H^dict_k is C-learnable with encoding length Õ(k²r^{1/p}/δ²), bias δ + O(ε*) and sample complexity d^{O(p)}, in time n^{O(p²)}.

We note that here r can potentially be much larger than d, since by choosing a large constant p the overhead caused by r can be made negligible. Since the average size of the entries is √k, we can get the bias smaller than the average size of the entries with code length roughly k.

The proof of Theorem 4.1 is deferred to the supplementary material. To demonstrate the key intuition and technique behind it, in the rest of the section we consider a simpler algorithm that achieves a weaker goal: Algorithm 1 encodes multiple examples into codes with the matching average encoding length Õ(k²r^{1/p}/δ²), such that these examples can be decoded from the codes together with reconstruction error ε* + δ. Next, we outline the analysis of Algorithm 1; we will show later that one can reduce the problem of encoding a single example to the problem of encoding multiple examples.

³The assumption can be relaxed straightforwardly to A having ℓ1 norm at most k and ℓ2-norm at most √d.

⁴We are conservative in the scaling of the error here. Error much smaller than √k is already meaningful.

Here we overload the notation g_{A*}(h_{A*}(·)) so that g_{A*}(h_{A*}(X)) denotes the collection of all the g_{A*}(h_{A*}(x_j)), where x_j is the j-th column of X.
Algorithm 1 assumes that there exists a convex set Q ⊆ ℝ^{d×N} such that

   { g_{A*}(h_{A*}(X)) : X ∈ ℝ^{d×N} } ⊆ { AY : ‖A‖_{ℓ1→ℓ∞} ≤ 1, ‖Y‖_{ℓ1→ℓ1} ≤ k } ⊆ Q .   (4.3)

That is, Q is a convex relaxation of the group of reconstructions allowed in the class H^dict_k. Algorithm 1 first uses convex programming to denoise the data X into a clean version Z, which belongs to the set Q. If the set Q has low complexity, then simple random sampling of Z ∈ Q serves as a good encoding.

The following lemma shows that if Q has low complexity in terms of sampling Rademacher width, then Algorithm 1 gives a good group encoding and decoding scheme.

Lemma 4.2. Suppose the convex set Q ⊆ ℝ^{d×N} satisfies condition (4.3). Then Algorithm 1 gives a group encoding and decoding pair such that with probability 1 − δ, the average reconstruction error is bounded by ε* + O(√(SRW_m(Q))) + O(√(log(1/δ)/m)), where m = ρNd and SRW_m(·) is the sampling Rademacher width (defined in the appendix), and the average encoding length is Õ(ρd).

Towards analyzing the algorithm, we will show that the difference between Z and X is comparable to ε*, which is a direct consequence of the optimization over a large set Q that contains the optimal reconstruction. Then we prove that the sampling procedure doesn't lose too much information given that a denoised version of the data is already observed, and thus one can reconstruct Z from Y.

The novelty here is to use these two steps together to denoise and achieve a short encoding. The typical bottleneck of applying convex relaxation to matrix-factorization-based problems (or any other problem) is the difficulty of rounding. Here, instead of pursuing a rounding algorithm that outputs the factors A and Y, we look for a convex relaxation that preserves the intrinsic complexity of the set, which enables the trivial sampling encoding.
It turns out that controlling the width/complexity of the convex relaxation boils down to proving concentration inequalities with sum-of-squares (SoS) proofs, which is conceptually easier than rounding.

Therefore, the remaining challenge is to design a convex set Q that simultaneously (a) is a convex relaxation in the sense of satisfying condition (4.3); (b) admits an efficient optimization algorithm; and (c) has low complexity (that is, sampling Rademacher width Õ(N poly(k))). Concretely, we have the following theorem. We note that these three properties (with Lemma 4.2) imply that Algorithm 1 with Q = Q^sos_p and ρ = O(k²r^{2/p}/(δ²d) · log d) gives a group encoding-decoding pair with average encoding length O(k²r^{2/p}/δ² · log d) and bias δ.

Theorem 4.3. For every p ≥ 4, let N = d^{c₀p} with a sufficiently large absolute constant c₀. Then there exists a convex set Q^sos_p ⊆ ℝ^{d×N} such that (a) it satisfies condition (4.3); (b) the optimizations (4.1) and (4.2) are solvable by semidefinite programming with run-time n^{O(p²)}; and (c) the sampling Rademacher width of Q^sos_p is bounded by √(SRW_m(Q)) ≤ Õ(k²r^{2/p}N/m).

5 Conclusions

We have defined a new framework for unsupervised learning which replaces generative assumptions by notions of reconstruction error and encoding length. This framework is comparative, and allows learning of particular hypothesis classes with respect to an unknown distribution by other hypothesis classes. We demonstrate its usefulness by giving new polynomial-time algorithms for two unsupervised hypothesis classes. First, we give new polynomial-time algorithms for dictionary models in a significantly broader range of parameters and assumptions. Another domain is the class of spectral encodings, for which we consider a new class of models that is shown to strictly encompass PCA and kernel PCA.
This new class is capable, in contrast to previous spectral models, of learning algebraic manifolds. We give efficient learning algorithms for this class based on convex relaxations.

Acknowledgements

We thank Sanjeev Arora for many illuminating discussions and crucial observations in earlier phases of this work, amongst them that a representation which preserves information for all classifiers requires lossless compression.

References

[1] Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Tight bounds for universal compression of large alphabets. In Proceedings of the 2013 IEEE International Symposium on Information Theory, Istanbul, Turkey, July 7-12, 2013, pages 2875–2879, 2013.

[2] Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: Design of dictionaries for sparse representation. In Proceedings of SPARS'05, pages 9–12, 2005.

[3] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. J. Mach. Learn. Res., 15(1):2773–2832, January 2014.

[4] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 113–149, 2015.

[5] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models – going beyond SVD. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 1–10. IEEE, 2012.

[6] Sanjeev Arora, Rong Ge, and Ankur Moitra. New algorithms for learning incoherent and overcomplete dictionaries. arXiv preprint arXiv:1308.6273, 2013.

[7] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. A discriminative framework for clustering via similarity functions. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC '08, pages 671–680, 2008.

[8] Boaz Barak, Jonathan A. Kelner, and David Steurer. Dictionary learning and tensor decomposition via the sum-of-squares method. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 143–151, 2015.

[9] Shai Ben-David and Margareta Ackerman. Measures of clustering quality: A working set of axioms for clustering. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 121–128. Curran Associates, Inc., 2009.

[10] Toby Berger. Rate distortion theory: A mathematical basis for data compression. 1971.

[11] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.

[12] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.

[13] D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inf. Theor., 47(7):2845–2862, September 2006.

[14] Elad Hazan and Satyen Kale. Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012.

[15] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pages 11–20. ACM, 2013.

[16] Nikola Jevtic, Alon Orlitsky, and Narayana P. Santhanam. A lower bound on compression of unknown alphabets. Theor. Comput. Sci., 332(1-3):293–311, 2005.

[17] Jon M. Kleinberg. An impossibility theorem for clustering. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 463–470. MIT Press, 2003.

[18] Roi Livni, David Lehavi, Sagi Schein, Hila Nachlieli, Shai Shalev-Shwartz, and Amir Globerson. Vanishing component analysis. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, 2013.

[19] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

[20] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM J. Comput., 24(2):227–234, 1995.

[21] Alon Orlitsky, Narayana P. Santhanam, and Junan Zhang. Universal compression of memoryless sources over unknown alphabets. IEEE Trans. Information Theory, 50(7):1469–1481, 2004.

[22] Hristo S. Paskov, Robert West, John C. Mitchell, and Trevor Hastie. Compressive feature learning. In Advances in Neural Information Processing Systems, pages 2931–2939, 2013.

[23] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.

[24] Ruslan Salakhutdinov. Learning Deep Generative Models. PhD thesis, University of Toronto, 2009. AAINR61080.

[25] Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and Generalization with the Information Bottleneck, pages 92–107. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.

[26] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.

[27] Naftali Tishby, Fernando C. N. Pereira, and William Bialek. The information bottleneck method. CoRR, physics/0004057, 2000.

[28] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

[29] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

[30] Van Vu. Singular vectors under random perturbation. Random Structures and Algorithms, 39(4):526–538, 2011.