{"title": "MixLasso: Generalized Mixed Regression via Convex Atomic-Norm Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 10868, "page_last": 10876, "abstract": "We consider a generalization of mixed regression where the response is an additive combination of several mixture components. Standard mixed regression is a special case where each response is generated from exactly one component. Typical approaches to the mixture regression problem employ local search methods such as Expectation Maximization (EM) that are prone to spurious local optima. On the other hand, a number of recent theoretically-motivated \\emph{Tensor-based methods} either have high sample complexity, or require the knowledge of the input distribution, which is not available in most of practical situations. In this work, we study a novel convex estimator \\emph{MixLasso} for the estimation of generalized mixed regression, based on an atomic norm specifically constructed to regularize the number of mixture components. Our algorithm gives a risk bound that trades off between prediction accuracy and model sparsity without imposing stringent assumptions on the input/output distribution, and can be easily adapted to the case of non-linear functions. In our numerical experiments on mixtures of linear as well as nonlinear regressions, the proposed method yields high-quality solutions in a wider range of settings than existing approaches.", "full_text": "MixLasso: Generalized Mixed Regression via\n\nConvex Atomic-Norm Regularization\n\nIan E.H. Yen \u2217\u2020 Wei-Cheng Lee \u2021\n\nPradeep Ravikumar \u2217\n\nSung-En Chang \u2021 Kai Zhong \u00a7\n\nShou-De Lin \u2021\n\n\u2217 Carnegie Mellon University\n\n\u2020 Snap Inc.\n\n\u2021 National Taiwan University\n\n\u00a7 Amazon Inc.\n\nAbstract\n\nWe consider a generalization of mixed regression where the response is an additive\ncombination of several mixture components. 
Standard mixed regression is a special case where each response is generated from exactly one component. Typical approaches to the mixture regression problem employ local search methods such as Expectation Maximization (EM) that are prone to spurious local optima. On the other hand, a number of recent theoretically motivated tensor-based methods either have high sample complexity or require knowledge of the input distribution, which is not available in most practical situations. In this work, we study a novel convex estimator, MixLasso, for the estimation of generalized mixed regression, based on an atomic norm specifically constructed to regularize the number of mixture components. Our algorithm gives a risk bound that trades off between prediction accuracy and model sparsity without imposing stringent assumptions on the input/output distribution, and can be easily adapted to the case of non-linear functions. In our numerical experiments on mixtures of linear as well as nonlinear regressions, the proposed method yields high-quality solutions in a wider range of settings than existing approaches.\n\n1 Introduction\n\nThe Mixed Regression (MR) problem considers the estimation of K functions from a collection of input-output samples, where for each sample, the output is generated by one of the K regression functions. When fitting linear functions in a noiseless setting, this is equivalent to solving K linear systems while simultaneously identifying which system each equation belongs to. The MR formulation can be employed as an approach to decompose a complicated function into K simpler ones, by splitting the observations into K classes. Variants of regression families such as piecewise-linear regression can be viewed as special cases of MR.\n\nHowever, the MR problem is NP-hard in general [1] due to the simultaneous fitting of the discrete class labels as well as the regression functions. 
Standard approaches to the mixture problem employ local search methods such as Expectation Maximization (EM) [2] and Variational Bayes [3] that are prone to spurious local optima. There have thus been several lines of recent work studying estimation of mixed regression models with strong statistical guarantees under additional statistical assumptions. For the special case of linear functions with K=2 components, [4] propose a convex nuclear norm minimization formulation that is guaranteed to estimate the two functions with minimax-optimal rates when given a sub-Gaussian design matrix. With the additional conditions of zero noise and isotropic Gaussian inputs, [1] propose an initialization for the EM algorithm to guarantee exact recovery of the true parameters. However, in addition to the stringent statistical assumptions, these methods and results are specialized to the case of two components, and seem non-trivial to generalize.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFor problems with more than two components, most of the existing approaches [5, 6, 7, 8] rely on tensor methods. In particular, for a D-dimensional linear MR problem, [6] propose a convex optimization formulation using a third-order tensor, which results in a computational cost of $O(N D^{12})$ and a sample complexity of $O(D^6/\\epsilon^2)$, limiting its application to problems of small dimension. The Tensor Decomposition approach proposed in [5] has a sample complexity of only $O(D^3 K^4/\\epsilon^2)$ and is computationally efficient. However, it requires knowledge of the input probability distribution in order to derive the score function used in the algorithm, which might not be available, and estimating the density over the D-dimensional input variables could be an even harder problem than MR itself. 
Other recent works [7, 8] show that in the noiseless setting with isotropic Gaussian inputs, an Alternating Minimization algorithm initialized with the Tensor Method leads to exact recovery of the true parameters. These latter methods have sample complexities linear in D, but with $O(K^K)$ and $O(K^{10})$ dependencies on K, respectively. Finally, [9] observed that, under the assumption of well-separated data, one can use a guaranteed clustering algorithm to find the mixture assignment of each observation, and thus solve the MR problem as a by-product. However, the data distributions considered in MR, such as those assumed in [5, 6, 7, 8], are usually not well-separated (see our Figure 3 as an example).\n\nIn this work, we address a generalized version of Mixed Regression where the output can be an additive combination of several mixture components. Our approach follows the general meta-approach emerging in recent years of addressing latent-variable model estimation from the perspective of high-dimensional sparse estimation [10, 11, 12]. We propose a novel convex estimator, MixLasso, for the mixed regression problem, which enforces the mixture structure by minimizing a carefully constructed atomic norm that acts as a surrogate function for the number of mixture components. We then propose a greedy algorithm that generates a steepest-descent component at each iteration by solving a sub-problem similar to MAX-CUT. Our analysis of the algorithm gives a risk bound that trades off prediction accuracy and model sparsity, with a sample complexity that is linear in both D and K, and without imposing any stringent assumptions on, or assuming knowledge of, the input/output distribution beyond that of boundedness, and even allowing for model mis-specification. This makes our MixLasso algorithm a theoretically sound method for a wide range of practical settings. 
Moreover, we also show how our proposed method can be easily extended to the nonlinear regression setting, with regression functions lying in a Reproducing Kernel Hilbert Space (RKHS). Our experiments with both generalized MR and standard MR show that the proposed method finds high-quality solutions in a wider range of settings when compared to existing approaches.\n\n2 Generalized Mixed Regression\n\nIn Generalized Mixed Regression, the response $y \\in \\mathbb{R}$, given covariates $x \\in \\mathcal{X}$, is specified as:\n\n$$y = \\sum_{k=1}^{K} z_k f_k(x) + \\omega \\qquad (1)$$\n\nwhere $z = (z_k)_{k=1}^{K}$ with $z_k \\in \\{0, 1\\}$ is a latent binary vector indicating the presence or absence of each component, and $f_k : \\mathbb{R}^D \\to \\mathbb{R}$ is the regression function of the k-th component. Standard mixed regression is a special case of (1) with the additional constraint $\\|z\\|_0 = 1$. Here $\\omega \\in \\mathbb{R}$ is a noise term with both bias and variance. In other words, we consider the very general setting where we allow for model mis-specification, and in general $E[\\omega \\mid x, z] \\neq 0$. This makes our problem setting in (1) practically plausible, especially when the regression functions $\\{f_k(x)\\}_{k=1}^{K}$ lie in some restricted family such as linear functions. Our goal is to find $F := \\{f_k(x)\\}_{k=1}^{K}$ minimizing the risk\n\n$$r(F) := E\\left[ \\min_{z \\in \\{0,1\\}^K} \\frac{1}{2} \\Big( y - \\sum_{k=1}^{K} z_k f_k(x) \\Big)^2 \\right], \\qquad (2)$$\n\nwhile keeping the number of components K as small as possible. This yields a trade-off between $r(F)$ and K. While one can always achieve a small risk as $K \\to \\infty$, we would like to find the smallest K that achieves such risk.\n\n3 MixLasso: Convex Estimation via Atomic Norm\n\nIn the following, we will first focus on the linear case $f_k(x) := \\langle w_k, x \\rangle$ and consider the extension to nonlinear functions in Section 4.2. Given a collection of i.i.d. 
samples $\\{(x_i, y_i)\\}_{i=1}^{N}$, the $\\ell_2$-regularized Empirical Risk Minimization (ERM) problem for our task (2) is\n\n$$\\min_{W \\in \\mathbb{R}^{K \\times D},\\, z_i \\in \\{0,1\\}^K} \\; \\frac{1}{2N} \\sum_{i=1}^{N} (y_i - z_i^T W x_i)^2 + \\frac{\\tau}{2} \\|W\\|_F^2. \\qquad (3)$$\n\nProblem (3) is a hard optimization problem in general due to the simultaneous minimization w.r.t. the parameters W and the binary hidden variables $\\{z_i\\}_{i=1}^{N}$ [1]. However, given the hidden variables, the problem is convex w.r.t. W, and thus, by duality theory, (3) is equivalent to\n\n$$\\min_{Z \\in \\{0,1\\}^{N \\times K}} \\; \\max_{\\alpha \\in \\mathbb{R}^N} \\; -\\frac{1}{N} \\sum_{i=1}^{N} L^*(y_i, -\\alpha_i) - \\frac{1}{2N^2\\tau} \\mathrm{tr}\\big(D(\\alpha) X X^T D(\\alpha) Z Z^T\\big) \\qquad (4)$$\n\nwhere $Z := (z_i)_{i=1}^{N}$, $D(\\alpha)$ is the diagonal matrix formed by the vector $\\alpha$, and $L^*(y, \\alpha) = y^T \\alpha + \\frac{1}{2}\\|\\alpha\\|^2$ is the convex conjugate of the square loss $L(y, \\xi) = \\frac{1}{2}(y - \\xi)^2$. The maximizer $\\alpha^*$ of (4) and the minimizer $W^*$ of (3) are related by $W^* = \\frac{1}{N\\tau} \\sum_{i=1}^{N} \\alpha_i^* (z_i x_i^T) = \\frac{1}{N\\tau} Z^T D(\\alpha^*) X$.\n\nA key observation for our formulation is that, although (4) is non-convex w.r.t. Z, it is a convex function of $M := ZZ^T$ (since it is a maximum over linear functions of M). Therefore, the intractability of (4) lies only in the combinatorial constraint $M = ZZ^T$ for some $Z \\in \\{0,1\\}^{N \\times K}$. To relax this constraint, we introduce an atomic norm [13] of the form\n\n$$\\|M\\|_S := \\min_{c \\geq 0} \\sum_{a \\in S} c_a \\quad \\text{s.t.} \\quad M = \\sum_{a \\in S} c_a a, \\qquad (5)$$\n\nwhere $S := \\{zz^T \\mid z \\in \\{0,1\\}^N\\}$. Note that if $c_a$ takes integer values in $\\{0, 1\\}$, then $M = \\sum_{a \\in S} c_a a = ZZ^T$ for some $Z \\in \\{0,1\\}^{N \\times K}$ and $\\|M\\|_S = K$. When $c_a$ is allowed to be any nonnegative number, (5) serves as a convex approximation to the number of components K, in a sense similar to the $\\ell_1$-norm as a convex approximation to the number of non-zero elements in Lasso [14]. The MixLasso estimator then solves $\\min_{M \\in \\mathbb{R}_+^{N \\times N}} g(M) + \\lambda \\|M\\|_S$, where\n\n$$g(M) := \\max_{\\alpha \\in \\mathbb{R}^N} \\; -\\frac{1}{2N^2\\tau} \\mathrm{tr}\\big(D(\\alpha) X X^T D(\\alpha) M\\big) - \\frac{1}{N} \\sum_{i=1}^{N} L^*(y_i, -\\alpha_i). \\qquad (6)$$\n\n4 Algorithm\n\nThe convex formulation (6) is still a challenging optimization problem since it involves an atomic norm defined over $\\bar{K} := 2^N$ atoms. An equivalent formulation expresses (6) as the minimization of\n\n$$F(c) := g\\Big( \\sum_{k=1}^{\\bar{K}} c_k z_k z_k^T \\Big) + \\lambda \\|c\\|_1 \\qquad (7)$$\n\nw.r.t. $c \\in \\mathbb{R}_+^{\\bar{K}}$, where $\\{z_k\\}_{k=1}^{\\bar{K}}$ enumerates all $z \\in \\{0,1\\}^N$. We introduce a greedy algorithm (Algorithm 1) for MixLasso, which maintains a sparse set of active components and adds one more active component $z_k z_k^T$ at each iteration, corresponding to the steepest descent direction\n\n$$\\min_{z \\in \\{0,1\\}^N} \\langle \\nabla g(M), zz^T \\rangle = -\\frac{1}{2N^2\\tau} \\max_{z \\in \\{0,1\\}^N} \\langle D(\\alpha^*) X X^T D(\\alpha^*), zz^T \\rangle, \\qquad (8)$$\n\nwhere $\\alpha^*$ is the maximizer in (6). As we show in Section 4.1, (8) is equivalent to a MAX-CUT-like problem that can be solved efficiently with a constant-ratio approximation guarantee. We then minimize (7) w.r.t. the coefficients corresponding to the active components through a sequence of proximal gradient updates:\n\n$$c_k^{s+1} \\leftarrow \\Big[ c_k^{s} - \\frac{1}{\\gamma |A|} \\big( z_k^T \\nabla g(M^s) z_k + \\lambda \\big) \\Big]_+ \\qquad (9)$$\n\nfor $k \\in A$ and $s = 1, \\ldots, S$, where $\\gamma$ is the Lipschitz constant of the coordinate-wise gradient $z_k^T \\nabla g(M) z_k$. The evaluation of $\\nabla g(M^s)$ involves finding the maximizer $\\alpha^*$, which can be obtained by solving the least-squares problem\n\n$$W^* := \\operatorname{argmin}_{W \\in \\mathbb{R}^{|A| \\times D}} \\; \\frac{1}{2N} \\sum_{i=1}^{N} (y_i - z_i^T W x_i)^2 + \\frac{\\tau}{2} \\mathrm{tr}\\big(W^T D^{-1}(c_A) W\\big) \\qquad (10)$$\n\nand computing $\\alpha_i^* = y_i - z_i^T W^* x_i$. Let E be the $N \\times (|A| D)$ design matrix of the least-squares problem (10). By maintaining E and $E^T E$ whenever the active set A changes, solving the least-squares problem (10) has an amortized cost of $O(D^3 |A|^3)$.\n\nAlgorithm 1 A Greedy Algorithm for MixLasso (6)\n\nInitialize $A = \\emptyset$, $c = 0$.\nfor t = 1...T do\n  1. Find a greedy component $zz^T$ by solving (8).\n  2. Add $zz^T$ to the active set A.\n  3. Minimize (7) w.r.t. coordinates $c_A$ in the active set A through updates (9).\n  4. Eliminate $\\{z_k z_k^T \\mid c_k = 0\\}$ from A.\nend for.\n\n4.1 Greedy Generation of Components\n\nProblem (8) for finding the steepest descent direction is a convex maximization problem with binary-valued variables and is hard in general. However, we show that it is equivalent to a Boolean Quadratic Maximization problem similar to MAX-CUT, for which a constant-ratio approximation algorithm exists through a Semidefinite Relaxation [15]. Furthermore, the Semidefinite Relaxation of this type admits scalable solvers whose complexity is linear in the size of the coefficient matrix [16, 17].\n\nLet $C = D(\\alpha^*) X X^T D(\\alpha^*)$. The greedy step (8) solves a problem of the form $\\max_{z \\in \\{0,1\\}^N} \\langle C, zz^T \\rangle$, which can be reduced to a problem over binary variables $v \\in \\{-1, 1\\}^N$ via the transformation $v = 2z - \\mathbf{1}$:\n\n$$\\max_{v \\in \\{-1,1\\}^N} \\; \\frac{1}{4} \\big( \\langle C, vv^T \\rangle + 2 \\langle C, \\mathbf{1} v^T \\rangle + \\langle C, \\mathbf{1}\\mathbf{1}^T \\rangle \\big), \\qquad (11)$$\n\nwhere $\\mathbf{1}$ denotes the N-dimensional vector of all 1s. By introducing a dummy variable $v_0$, (11) is equivalent to\n\n$$\\max_{(v_0; v) \\in \\{-1,1\\}^{N+1}} \\; \\frac{1}{4} \\begin{bmatrix} v_0 \\\\ v \\end{bmatrix}^T \\begin{bmatrix} \\mathbf{1}^T C \\mathbf{1} & \\mathbf{1}^T C \\\\ C \\mathbf{1} & C \\end{bmatrix} \\begin{bmatrix} v_0 \\\\ v \\end{bmatrix}. \\qquad (12)$$\n\nNote that one can always find a solution with $v_0 = 1$ by flipping the signs of the solution, since this does not change the objective value. Let the matrix in (12) be $\\hat{C}$. A problem of the form (12) is a Boolean Quadratic problem similar to MAX-CUT, for which there is a Semidefinite Relaxation of the form\n\n$$\\max_{V \\in \\mathbb{S}^N} \\; \\langle \\hat{C}, V \\rangle \\quad \\text{s.t.} \\quad V \\succeq 0, \\; \\mathrm{diag}(V) = \\mathbf{1}, \\qquad (13)$$\n\nand rounding from which guarantees a solution $\\hat{v}$ to (12) satisfying $\\bar{h} - h(\\hat{v}) \\leq \\rho (\\bar{h} - \\underline{h})$ with $\\rho = 2/5$ [15], where $h(v)$ denotes the objective function of (12) and $\\bar{h}$, $\\underline{h}$ denote the maximum and minimum of the objective in (12), respectively. Note that this result holds for any symmetric matrix $\\hat{C}$. Since our problem has a positive-semidefinite matrix $\\hat{C}$, we have $\\underline{h} = 0$, and therefore the component $z_k$ found this way satisfies\n\n$$-z_k^T \\nabla g(M) z_k = h(\\hat{v}) \\geq \\mu \\bar{h} = \\mu \\max_{z \\in \\{0,1\\}^N} -z^T \\nabla g(M) z \\qquad (14)$$\n\nwith $\\mu = 1 - \\rho = 3/5$. Semidefinite programs of the form (13) allow specialized solvers with an iteration cost linear in the matrix size $\\mathrm{nnz}(\\hat{C})$ [16, 17]. It is worth mentioning that, since our matrix $\\hat{C}$ has the low-rank structure shown in (8), our implementation of the SDP solver [17] can further reduce the complexity per iteration from $\\mathrm{nnz}(\\hat{C})$ to $\\mathrm{nnz}(X)$.\n\n4.2 Nonlinear Extension\n\nA simple way to obtain a nonlinear version of the MixLasso estimator is to consider each component $f_k(x)$ lying in a Reproducing Kernel Hilbert Space (RKHS) $\\mathcal{H}$ with respect to some Mercer kernel $K(\\cdot, \\cdot)$. 
In this setting, given $\\{z_i\\}_{i=1}^{N}$, the minimizer $\\{f_k^*\\}_{k=1}^{K}$ of\n\n$$\\min_{f_k \\in \\mathcal{H}} \\; \\frac{1}{2N} \\sum_{i=1}^{N} \\Big( y_i - \\sum_{k=1}^{K} z_{ik} f_k(x_i) \\Big)^2 + \\frac{\\tau}{2} \\sum_{k=1}^{K} \\|f_k\\|_{\\mathcal{H}}^2 \\qquad (15)$$\n\nsatisfies the condition of the Representer Theorem, which ensures an expression of the form $f_k^*(x) = \\sum_{i=1}^{N} \\alpha_i z_{ik} K(x_i, x)$, $k \\in [K]$, for the minimizer, and results in a MixLasso estimator (6) with\n\n$$g(M) := \\max_{\\alpha \\in \\mathbb{R}^N} \\; -\\frac{1}{2N^2\\tau} \\mathrm{tr}\\big(D(\\alpha) Q D(\\alpha) M\\big) - \\frac{1}{N} \\sum_{i=1}^{N} L^*(y_i, -\\alpha_i) \\qquad (16)$$\n\nwhere Q is the $N \\times N$ kernel matrix with $Q_{ij} = K(x_i, x_j)$. Algorithm 1 can then be applied, the only difference being the evaluation of the gradient $\\nabla g(M)$, which requires finding the maximizer $\\alpha^*$ of (16) by solving the linear system $(\\frac{1}{N\\tau} Q \\circ M + I)\\alpha = y$, where $\\circ$ denotes the elementwise product.\n\n4.3 Rounding Procedure for Generalized & Standard Mixed Regression\n\nWhile the atomic-norm regularization $\\lambda \\|M\\|_S$ is a good convex relaxation of the number of components, the number of non-zero components obtained from the estimator (6) cannot be precisely specified a priori through the hyper-parameter $\\lambda$ directly. In practice, it is often useful to obtain a solution c with exactly $\\|c\\|_0 = K$ non-zeros. This can be achieved by setting the K coefficients of largest magnitude to 1 and all the other coefficients to 0. This results in an $N \\times K$ matrix of hidden assignments $\\hat{Z}$ as the output of Algorithm 1. 
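
The rounding step just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `round_top_k` and the layout of the active components as a list of 0/1 vectors are hypothetical choices made for the example.

```python
def round_top_k(coeffs, components, k):
    """Keep the k active components with the largest |coefficient|.

    coeffs     -- coefficients c_a, one per active component
    components -- list of 0/1 vectors z_a of length N (hypothetical layout)
    k          -- number of components to keep
    Returns (rounded, z_hat): the 0/1 rounded coefficient vector and the
    N x k matrix of hidden assignments, stored as a list of N rows.
    """
    if not 0 < k <= len(coeffs):
        raise ValueError("k must be between 1 and the number of active components")
    # Indices of the k largest-magnitude coefficients, kept in original order.
    by_magnitude = sorted(range(len(coeffs)), key=lambda a: -abs(coeffs[a]))
    top = sorted(by_magnitude[:k])
    rounded = [1 if a in top else 0 for a in range(len(coeffs))]
    n = len(components[0])
    # Row i of z_hat lists, for sample i, membership in each kept component.
    z_hat = [[components[a][i] for a in top] for i in range(n)]
    return rounded, z_hat


coeffs = [0.7, 0.05, 1.3, 0.4]
components = [[1, 0, 1], [0, 0, 1], [0, 1, 1], [1, 1, 0]]
rounded, z_hat = round_top_k(coeffs, components, k=2)
```

Here the two largest coefficients (1.3 and 0.7) are kept, so the first and third components form the columns of the rounded assignment matrix; ties, if any, are broken in favor of the lower index because Python's sort is stable.
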
Then, starting from $\\hat{Z}$, we can perform a number of alternating minimization steps between the model parameters W (or $\\{f_k\\}_{k=1}^{K}$ in general) and the hidden assignments $\\{z_i\\}_{i=1}^{N}$ until convergence, as in a standard EM algorithm (with MAP hard assignment on $z_i$).\n\nWhile we have proposed a solution to the generalized version (1), in some applications it might be of interest to solve the special case of standard mixed regression, where each observation belongs to exactly one mixture component. One approach to converting a generalized mixture solution with K components into a standard mixture of J components is to find the J most frequent patterns $z_1, z_2, \\ldots, z_J$ among the estimated hidden assignments $\\{\\hat{z}_i\\}_{i=1}^{N}$, and then force each observation to choose its hidden assignment $z_i$ from the set $\\{z_j\\}_{j=1}^{J}$ instead of from arbitrary 0-1 patterns in $\\{0, 1\\}^K$. This results in J functions $\\{f_j\\}_{j=1}^{J}$ of the form $f_j(x) = \\sum_{k=1}^{K} z_{jk} f_k(x)$, $j \\in [J]$, being actually used on the training observations, and thus gives a valid model $\\{f_j\\}_{j=1}^{J}$ of standard mixed regression with J components. Then, as noted previously, one can further refine this rounded solution through EM iterates of standard mixed regression, initialized with the component functions $\\{f_j\\}_{j=1}^{J}$.\n\n5 Analysis\n\n5.1 Convergence Analysis\n\nWe assume y and x are bounded such that $|y| \\leq R_y$ and $\\|x\\|_2 \\leq R_x$. Without loss of generality, we assume the data are scaled such that $R_y = R_x = 1$. Then the following theorem guarantees the rate of convergence for Algorithm 1 up to a certain precision determined by the approximation ratio given in (14).\n\nTheorem 1. Let F (c) be the objective (7). 
The greedy algorithm (Algorithm 1) satisfies\n\n$$F(c^T) - F(c^*) \\leq \\frac{2\\gamma \\|c^*\\|_1^2}{\\mu^2} \\Big( \\frac{1}{T} \\Big) \\qquad (17)$$\n\nfor any iterate T satisfying $F(c^T) - F(c^*) \\geq \\frac{2(1-\\mu)}{\\mu} \\lambda \\|c^*\\|_1$, where $c^*$ is any reference solution, $\\mu = 3/5$ is the approximation ratio given by (14), and $\\gamma$ is the Lipschitz constant of the coordinate-wise gradient $z_k^T \\nabla g(M) z_k$, $\\forall k \\in [\\bar{K}]$.\n\nThe following lemma then shows that, with the additional assumption that F(c) is strongly convex over a restricted support set $A^*$, one can get a bound in terms of the $\\ell_0$-norm of the reference solution.\n\nLemma 1. Let $A^* \\subseteq [\\bar{K}]$ be a support set and $c^* := \\arg\\min_{c : \\mathrm{supp}(c) = A^*} F(c)$. Suppose F(c) is strongly convex on $A^*$ with parameter $\\beta$. We have $\\|c^*\\|_1 \\leq \\sqrt{\\frac{2 \\|c^*\\|_0 (F(0) - F(c^*))}{\\beta}}$.\n\nSince $F(0) - F(c^*) \\leq \\frac{1}{2N} \\sum_{i=1}^{N} y_i^2 \\leq 1$, from (17) we have\n\n$$F(c^T) - F(c^*) \\leq \\frac{4\\gamma \\|c^*\\|_0}{\\beta \\mu^2} \\Big( \\frac{1}{T} \\Big) + \\frac{2(1-\\mu)\\lambda}{\\mu} \\sqrt{\\frac{2 \\|c^*\\|_0}{\\beta}} \\qquad (18)$$\n\nfor any $c^* := \\arg\\min_{c : \\mathrm{supp}(c) = A^*} F(c)$.\n\nFigure 1: Results for Noiseless Mixture of Linear Regression with $N(0, I)$ input distribution (Top) and $U(-1, 1)$ input distribution (Bottom), where (Left) D=100, K=3, (Middle) D=20, K=10, and (Right) Generalized Mixture of Regression with D=20, K=3.\n\nFigure 2: Results for Noisy ($\\sigma = 0.1$) Mixture of Linear Regression with $N(0, I)$ input distribution (Top) and $U(-1, 1)$ input distribution (Bottom), where (Left) D=100, K=3, (Middle) D=20, K=10, and (Right) Generalized Mixture of Regression with D=20, K=3.\n\n5.2 Generalization Analysis\n\nIn this section, we investigate the performance of the output of Algorithm 1 in terms of the risk (2). Given coefficients c with support A, we can construct the weight matrix as $\\hat{W}(c) = D(\\sqrt{c_A}) W$ with $W = Z_A^T D(\\alpha^*) X$, where $Z_A = (z_k)_{k \\in A}$ and $\\alpha^*$ is the maximizer in (6) as a function of c. From the duality between (3) and (4), $\\hat{W}$ satisfies\n\n$$F(c_A) = \\frac{1}{2N} \\sum_{i=1}^{N} (y_i - z_i^T \\hat{W} x_i)^2 + \\frac{\\tau}{2} \\|W\\|_F^2 + \\lambda \\|c_A\\|_1. \\qquad (19)$$\n\nThe following theorem gives a risk bound for the output weight matrix $\\hat{W}(c)$ from Algorithm 1.\n\nTheorem 2. Let A, $\\hat{c}$, $\\hat{W}$ be the set of active components, the coefficients, and the corresponding weight matrix obtained from T iterations of Algorithm 1, and let $\\bar{W}$ be the minimizer of the population risk (2) with K components and $\\|\\bar{W}\\|_F \\leq R$. We have $r(\\hat{W}) \\leq r(\\bar{W}) + \\epsilon$ with probability $1 - \\rho$ for $T \\geq \\frac{4\\gamma}{\\mu^2 \\beta} \\big( \\frac{K}{\\epsilon} \\big)$ and $N = \\Omega\\big( \\frac{DK}{\\epsilon^3} \\log( \\frac{RK}{\\epsilon \\rho} ) \\big)$, with $\\lambda$, $\\tau$ chosen appropriately as functions of N.\n\nNote that the output of Algorithm 1 has a number of components $\\hat{K} \\leq T$. Therefore, Theorem 2 gives a trade-off between the suboptimality of the risk, $r(\\hat{W}) - r(\\bar{W}) \\leq \\epsilon$, and the number of components, $\\hat{K} = O(K/\\epsilon)$. Note that the result of Theorem 2 is obtained without distributional assumptions on the input/output (except boundedness), so it is in general not possible to guarantee convergence to an optimal risk with exactly K components, since finding such an optimal solution is NP-hard even when measured by the empirical risk [1]. It remains open whether one can give a tighter result for the estimator (6) that achieves $\\epsilon$-suboptimal risk with the number of components being a constant multiple of K, or derive a bound on the parameter estimation error, possibly with additional assumptions on the observations.\n\n6 Experiments\n\nIn this section, we compare the proposed MixLasso method with other state-of-the-art approaches, listed as follows. (i) EM-Random: A standard EM algorithm that alternates between minimizing over $\\{z_i\\}_{i=1}^{N}$ and over $\\{f_k(x)\\}_{k=1}^{K}$ until convergence, with randomly initialized $W \\sim N(0, I)$ in the linear case and randomly initialized $Z \\sim \\mathrm{Multinoulli}(1/K)$ in the nonlinear case. Each point in the figures is the best result out of 100 random trials. (ii) EM-Tensor: The EM algorithm initialized with the Tensor Method proposed in [8]. The formula of the Tensor Method is derived assuming $x_i \\sim N(0, I)$. We adopt the implementation provided by the author of [7]. 
(iii) ALT-Random: An Alternating Minimization algorithm proposed in [7] with the same initialization strategy and number of trials as EM-Random. (iv) ALT-Tensor: The Alternating Minimization algorithm initialized with the Tensor Method proposed in [7]. The formula of the Tensor Method is derived assuming $x_i \\sim N(0, I)$. We adopt the implementation provided by the author of [7]. (v) MixLasso: The proposed estimator with Algorithm 1. We round our solution to exactly K components according to the rounding procedures described in Section 4.3 for generalized MR and standard MR, respectively. The rounded solution is further refined by EM iterates.\n\nFor the linear case, we compare methods using the root mean square error of the learned parameters W relative to the ground-truth parameters $W^*$ of size $K \\times D$: $\\min_{S : |S| = K} \\frac{\\|W_{S,:} - W^*\\|_F}{\\sqrt{DK}}$, where S denotes a multiset that selects the best matched row in W for each row in $W^*$. For the nonlinear case, we compare methods using the RMSE between the predicted value and the ground-truth function value: $\\sqrt{\\frac{1}{N} \\sum_{i=1}^{N} \\big( \\sum_{k=1}^{K} z_{ik} f_k(x_i) - \\sum_{k=1}^{K} z_{ik}^* f_k^*(x_i) \\big)^2}$.\n\n6.1 Experiments on Synthetic Data\n\nWe generate 14 synthetic data sets according to the model $y_i = \\sum_{k=1}^{K} z_{ik} f_k(x_i) + \\omega_i$, $i \\in [N]$, where Syn1\u223cSyn12 are generated by D-dimensional linear models $f_k(x) = w_k^T x$ and Syn13\u223cSyn14 are generated by a 1-dimensional polynomial model of degree 6: $f_k(x) = \\sum_{j=1}^{6} w_{kj} x^j$. Figures 1 and 2 give experimental results for the linear model in the noiseless and noisy cases, respectively. 
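
As a rough illustration of the data model above, the sketch below draws samples from a generalized mixture of linear regressions. It is not the authors' generator: the function name `sample_generalized_mr`, the uniform inputs, the Gaussian noise, and the rule that each latent bit is switched on with probability 1/2 (with the all-zero pattern excluded) are assumptions made for the example, since the text does not specify how the latent patterns are sampled.

```python
import random


def sample_generalized_mr(weights, n, noise_std=0.0, generalized=True, seed=0):
    """Draw n samples from y = sum_k z_k <w_k, x> + noise.

    weights     -- list of K weight vectors, each of length D
    generalized -- if True, z is a nonzero 0/1 pattern (assumed sampling rule);
                   if False, exactly one component is active (standard MR)
    Returns a list of (x, y, z) triples.
    """
    rng = random.Random(seed)
    k, d = len(weights), len(weights[0])
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        if generalized:
            z = [rng.randint(0, 1) for _ in range(k)]
            if sum(z) == 0:  # assumption: exclude the all-zero pattern
                z[rng.randrange(k)] = 1
        else:
            z = [0] * k
            z[rng.randrange(k)] = 1
        # Additive combination of the active linear components.
        y = sum(zk * sum(w * xi for w, xi in zip(wk, x))
                for zk, wk in zip(z, weights))
        y += rng.gauss(0.0, noise_std)
        data.append((x, y, z))
    return data


weights = [[1.0, 0.0], [0.0, -2.0], [0.5, 0.5]]
samples = sample_generalized_mr(weights, n=50, noise_std=0.0)
standard = sample_generalized_mr(weights, n=50, generalized=False)
```

Passing `generalized=False` recovers the standard mixed regression setting, in which every latent vector has exactly one active component.
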
We observe that, in the case of the Normal input distribution (Syn1, Syn2, Syn7, Syn8) (top row), both the Tensor-initialized methods and MixLasso consistently improve upon randomly initialized EM/ALT (even with 100 trials) in terms of the number of samples required to achieve good performance, where ALT performs better than EM in the higher-dimensional case (D = 100, K = 3) while EM performs better for cases with more components (D = 20, K = 10); meanwhile, MixLasso leads to significant improvements in both cases. On the other hand, when the input distribution becomes $U(-1, 1)$ (Syn4, Syn5, Syn10, Syn11), the Tensor-initialized methods become even worse than the randomly initialized ones, presumably due to the model mis-specification, while MixLasso still consistently improves upon the randomly initialized EM/ALT. Note that we are testing the Tensor Method derived under the Normal assumption on data with Uniform input on purpose. The goal is to see the effect of model misspecification on the Tensor approach, as in practice one would always have model misspecification to some degree. The rightmost columns of Figures 1 and 2 show the results on data generated from the generalized mixed regression model (Syn3, Syn6, Syn9, Syn12), where Tensor-based methods are not applicable, while MixLasso improves upon EM-Random by a large margin. Figure 3 gives a comparison of EM-Random and MixLasso on Mixture of Kernel Regression with polynomial kernel $K(x_i, x_j) = (a x_i^T x_j + b)^d$ (d = 6), where we generate K=4 random 6th-degree polynomial functions $\\{f_k^*\\}_{k=1}^{K}$ by uniformly sampling their coefficients from $U(-4, 4)$. In this setting, we found that EM-Random has a hard time converging to the ground-truth solution even with 100 restarts, while MixLasso obtains a solution close to the ground truth with a small number of samples.\n\nFigure 3: Results on Mixture of 6th-order Polynomial Regression of K=4 components with noise (Bottom) and without noise (Top). (Left) The best result of EM out of 100 random initializations. (Middle) Solution from MixLasso followed by fine-tuning EM iterates. (Right) Comparison in terms of RMSE.\n\nFigure 4: Results of fitting a mixture of polynomial regressions on the Stock data set with an increasing number of samples. The top row shows results fitted by EM, and the bottom row shows those from MixLasso. From left to right we have (left) 100 weeks, (middle) 200 weeks, and (right) 300 weeks. From left to right, the RMSE of EM=(6.33, 6.04, 6.27) and the RMSE of MixLasso=(6.29, 5.75, 5.58).\n\n6.2 Experiments on Real Data\n\nIn this section, we compare MixLasso and EM (with 100 restarts) for fitting a mixture of polynomial regressions on a Stock data set that contains the mixed stock prices of IBM, Facebook, Microsoft and Nvidia over a span of 300 weeks up to February 2018. The task is to automatically recover the company label of each stock price, while fitting the stock price time series of each company as a polynomial curve. Both EM and MixLasso use a polynomial kernel with parameters $K(x_i, x_j) = (2 x_i^T x_j + 2)^8$. The results are shown in Figure 4. We can see that MixLasso almost recovers the pattern when all samples are given, except for a small number of samples generated by Nvidia\u2019s recent rapid growth. MixLasso consistently achieves a lower RMSE across sample sizes, and the RMSE gap between MixLasso and EM increases as the number of samples grows.\n\nAcknowledgements. P.R. 
acknowledges the support of NSF via IIS-1149803, IIS-1664720, DMS-1264033, and ONR via N000141812861.\n\nReferences\n\n[1] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for mixed linear regression. In ICML, pages 613\u2013621, 2014.\n\n[2] Lei Xu, Michael I Jordan, and Geoffrey E Hinton. An alternative model for mixtures of experts. In Advances in neural information processing systems, pages 633\u2013640, 1995.\n\n[3] Christopher M Bishop and Markus Svens\u00e9n. Bayesian hierarchical mixtures of experts. In Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence, pages 57\u201364. Morgan Kaufmann Publishers Inc., 2002.\n\n[4] Yudong Chen, Xinyang Yi, and Constantine Caramanis. 
A convex formulation for mixed\nregression with two components: Minimax optimal rates. In COLT, pages 560\u2013604, 2014.\n\n[5] Hanie Sedghi, Majid Janzamin, and Anima Anandkumar. Provable tensor methods for learning\nmixtures of generalized linear models. In Artificial Intelligence and Statistics, pages 1223\u20131231,\n2016.\n\n[6] Arun T Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions.\nIn Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages\n1040\u20131048, 2013.\n\n[7] Kai Zhong, Prateek Jain, and Inderjit S Dhillon. Mixed linear regression with multiple components.\nIn Advances in Neural Information Processing Systems, pages 2190\u20132198, 2016.\n\n[8] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Solving a mixture of many random\nlinear equations by tensor decomposition and alternating minimization. arXiv preprint\narXiv:1608.05749, 2016.\n\n[9] Paul Hand and Babhru Joshi. A convex program for mixed linear regression with a recovery\nguarantee for well-separated data. arXiv preprint arXiv:1612.06067, 2016.\n\n[10] Ian En-Hsu Yen, Xin Lin, Kai Zhong, Pradeep Ravikumar, and Inderjit Dhillon. A convex\nexemplar-based approach to mad-bayes dirichlet process mixture models. In International\nConference on Machine Learning, pages 2418\u20132426, 2015.\n\n[11] Ian En-Hsu Yen, Xin Lin, Jiong Zhang, Pradeep Ravikumar, and Inderjit Dhillon. A convex\natomic-norm approach to multiple sequence alignment and motif discovery. In International\nConference on Machine Learning, pages 2272\u20132280, 2016.\n\n[12] Ian En-Hsu Yen, Wei-Chen Chang, Sung-En Chang, Arun Sai Suggula, Shou-De Lin, and\nPradeep Ravikumar. Latent feature lasso. International Conference on Machine Learning,\n2017.\n\n[13] Venkat Chandrasekaran, Benjamin Recht, Pablo A Parrilo, and Alan S Willsky. The convex\ngeometry of linear inverse problems. 
Foundations of Computational Mathematics, 12(6):805\u2013849,\n2012.\n\n[14] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal\nStatistical Society. Series B (Methodological), pages 267\u2013288, 1996.\n\n[15] Yurii Nesterov et al. Quality of semidefinite relaxation for nonconvex quadratic optimization.\nUniversit\u00e9 Catholique de Louvain. Center for Operations Research and Econometrics [CORE],\n1997.\n\n[16] Nicolas Boumal, Vlad Voroninski, and Afonso Bandeira. The non-convex Burer-Monteiro approach\nworks on smooth semidefinite programs. In Advances in Neural Information Processing\nSystems, pages 2757\u20132765, 2016.\n\n[17] Po-Wei Wang, Wei-Cheng Chang, and J Zico Kolter. The mixing method: coordinate descent\nfor low-rank semidefinite programming. arXiv preprint arXiv:1706.00476, 2017.\n", "award": [], "sourceid": 6961, "authors": [{"given_name": "Ian En-Hsu", "family_name": "Yen", "institution": "Carnegie Mellon University"}, {"given_name": "Wei-Cheng", "family_name": "Lee", "institution": "National Taiwan University"}, {"given_name": "Kai", "family_name": "Zhong", "institution": "Amazon"}, {"given_name": "Sung-En", "family_name": "Chang", "institution": "Northeastern University"}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": "Carnegie Mellon University"}, {"given_name": "Shou-De", "family_name": "Lin", "institution": "National Taiwan University"}]}