{"title": "QUIC & DIRTY: A Quadratic Approximation Approach for Dirty Statistical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2006, "page_last": 2014, "abstract": "In this paper, we develop a family of algorithms for optimizing superposition-structured\u201d or \u201cdirty\u201d statistical estimators for high-dimensional problems involving the minimization of the sum of a smooth loss function with a hybrid regularization. Most of the current approaches are first-order methods, including proximal gradient or Alternating Direction Method of Multipliers (ADMM). We propose a new family of second-order methods where we approximate the loss function using quadratic approximation. The superposition structured regularizer then leads to a subproblem that can be efficiently solved by alternating minimization. We propose a general active subspace selection approach to speed up the solver by utilizing the low-dimensional structure given by the regularizers, and provide convergence guarantees for our algorithm. Empirically, we show that our approach is more than 10 times faster than state-of-the-art first-order approaches for the latent variable graphical model selection problems and multi-task learning problems when there is more than one regularizer. For these problems, our approach appears to be the first algorithm that can extend active subspace ideas to multiple regularizers.\"", "full_text": "QUIC & DIRTY: A Quadratic Approximation\n\nApproach for Dirty Statistical Models\n\nCho-Jui Hsieh, Inderjit S. Dhillon, Pradeep Ravikumar\n\nUniversity of Texas at Austin\n\nAustin, TX 78712 USA\n\n{cjhsieh,inderjit,pradeepr}@cs.utexas.edu\n\nStephen Becker\n\nUniversity of Colorado at Boulder\n\nBoulder, CO 80309 USA\n\nstephen.becker@colorado.edu\n\nPeder A. Olsen\n\nIBM T.J. 
Watson Research Center\nYorktown Heights, NY 10598 USA\n\npederao@us.ibm.com\n\nAbstract\n\nIn this paper, we develop a family of algorithms for optimizing \u201csuperposition-\nstructured\u201d or \u201cdirty\u201d statistical estimators for high-dimensional problems involv-\ning the minimization of the sum of a smooth loss function with a hybrid reg-\nularization. Most of the current approaches are \ufb01rst-order methods, including\nproximal gradient or Alternating Direction Method of Multipliers (ADMM). We\npropose a new family of second-order methods where we approximate the loss\nfunction using quadratic approximation. The superposition structured regularizer\nthen leads to a subproblem that can be ef\ufb01ciently solved by alternating minimiza-\ntion. We propose a general active subspace selection approach to speed up the\nsolver by utilizing the low-dimensional structure given by the regularizers, and\nprovide convergence guarantees for our algorithm. Empirically, we show that our\napproach is more than 10 times faster than state-of-the-art \ufb01rst-order approaches\nfor the latent variable graphical model selection problems and multi-task learning\nproblems when there is more than one regularizer. For these problems, our ap-\nproach appears to be the \ufb01rst algorithm that can extend active subspace ideas to\nmultiple regularizers.\n\n1\n\nIntroduction\n\nFrom the considerable amount of recent research on high-dimensional statistical estimation, it has\nnow become well understood that it is vital to impose structural constraints upon the statistical model\nparameters for their statistically consistent estimation. 
These structural constraints take the form of sparsity, group-sparsity, and low-rank structure, among others; see [18] for unified statistical views of such structural constraints. In recent years, such "clean" structural constraints are frequently proving insufficient, and accordingly there has been a line of work on "superposition-structured" or "dirty model" constraints, where the model parameter is expressed as the sum of a number of parameter components, each of which has its own structure. For instance, [4, 6] consider the estimation of a matrix that is neither low-rank nor sparse, but which can be decomposed into the sum of a low-rank matrix and a sparse outlier matrix (this corresponds to robust PCA when the matrix-structured parameter corresponds to a covariance matrix). [5] use such a matrix decomposition to estimate the structure of latent-variable Gaussian graphical models. [15] in turn use a superposition of sparse and group-sparse structure for multi-task learning. For other recent work on such superposition-structured models, see [1, 7, 14]. For a unified statistical view of such superposition-structured models, and the resulting classes of M-estimators, please see [27].

Consider a general superposition-structured parameter $\bar{\theta} := \sum_{r=1}^k \theta^{(r)}$, where $\{\theta^{(r)}\}_{r=1}^k$ are the parameter components, each with its own structure. Let $\{\mathcal{R}^{(r)}(\cdot)\}_{r=1}^k$ be regularization functions suited to the respective parameter components, and let $\mathcal{L}(\cdot)$ be a (typically non-linear) loss function
We now have the notation to consider a popular class of M-estimators studied in the papers above for these superposition-structured models:

$$\min_{\{\theta^{(r)}\}_{r=1}^k} \; \Big\{ \mathcal{L}\Big(\sum_r \theta^{(r)}\Big) + \sum_r \lambda_r \mathcal{R}^{(r)}(\theta^{(r)}) \Big\} := F(\theta), \qquad (1)$$

where $\{\lambda_r\}_{r=1}^k$ are regularization penalties. In (1), the overall regularization contribution is separable in the individual parameter components, but the loss function term itself is not, and depends on the sum $\bar{\theta} := \sum_{r=1}^k \theta^{(r)}$. Throughout the paper, we use $\bar{\theta}$ to denote the overall superposition-structured parameter, and $\theta = [\theta^{(1)}, \ldots, \theta^{(k)}]$ to denote the concatenation of all the parameters.

Due to the wide applicability of this class of M-estimators in (1), there has been a line of work on developing efficient optimization methods for solving special instances of this class of M-estimators [14, 26], in addition to the papers listed above. In particular, due to the superposition structure in (1) and the high dimensionality of the problem, this class seems naturally amenable to a proximal gradient descent approach or the ADMM method [2, 17]; note that these are first-order methods and are thus very scalable.

In this paper, we consider instead a proximal Newton framework to minimize the M-estimation objective in (1). Specifically, we use iterative quadratic approximations, and for each of the quadratic subproblems, we use an alternating minimization approach to individually update each of the parameter components comprising the superposition structure. Note that the Hessian of the loss might be structured, as for instance with the logdet loss for inverse covariance estimation and the logistic loss, which allows us to develop very efficient second-order methods.
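To fix ideas, the objective $F(\theta)$ in (1) is simply a loss evaluated at the summed parameter plus separable penalties on the individual components. A toy numpy sketch (illustrative names and a toy sparse + low-rank instance, not the authors' code):

```python
import numpy as np

def dirty_objective(theta_parts, loss, regs, lams):
    """F(theta) from (1): loss of the summed parameter plus
    separable penalties on the individual components."""
    theta_bar = sum(theta_parts)                  # \bar{theta} = sum_r theta^(r)
    return loss(theta_bar) + sum(l * R(t) for l, R, t in zip(lams, regs, theta_parts))

# toy instance: squared loss, sparse (l1) + low-rank (nuclear norm) components
M = np.diag([3.0, 2.0, 1.0])                      # toy "data" matrix
loss = lambda T: 0.5 * np.linalg.norm(T - M) ** 2
l1   = lambda T: np.abs(T).sum()                  # sparse penalty
nuc  = lambda T: np.linalg.svd(T, compute_uv=False).sum()  # low-rank penalty

S = np.zeros((3, 3)); L = np.zeros((3, 3))
F0 = dirty_objective([S, L], loss, [l1, nuc], [0.1, 0.1])
assert np.isclose(F0, 7.0)   # 0.5 * (9 + 4 + 1); both penalties vanish at zero
```

The point of the structure is visible even in this sketch: the penalties decouple over the components, while the loss couples them only through their sum.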
Even given this structure,\nsolving the regularized quadratic problem in order to obtain the proximal Newton direction is too\nexpensive due to the high dimensional setting. The key algorithmic contribution of this paper is in\ndeveloping a general active subspace selection framework for general decomposable norms, which\nallows us to solve the proximal Newton steps over a signi\ufb01cantly reduced search space. We are\nable to do so by leveraging the structural properties of decomposable regularization functions in the\nM-estimator in (1).\nOur other key contribution is theoretical. While recent works [16, 21] have analyzed the conver-\ngence of proximal Newton methods, the superposition-structure here poses a key caveat: since the\nloss function term only depends on the sum of the individual parameter components, the Hessian is\nnot positive-de\ufb01nite, as is required in previous analyses of proximal Newton methods. The theoret-\nical analysis [9] relaxes this assumption by instead assuming the loss is self-concordant but again\nallows at most one regularizer. Another key theoretical dif\ufb01culty is our use of active subspace se-\nlection, where we do not solve for the vanilla proximal Newton direction, but solve the proximal\nNewton step subproblem only over a restricted subspace, which moreover varies with each step. We\ndeal with these issues and show super-linear convergence of the algorithm when the sub-problems\nare solved exactly. We apply our algorithm to two real world applications: latent Gaussian Markov\nrandom \ufb01eld (GMRF) structure learning (with low-rank + sparse structure), and multitask learning\n(with sparse + group sparse structure), and demonstrate that our algorithm is more than ten times\nfaster than state-of-the-art methods.\nOverall, our algorithmic and theoretical developments open up the state of the art but forbidding\nclass of M-estimators in (1) to very large-scale problems.\nOutline of the paper. 
We begin by introducing some background in Section 2. In Section 3, we propose our quadratic approximation framework with active subspace selection for general dirty statistical models. We derive the convergence guarantees of our algorithm in Section 4. Finally, in Section 5, we apply our model to two real applications, and show experimental comparisons with other state-of-the-art methods.

2 Background and Applications

Decomposable norms. We consider the case where all the regularizers $\{\mathcal{R}^{(r)}\}_{r=1}^k$ are decomposable norms $\|\cdot\|_{A_r}$. A norm $\|\cdot\|$ is decomposable at $x$ if there is a subspace $T$ and a vector $e \in T$ such that the subdifferential at $x$ has the following form:

$$\partial \|x\| = \{\rho \in \mathbb{R}^n \mid \Pi_T(\rho) = e \ \text{and} \ \|\Pi_{T^\perp}(\rho)\|^*_{A_r} \le 1\}, \qquad (2)$$

where $\Pi_T(\cdot)$ is the orthogonal projection onto $T$, and $\|x\|^* := \sup_{\|a\|\le 1}\langle x, a\rangle$ is the dual norm of $\|\cdot\|$. The decomposable norm was defined in [3, 18], and many interesting regularizers belong to this category, including:

• Sparse vectors: for the $\ell_1$ regularizer, $T$ is the span of all points with the same support as $x$.
• Group sparse vectors: suppose the index set can be partitioned into a set of $N_G$ disjoint groups, say $G = \{G_1, \ldots, G_{N_G}\}$, and define the $(1,\alpha)$-group norm by $\|x\|_{1,\alpha} := \sum_{t=1}^{N_G} \|x_{G_t}\|_\alpha$. If $S_G$ denotes the subset of groups where $x_{G_t} \ne 0$, then the subgradient has the following form:

$$\partial\|x\|_{1,\alpha} := \Big\{\rho \;\Big|\; \rho = \sum_{t\in S_G} x_{G_t}/\|x_{G_t}\|^*_\alpha + \sum_{t\notin S_G} m_t \Big\},$$

where $\|m_t\|^*_\alpha \le 1$ for all $t \notin S_G$.
Therefore, the group sparse norm is also decomposable, with

$$T := \{x \mid x_{G_t} = 0 \ \text{for all } t \notin S_G\}. \qquad (3)$$

• Low-rank matrices: for the nuclear norm regularizer $\|\cdot\|_*$, which is defined to be the sum of singular values, the subgradient can be written as

$$\partial\|X\|_* = \{UV^T + W \mid U^T W = 0,\; WV = 0,\; \|W\|_2 \le 1\},$$

where $\|\cdot\|_2$ is the matrix 2-norm and $U, V$ are the left/right singular vectors of $X$ corresponding to non-zero singular values. The above subgradient can also be written in the decomposable form (2), where $T$ is defined to be $\mathrm{span}(\{u_i v_j^T\}_{i,j=1}^k)$, with $\{u_i\}_{i=1}^k$ and $\{v_j\}_{j=1}^k$ the columns of $U$ and $V$.

Applications. Next we discuss some widely used applications of superposition-structured models, and the corresponding instances of the class of M-estimators in (1).

• Gaussian graphical model with latent variables: let $\Theta$ denote the precision matrix with corresponding covariance matrix $\Sigma = \Theta^{-1}$.
[5] showed that the precision matrix will have a low-rank + sparse structure when some random variables are hidden; thus $\Theta = S - L$ can be estimated by solving the following regularized MLE problem:

$$\min_{S,L:\; L \succeq 0,\; S-L \succ 0} \ -\log\det(S-L) + \langle S-L, \Sigma\rangle + \lambda_S \|S\|_1 + \lambda_L \,\mathrm{trace}(L). \qquad (4)$$

While proximal Newton methods have recently become a dominant technique for solving $\ell_1$-regularized log-determinant problems [12, 10, 13, 19], our development is the first to apply proximal Newton methods to log-determinant problems with both sparse and low-rank regularizers.

• Multi-task learning: given $k$ tasks, each with sample matrix $X^{(r)} \in \mathbb{R}^{n_r \times d}$ ($n_r$ samples in the $r$-th task) and labels $y^{(r)}$, [15] proposes minimizing the following objective:

$$\min_{S,B} \ \sum_{r=1}^k \ell\big(y^{(r)}, X^{(r)}(S^{(r)} + B^{(r)})\big) + \lambda_S\|S\|_1 + \lambda_B\|B\|_{1,\infty}, \qquad (5)$$

where $\ell(\cdot)$ is the loss function and $S^{(r)}$ is the $r$-th column of $S$.

• Noisy PCA: to recover a covariance matrix corrupted with sparse noise, a widely used technique is to solve the matrix decomposition problem [6]. In contrast to the squared loss above, an exponential PCA problem [8] would use a Bregman divergence for the loss function.

3 Our proposed framework

To perform a Newton-like step, we iteratively form quadratic approximations of the smooth loss function. Generally the quadratic subproblem will have a large number of variables and will be hard to solve. Therefore we propose a general active subspace selection technique to reduce the problem size by exploiting the structure of the regularizers $\mathcal{R}_1, \ldots, \mathcal{R}_k$.

3.1 Quadratic Approximation

Given $k$ sets of variables $\theta = [\theta^{(1)}, \ldots
, \theta^{(k)}]$, with each $\theta^{(r)} \in \mathbb{R}^n$, let $\Delta^{(r)}$ denote the perturbation of $\theta^{(r)}$, and let $\Delta = [\Delta^{(1)}, \ldots, \Delta^{(k)}]$. We define $g(\theta) := \mathcal{L}(\sum_{r=1}^k \theta^{(r)}) = \mathcal{L}(\bar\theta)$ to be the loss function, and $h(\theta) := \sum_{r=1}^k \mathcal{R}^{(r)}(\theta^{(r)})$ to be the regularization. Given the current estimate $\theta$, we form the quadratic approximation of the smooth loss function:

$$\bar{g}(\theta + \Delta) = g(\theta) + \sum_{r=1}^k \langle \Delta^{(r)}, G\rangle + \frac{1}{2}\Delta^T \mathcal{H} \Delta, \qquad (6)$$

where $G = \nabla\mathcal{L}(\bar\theta)$ is the gradient of $\mathcal{L}$ and $\mathcal{H}$ is the Hessian matrix of $g(\theta)$. Note that $\nabla_{\bar\theta}\mathcal{L}(\bar\theta) = \nabla_{\theta^{(r)}}\mathcal{L}(\bar\theta)$ for all $r$, so we simply write $\nabla$ and refer to the gradient at $\bar\theta$ as $G$ (and similarly for $\nabla^2$). By the chain rule, we can show:

Lemma 1. The Hessian matrix of $g(\theta)$ is

$$\mathcal{H} := \nabla^2 g(\theta) = \begin{bmatrix} H & \cdots & H \\ \vdots & \ddots & \vdots \\ H & \cdots & H \end{bmatrix}, \qquad H := \nabla^2\mathcal{L}(\bar\theta). \qquad (7)$$

In this paper we focus on the case where $H$ is positive definite. When it is not, we add a small constant $\epsilon$ to the diagonal of $H$ to ensure that each block is positive definite. Note that the full Hessian $\mathcal{H}$ will in general not be positive definite (in fact $\mathrm{rank}(\mathcal{H}) = \mathrm{rank}(H)$). However, based on its special structure, we can still give convergence guarantees (along with the rate of convergence) for our algorithm. The Newton direction $d$ is defined to be:

$$[d^{(1)}, \ldots, d^{(k)}] = \operatorname*{argmin}_{\Delta^{(1)},\ldots,\Delta^{(k)}} \ \bar{g}(\theta + \Delta) + \sum_{r=1}^k \lambda_r \|\theta^{(r)} + \Delta^{(r)}\|_{A_r} := Q_H(\Delta;\theta). \qquad (8)$$

The quadratic subproblem (8) cannot be directly separated into $k$ parts because the Hessian matrix (7) is not block-diagonal. Also, each set of parameters has its own regularizer, so it is hard to solve them all together.
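Before describing the solver, note that the block structure in Lemma 1 can be exploited directly: every block row of $\mathcal{H}\Delta$ equals $H$ applied to the sum of the $\Delta^{(r)}$, so products with the full $kn \times kn$ Hessian never require forming it. A minimal numpy sketch (illustrative names, not the authors' code):

```python
import numpy as np

def full_hessian_times(H, deltas):
    """Product of the block Hessian in (7) with Delta = [Delta^(1),...,Delta^(k)].
    Each block row of the result is H @ (sum_r Delta^(r)), so one n x n
    multiply suffices instead of a (kn) x (kn) one."""
    Hs = H @ sum(deltas)
    return [Hs for _ in deltas]

# tiny sanity check against the explicit (kn) x (kn) matrix
rng = np.random.default_rng(0)
n, k = 3, 2
H = rng.standard_normal((n, n)); H = H @ H.T        # symmetric PD block
deltas = [rng.standard_normal(n) for _ in range(k)]

big = np.block([[H] * k] * k)                       # the full Hessian of (7)
explicit = big @ np.concatenate(deltas)
fast = np.concatenate(full_hessian_times(H, deltas))
assert np.allclose(explicit, fast)
# rank(full Hessian) = rank(H), as noted after Lemma 1
assert np.linalg.matrix_rank(big) == np.linalg.matrix_rank(H)
```

The same identity is what makes maintaining the terms $H\Delta^{(r)}$ across block updates cheap.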
Therefore, to solve (8), we propose a block coordinate descent method. At each iteration, we pick a variable set $\Delta^{(r)}$, $r \in \{1, 2, \ldots, k\}$, in a cyclic (or random) order, and update $\Delta^{(r)}$ while keeping the other parameters fixed. If $\Delta$ is the current solution (for all the variable sets), then the subproblem with respect to $\Delta^{(r)}$ can be written as

$$\Delta^{(r)} \leftarrow \operatorname*{argmin}_{d \in \mathbb{R}^n} \ \frac{1}{2} d^T H d + \Big\langle d,\; G + \sum_{t:\,t\ne r} H\Delta^{(t)} \Big\rangle + \lambda_r \|\theta^{(r)} + d\|_{A_r}. \qquad (9)$$

The subproblem (9) is just a typical quadratic problem with a specific regularizer, so efficient algorithms already exist for different choices of $\|\cdot\|_A$. For the $\ell_1$ norm regularizer, coordinate descent methods can be applied to solve (9) efficiently, as used in [12, 21]; (accelerated) proximal gradient descent or projected Newton's method can also be used, as shown in [19]. For a general atomic norm where there might be infinitely many atoms (coordinates), a greedy coordinate descent approach can be applied, as shown in [22].

To iterate between different groups of parameters, we have to maintain the term $\sum_{r=1}^k H\Delta^{(r)}$ during the Newton iteration. Directly computing $H\Delta^{(r)}$ requires $O(n^2)$ flops; however, the Hessian matrix often has a special structure so that $H\Delta^{(r)}$ can be computed efficiently. For example, in the inverse covariance estimation problem $H = \Theta^{-1} \otimes \Theta^{-1}$, where $\Theta^{-1}$ is the current estimate of the covariance, and in the empirical risk minimization problem $H = XDX^T$, where $X$ is the data matrix and $D$ is diagonal.

After solving the subproblem (8), we have to search for a suitable step size. We apply an Armijo rule for line search [24], where we test the step sizes $\alpha = 2^0, 2^{-1}, \ldots$
until the following sufficient decrease condition is satisfied for a pre-specified $\sigma \in (0,1)$ (typically $\sigma = 10^{-4}$):

$$F(\theta + \alpha\Delta) \le F(\theta) + \alpha\sigma\delta, \qquad \delta = \langle G, \Delta\rangle + \sum_{r=1}^k \lambda_r\|\theta^{(r)} + \Delta^{(r)}\|_{A_r} - \sum_{r=1}^k \lambda_r\|\theta^{(r)}\|_{A_r}. \qquad (10)$$

3.2 Active Subspace Selection

Since the quadratic subproblem (8) contains a large number of variables, directly applying the above quadratic approximation framework is not efficient. In this subsection, we provide a general active subspace selection technique, which dramatically reduces the number of variables by exploiting the structure of the regularizers. A similar method has been discussed in [12] for the $\ell_1$ norm and in [11] for the nuclear norm, but it has not been generalized to all decomposable norms. Furthermore, a key point to note is that in this paper our active subspace selection is not merely a heuristic: it comes with the strong convergence guarantees that we derive in Section 4.

Given the current $\theta$, our subspace selection approach partitions each $\theta^{(r)}$ into $S^{(r)}_{\mathrm{fixed}}$ and $S^{(r)}_{\mathrm{free}} = (S^{(r)}_{\mathrm{fixed}})^\perp$, and then restricts the search space of the Newton direction in (8) to $S^{(r)}_{\mathrm{free}}$, which yields the following quadratic approximation problem:

$$[d^{(1)}, \ldots, d^{(k)}] = \operatorname*{argmin}_{\Delta^{(1)}\in S^{(1)}_{\mathrm{free}},\,\ldots,\,\Delta^{(k)}\in S^{(k)}_{\mathrm{free}}} \ \bar{g}(\theta + \Delta) + \sum_{r=1}^k \lambda_r\|\theta^{(r)} + \Delta^{(r)}\|_{A_r}. \qquad (11)$$

Each group of parameters has its own fixed/free subspace, so we now focus on a single parameter component $\theta^{(r)}$. An ideal subspace selection procedure would satisfy:

Property (I).
Given the current iterate $\theta$, any update along a direction in the fixed set, for instance $\theta^{(r)} \leftarrow \theta^{(r)} + a$ with $a \in S^{(r)}_{\mathrm{fixed}}$, does not improve the objective function value.

Property (II). The subspace $S_{\mathrm{free}}$ converges to the support of the final solution in a finite number of iterations.

Suppose that, given the current iterate, we first do updates along directions in the fixed set, and then do updates along directions in the free set. Property (I) ensures that this is equivalent to ignoring updates along directions in the fixed set in the current iteration, and focusing on updates along the free set. As we will show in the next section, this property suffices to ensure global convergence of our procedure. Property (II) will be used to derive the asymptotic quadratic convergence rate.

We will now discuss our active subspace selection strategy, which satisfies both properties above. Consider the parameter component $\theta^{(r)}$ and its corresponding regularizer $\|\cdot\|_{A_r}$. Based on the definition of decomposable norm in (2), there exists a subspace $T_r$ such that $\Pi_{T_r}(\rho)$ is a fixed vector for any subgradient $\rho$ of $\|\cdot\|_{A_r}$. The following proposition explores some properties of the subdifferential of the overall objective $F(\theta)$ in (1).

Proposition 1. Consider any unit-norm vector $a$, with $\|a\|_{A_r} = 1$, such that $a \in T_r^\perp$.

(a) The inner product of the subdifferential $\partial_{\theta^{(r)}} F(\theta)$ with $a$ satisfies

$$\langle a, \partial_{\theta^{(r)}} F(\theta)\rangle \subseteq [\langle a, G\rangle - \lambda_r,\; \langle a, G\rangle + \lambda_r]. \qquad (12)$$

(b) Suppose $|\langle a, G\rangle| \le \lambda_r$. Then $0 \in \operatorname*{argmin}_{\sigma} F(\theta + \sigma a)$.

See Appendix 7.8 for the proof. Note that $G = \nabla\mathcal{L}(\bar\theta)$ denotes the gradient of $\mathcal{L}$.
The proposition thus implies that if $|\langle a, G\rangle| \le \lambda_r$ and $S^{(r)}_{\mathrm{fixed}} \subset T_r^\perp$, then Property (I) immediately follows. The difficulty is that the set $\{a \mid |\langle a, G\rangle| \le \lambda_r\}$ is possibly hard to characterize, and even if we could characterize this set, it may not be amenable enough for the optimization solvers to leverage in order to provide a speedup. Therefore, we propose an alternative characterization of the fixed subspace:

Definition 1. Let $\theta^{(r)}$ be the current iterate, and let $\mathrm{prox}^{(r)}_\lambda$ be the proximal operator defined by

$$\mathrm{prox}^{(r)}_\lambda(x) = \operatorname*{argmin}_y \ \frac{1}{2}\|y - x\|^2 + \lambda\|y\|_{A_r},$$

and let $T_r(x)$ be the subspace of the decomposable norm (2) $\|\cdot\|_{A_r}$ at the point $x$. We define the fixed/free subsets at $\theta^{(r)}$ as:

$$S^{(r)}_{\mathrm{fixed}} := [T(\theta^{(r)})]^\perp \cap [T(\mathrm{prox}^{(r)}_{\lambda_r}(G))]^\perp, \qquad S^{(r)}_{\mathrm{free}} = \big(S^{(r)}_{\mathrm{fixed}}\big)^\perp. \qquad (13)$$

From the definition of the proximal operator and Definition 1, it can be shown that $|\langle a, G\rangle| < \lambda_r$ for directions $a$ in the fixed subspace, so that we have local optimality in the direction $a$ as before. We have the following proposition:

Proposition 2. Let $S^{(r)}_{\mathrm{fixed}}$ be the fixed subspace defined in Definition 1. Then

$$0 = \operatorname*{argmin}_{\Delta^{(r)}\in S^{(r)}_{\mathrm{fixed}}} Q_H([0, \ldots, 0, \Delta^{(r)}, 0, \ldots, 0]; \theta).$$

We will prove in Section 4 that $S_{\mathrm{free}}$ as defined above converges to the final support, as required by Property (II) above.
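For intuition, Definition 1 can be specialized to the $\ell_1$ norm, where the proximal operator is elementwise soft-thresholding: a coordinate is fixed exactly when it is currently zero and its gradient entry is small. A minimal sketch (assumed names, not the authors' implementation):

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of lam * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def l1_free_set(theta, grad, lam):
    """Free coordinates per Definition 1, specialized to the l1 norm:
    coordinate i is fixed iff theta_i == 0 AND it stays zero under the
    prox of the gradient, i.e. |grad_i| <= lam."""
    return (theta != 0) | (soft_threshold(grad, lam) != 0)

theta = np.array([1.5, 0.0, 0.0, -0.2])
grad  = np.array([0.3, 0.05, -2.0, 0.1])
free  = l1_free_set(theta, grad, lam=1.0)
# coordinates 0 and 3 are nonzero; coordinate 2 has |grad| > lam
assert free.tolist() == [True, False, True, True]
```

The quadratic subproblem is then solved only over the free coordinates, which is where the speedup comes from.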
We will now detail some examples of the fixed/free subsets defined above.

• For $\ell_1$ regularization: $S_{\mathrm{fixed}} = \mathrm{span}\{e_i \mid \theta_i = 0 \text{ and } |\nabla_i \mathcal{L}(\bar\theta)| \le \lambda\}$, where $e_i$ is the $i$-th canonical vector.
• For nuclear norm regularization: the selection scheme can be written as

$$S_{\mathrm{free}} = \{U_A M V_A^T \mid M \in \mathbb{R}^{k\times k}\}, \qquad (14)$$

where $U_A = \mathrm{span}(U, U_g)$ and $V_A = \mathrm{span}(V, V_g)$, with $\Theta = U\Sigma V^T$ the thin SVD of $\Theta$, and $U_g, V_g$ the left and right singular vectors of $\mathrm{prox}_\lambda(\Theta - \nabla\mathcal{L}(\Theta))$. The proximal operator $\mathrm{prox}_\lambda(\cdot)$ in this case corresponds to singular-value soft-thresholding, and can be computed by randomized SVD or the Lanczos algorithm.
• For group sparse regularization: in the $(1,2)$-group norm case, let $S_G$ be the set of nonzero groups. The fixed groups $F_G$ can be defined by $F_G := \{i \mid i \notin S_G \text{ and } \|\nabla_{G_i}\mathcal{L}(\bar\theta)\| \le \lambda\}$, and the free subspace is

$$S_{\mathrm{free}} = \{\theta \mid \theta_i = 0 \ \forall i \in F_G\}. \qquad (15)$$

Figure 3 (in the appendix) shows that active subspace selection can significantly improve the speed of the block coordinate descent algorithm [20].

Algorithm 1: QUIC & DIRTY: Quadratic Approximation Framework for Dirty Statistical Models
Input: Loss function $\mathcal{L}(\cdot)$, regularizers $\lambda_r\|\cdot\|_{A_r}$ for $r = 1, \ldots, k$, and initial iterate $\theta_0$.
Output: Sequence $\{\theta_t\}$ such that $\{\bar\theta_t\}$ converges to $\bar\theta^\star$.
1 for $t = 0, 1, \ldots$ do
2   Compute $\bar\theta_t \leftarrow \sum_{r=1}^k \theta^{(r)}_t$.
3   Compute $\nabla\mathcal{L}(\bar\theta_t)$.
4   Compute $S_{\mathrm{free}}$ by (13).
5   for sweep $= 1, \ldots, T_{\mathrm{outer}}$ do
6     for $r = 1, \ldots, k$ do
7       Solve the subproblem (9) within $S^{(r)}_{\mathrm{free}}$.
8       Update $\sum_{r=1}^k \nabla^2\mathcal{L}(\bar\theta_t)\Delta^{(r)}$.
9   Find the step size $\alpha$ by (10).
10  $\theta^{(r)} \leftarrow \theta^{(r)} + \alpha\Delta^{(r)}$ for all $r = 1, \ldots, k$.

4 Convergence

The recently developed theoretical analysis of proximal Newton methods [16, 21] cannot be directly applied here because (1) we have the active subspace selection step, and (2) the Hessian matrix of each quadratic subproblem is not positive definite. We first prove global convergence of our algorithm when the quadratic approximation subproblem (11) is solved exactly. Interestingly, in our proof we show that active subspace selection can be modeled within the framework of the Block Coordinate Gradient Descent algorithm [24] with a carefully designed Hessian approximation, and by making this connection we are able to prove global convergence.

Theorem 1. Suppose $\mathcal{L}(\cdot)$ is convex (not necessarily strongly convex). If the quadratic subproblem (8) at each iteration is solved exactly, then Algorithm 1 converges to an optimal solution.

The proof is in Appendix 7.1. Next we consider the case where $\mathcal{L}(\bar\theta)$ is strongly convex. Note that even when $\mathcal{L}(\bar\theta)$ is strongly convex with respect to $\bar\theta$, $\mathcal{L}(\sum_{r=1}^k \theta^{(r)})$ will not be strongly convex in $\theta$ (if $k > 1$), and there may exist more than one optimal solution. However, we show that all solutions give the same $\bar\theta := \sum_{r=1}^k \theta^{(r)}$.

Lemma 2. Assume $\mathcal{L}(\cdot)$ is strongly convex. If $\{x^{(r)}\}_{r=1}^k$ and $\{y^{(r)}\}_{r=1}^k$ are two optimal solutions of (1), then $\sum_{r=1}^k x^{(r)} = \sum_{r=1}^k y^{(r)}$.

The proof is in Appendix 7.2. Next, we show that $S^{(r)}_{\mathrm{free}}$ (from Definition 1) converges to the final support $\bar{T}^{(r)}$ for each parameter set $r = 1, \ldots, k$.
Let $\bar\theta^\star$ be the global minimizer (which is unique, as shown in Lemma 2), and assume that

$$\big\|\Pi_{(\bar{T}^{(r)})^\perp}\big(\nabla\mathcal{L}(\bar\theta^\star)\big)\big\|^*_{A_r} < \lambda_r \qquad \forall r = 1, \ldots, k. \qquad (16)$$

This generalizes the assumption used in earlier literature [12], where only $\ell_1$ regularization was considered; the condition is similar to strict complementarity in linear programming.

Theorem 2. If $\mathcal{L}(\cdot)$ is strongly convex and assumption (16) holds, then there exists a finite $T > 0$ such that $S^{(r)}_{\mathrm{free}} = \bar{T}^{(r)}$ for all $r = 1, \ldots, k$ after $t > T$ iterations.

The proof is in Appendix 7.3. Next we show that our algorithm has an asymptotic quadratic convergence rate (the proof is in Appendix 7.4).

Theorem 3. Assume that $\nabla^2\mathcal{L}(\cdot)$ is Lipschitz continuous and assumption (16) holds. If at each iteration the quadratic subproblem (8) is solved exactly, and $\mathcal{L}(\cdot)$ is strongly convex, then our algorithm converges with an asymptotic quadratic convergence rate.

5 Applications

We demonstrate that our algorithm is extremely efficient on two applications: Gaussian Markov Random Fields (GMRF) with latent variables (sparse + low-rank structure) and multi-task learning problems (sparse + group sparse structure).

5.1 GMRF with Latent Variables

We first apply our algorithm to the latent feature GMRF structure learning problem in (4), where $S \in \mathbb{R}^{p\times p}$ is the sparse part, $L \in \mathbb{R}^{p\times p}$ is the low-rank part, and we require $L = L^T \succeq 0$, $S = S^T$, and $Y = S - L \succ 0$ (i.e. $\theta^{(2)} = -L$).
In this case, $\mathcal{L}(Y) = -\log\det(Y) + \langle \Sigma, Y\rangle$, hence

$$\nabla^2\mathcal{L}(Y) = Y^{-1} \otimes Y^{-1}, \qquad \nabla\mathcal{L}(Y) = \Sigma - Y^{-1}. \qquad (17)$$

Active Subspace. For the sparse part, the free subspace is the subset of indices $\{(i,j) \mid S_{ij} \ne 0 \ \text{or} \ |\nabla_{ij}\mathcal{L}(Y)| \ge \lambda\}$. For the low-rank part, the free subspace can be represented as $\{U_A M V_A^T \mid M \in \mathbb{R}^{k\times k}\}$, where $U_A$ and $V_A$ are defined in (14).

Updating $\Delta_L$. To solve the quadratic subproblem (11), we first discuss how to update $\Delta_L$ using subspace selection. The subproblem is

$$\min_{\Delta_L = U\Delta_D U^T:\; L+\Delta_L \succeq 0} \ \frac{1}{2}\mathrm{trace}(\Delta_L Y^{-1}\Delta_L Y^{-1}) + \mathrm{trace}\big((Y^{-1} - \Sigma - Y^{-1}\Delta_S Y^{-1})\Delta_L\big) + \lambda_L\|L + \Delta_L\|_*,$$

and since $\Delta_L$ is constrained to be a perturbation of $L = U_A M U_A^T$, we can write $\Delta_L = U_A \Delta_M U_A^T$, so that the subproblem becomes

$$\min_{\Delta_M:\; M+\Delta_M \succeq 0} \ \frac{1}{2}\mathrm{trace}(\bar{Y}\Delta_M \bar{Y}\Delta_M) + \mathrm{trace}(\bar{\Sigma}\Delta_M) + \lambda_L\,\mathrm{trace}(M + \Delta_M) := q(\Delta_M), \qquad (18)$$

where $\bar{Y} := U_A^T Y^{-1} U_A$ and $\bar{\Sigma} := U_A^T(Y^{-1} - \Sigma - Y^{-1}\Delta_S Y^{-1})U_A$. Therefore the subproblem (18) becomes a $k\times k$ dimensional problem with $k \ll p$.

To solve (18), we first check whether a closed-form solution exists. Note that $\nabla q(\Delta_M) = \bar{Y}\Delta_M\bar{Y} + \bar{\Sigma} + \lambda_L I$, so the minimizer is $\Delta_M = -\bar{Y}^{-1}(\bar{\Sigma} + \lambda_L I)\bar{Y}^{-1}$ if $M + \Delta_M \succeq 0$. If not, we solve the subproblem by the projected gradient descent method, where each step requires only $O(k^2)$ time.

Updating $\Delta_S$.
The subproblem with respect to $\Delta_S$ can be written as

$$\min_{\Delta_S} \ \frac{1}{2}\mathrm{vec}(\Delta_S)^T (Y^{-1}\otimes Y^{-1})\,\mathrm{vec}(\Delta_S) + \mathrm{trace}\big((\Sigma - Y^{-1} - Y^{-1}\Delta_L Y^{-1})\Delta_S\big) + \lambda_S\|S + \Delta_S\|_1.$$

In our implementation we apply the same coordinate descent procedure proposed in QUIC [12] to solve this subproblem.

Results. We compare our algorithm with two state-of-the-art software packages. The LogdetPPA algorithm was proposed in [26] and used in [5] to solve (4); the PGALM algorithm was proposed in [17]. We run our algorithm on three gene expression datasets: the ER dataset ($p = 692$), the Leukemia dataset ($p = 1255$), and a subset of the Rosetta dataset ($p = 2000$).¹ For the parameters, we use $\lambda_S = 0.5$, $\lambda_L = 50$ for the ER and Leukemia datasets, which give low-rank and sparse results. For the Rosetta dataset, we use the parameters suggested in LogdetPPA, $\lambda_S = 0.0313$, $\lambda_L = 0.1565$. The results in Figure 1 show that our algorithm is more than 10 times faster than the other algorithms. Note that in the beginning PGALM tends to produce infeasible solutions ($L$ or $S - L$ is not positive definite), which are not plotted in the figures.

Our proximal Newton framework has two algorithmic components: the quadratic approximation and our active subspace selection. From Figure 1 we observe that although our algorithm is a Newton-like method, the time cost per iteration is similar to, or even cheaper than, that of the first-order methods. The reasons are that (1) we benefit from active selection, and (2) the problem has a special Hessian structure (17), so that computing the Hessian is no more expensive than computing the gradient. To delineate the contribution of the quadratic approximation to the gain in convergence speed, we further compare our algorithm to an alternating minimization approach for solving (4), together with our active subspace selection.
Such an alternating minimization approach would iteratively fix one of S, L and update the other; we defer detailed algorithmic and implementation details to Appendix 7.6 for reasons of space. The results show that by using the quadratic approximation we get a much faster convergence rate (see Figure 2 in Appendix 7.6).

¹The full dataset has p = 6316, but the other methods cannot solve a problem of this size.

Figure 1: Comparison of algorithms on the latent feature GMRF problem using gene expression datasets: (a) ER dataset, (b) Leukemia dataset, (c) Rosetta dataset. Our algorithm is much faster than PGALM and LogdetPPA.

Table 1: Comparisons on multi-task problems. For the dirty models, each cell shows test error / time to reach the given relative error; the single-regularizer baselines report one test error per training-set size.

| dataset | training size | relative error | QUIC & DIRTY   | proximal gradient | ADMM           | Lasso  | Group Lasso |
|---------|---------------|----------------|----------------|-------------------|----------------|--------|-------------|
| USPS    | 100           | 10⁻¹           | 8.3% / 0.42s   | 8.5% / 1.8s       | 8.3% / 1.3s    | 8.36%  | 10.27%      |
| USPS    | 100           | 10⁻⁴           | 7.47% / 0.75s  | 7.49% / 10.8s     | 7.47% / 4.5s   |        |             |
| USPS    | 400           | 10⁻¹           | 2.92% / 1.01s  | 2.9% / 9.4s       | 3.0% / 3.6s    | 4.87%  | 2.93%       |
| USPS    | 400           | 10⁻⁴           | 2.5% / 1.55s   | 2.5% / 35.8s      | 2.5% / 11.0s   |        |             |
| RCV1    | 1000          | 10⁻¹           | 18.91% / 10.5s | 18.5% / 47s       | 18.9% / 23.8s  | 22.67% | 20.8%       |
| RCV1    | 1000          | 10⁻⁴           | 18.45% / 23.1s | 18.49% / 430.8s   | 18.5% / 259s   |        |             |
| RCV1    | 5000          | 10⁻¹           | 10.54% / 42s   | 10.8% / 541s      | 10.6% / 281s   | 13.67% | 12.25%      |
| RCV1    | 5000          | 10⁻⁴           | 10.27% / 87s   | 10.27% / 2254s    | 10.27% / 1191s |        |             |

5.2 Multiple-task learning with superposition-structured regularizers
Next we solve the multi-task learning problem (5), where the parameters are a sparse matrix S ∈ ℝ^{d×k} and a group-sparse matrix B ∈ ℝ^{d×k}. Instead of using the squared loss (as in [15]), we consider the logistic loss ℓ_logistic(y, a) = log(1 + e^{−ya}), which gives better performance, as can be seen by comparing Table 1 to the results in [15].
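For intuition about why second-order steps stay affordable with this loss, here is a small sketch of the logistic gradient and of a Hessian-vector product that exploits the diagonal curvature matrix D without ever forming the Hessian. The conventions are our own assumptions: samples are rows of X ∈ ℝ^{n×d}, so the Hessian reads XᵀDX (storing samples as columns gives the equivalent XDXᵀ form).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_grad(w, X, y):
    """Gradient of sum_i log(1 + exp(-y_i x_i^T w)), with labels y_i in {-1, +1}."""
    m = y * (X @ w)                    # margins y_i x_i^T w
    return X.T @ (-y * sigmoid(-m))

def logistic_hess_vec(w, X, y, v):
    """Hessian-vector product (X^T D X) v with D_ii = sigmoid(m_i) sigmoid(-m_i).
    Because D is diagonal, the product costs two matrix-vector multiplies;
    the d x d Hessian is never materialized."""
    m = y * (X @ w)
    d = sigmoid(m) * sigmoid(-m)       # diagonal of D
    return X.T @ (d * (X @ v))
```

This is why, for such losses, one Newton-style iteration can be priced comparably to a gradient step.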
Here the Hessian matrix again has a special structure: H = XDXᵀ, where X is the data matrix and D is a diagonal matrix; Appendix 7.7 gives a detailed description of how to apply our algorithm to this problem.
Results. We follow [15] and transform multi-class problems into multi-task problems. For a multiclass dataset with k classes and n samples, for each r = 1, . . . , k we generate y^r ∈ {0, 1}^n to be the vector such that y^r_i = 1 if and only if the i-th sample is in class r. Our first dataset is the USPS dataset, which was first collected in [25] and subsequently widely used in multi-task papers. On this dataset, the use of several regularizers is crucial for good performance. For example, [15] demonstrates that on USPS, using lasso and group lasso regularization together outperforms models with a single regularizer. However, they only consider the squared loss in their paper, whereas we consider the logistic loss, which leads to better performance: we get a 7.47% error rate using 100 samples on the USPS dataset, while with the squared loss the error rate is 10.8% [15]. Our second dataset is a larger document dataset, RCV1, downloaded from LIBSVM Data, which has 53 classes and 47,236 features. We show that our algorithm is much faster than other algorithms on both datasets, especially on RCV1, where we are more than 20 times faster than proximal gradient descent. Here our subspace selection technique works well because we expect the active subspace at the true solution to be small.

6 Acknowledgements

This research was supported by NSF grants CCF-1320746 and CCF-1117055. C.-J.H. also acknowledges support from an IBM PhD fellowship. P.R. acknowledges the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1447574, and DMS-1264033. S.R.B.
was supported by an IBM Research Goldstine Postdoctoral Fellowship while the work was performed.

References

[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. Annals of Statistics, 40(2):1171–1197, 2012.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[3] E. Candes and B. Recht. Simple bounds for recovering low-complexity models. Mathematical Programming, 2012.
[4] E. J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. Assoc. Comput. Mach., 58(3):1–37, 2011.
[5] V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky. Latent variable graphical model selection via convex optimization. The Annals of Statistics, 2012.
[6] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim., 21(2):572–596, 2011.
[7] Y. Chen, A. Jalali, S. Sanghavi, and C. Caramanis. Low-rank matrix recovery from errors and erasures. IEEE Transactions on Information Theory, 59(7):4324–4337, 2013.
[8] M. Collins, S. Dasgupta, and R. E. Schapire. A generalization of principal component analysis to the exponential family. In NIPS, 2001.
[9] Q. T. Dinh, A. Kyrillidis, and V. Cevher. An inexact proximal path-following algorithm for constrained convex minimization. arXiv:1311.1756, 2013.
[10] C.-J. Hsieh, I. S. Dhillon, P. Ravikumar, and A. Banerjee.
A divide-and-conquer method for sparse inverse covariance estimation. In NIPS, 2012.
[11] C.-J. Hsieh and P. A. Olsen. Nuclear norm minimization via active subspace selection. In ICML, 2014.
[12] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. In NIPS, 2011.
[13] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, P. Ravikumar, and R. A. Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. In NIPS, 2013.
[14] D. Hsu, S. M. Kakade, and T. Zhang. Robust matrix decomposition with sparse corruptions. IEEE Trans. Inform. Theory, 57:7221–7234, 2011.
[15] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. In NIPS, 2010.
[16] J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-type methods for convex optimization. In NIPS, 2012.
[17] S. Ma, L. Xue, and H. Zou. Alternating direction methods for latent variable Gaussian graphical model selection. Neural Computation, 25(8):2172–2198, 2013.
[18] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
[19] P. Olsen, F. Oztoprak, J. Nocedal, and S. Rennie. Newton-like methods for sparse inverse covariance estimation. In NIPS, 2012.
[20] Z. Qin, K. Scheinberg, and D. Goldfarb. Efficient block-coordinate descent algorithm for the group lasso. Mathematical Programming Computation, 2013.
[21] K. Scheinberg and X. Tang. Practical inexact proximal quasi-Newton method with global complexity analysis. arXiv:1311.6547, 2014.
[22] A. Tewari, P. Ravikumar, and I. Dhillon. Greedy algorithms for structurally constrained high dimensional problems. In NIPS, 2011.
[23] K.-C. Toh, P. Tseng, and S. Yun.
A block coordinate gradient descent method for regularized convex separable optimization and covariance selection. Mathematical Programming, 129:331–355, 2011.
[24] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117:387–423, 2007.
[25] M. van Breukelen, R. P. W. Duin, D. M. J. Tax, and J. E. den Hartog. Handwritten digit recognition by combined classifiers. Kybernetika, 34(4):381–386, 1998.
[26] C. Wang, D. Sun, and K.-C. Toh. Solving log-determinant optimization problems by a Newton-CG primal proximal point algorithm. SIAM J. Optimization, 20:2994–3013, 2010.
[27] E. Yang and P. Ravikumar. Dirty statistical models. In NIPS, 2013.
[28] E.-H. Yen, C.-J. Hsieh, P. Ravikumar, and I. S. Dhillon. Constant nullspace strong convexity and fast convergence of proximal methods under high-dimensional settings. In NIPS, 2014.
[29] G.-X. Yuan, C.-H. Ho, and C.-J. Lin. An improved GLMNET for L1-regularized logistic regression. JMLR, 13:1999–2030, 2012.