{"title": "Split LBI: An Iterative Regularization Path with Structural Sparsity", "book": "Advances in Neural Information Processing Systems", "page_first": 3369, "page_last": 3377, "abstract": "An iterative regularization path with structural sparsity is proposed in this paper based on variable splitting and the Linearized Bregman Iteration, hence called \\emph{Split LBI}. Despite its simplicity, Split LBI outperforms the popular generalized Lasso in both theory and experiments. A theory of path consistency is presented: equipped with a proper early stopping, Split LBI may achieve model selection consistency under a family of Irrepresentable Conditions which can be weaker than the necessary and sufficient condition for generalized Lasso. Furthermore, some $\\ell_2$ error bounds are also given at the minimax optimal rates. The utility and benefit of the algorithm are illustrated by applications on both traditional image denoising and a novel example on partial order ranking.", "full_text": "Split LBI: An Iterative Regularization Path with Structural Sparsity

Chendi Huang1, Xinwei Sun1, Jiechao Xiong1, Yuan Yao2,1
1Peking University, 2Hong Kong University of Science and Technology
{cdhuang, sxwxiaoxiaohehe, xiongjiechao}@pku.edu.cn, yuany@ust.hk

Abstract

An iterative regularization path with structural sparsity is proposed in this paper based on variable splitting and the Linearized Bregman Iteration, hence called Split LBI. Despite its simplicity, Split LBI outperforms the popular generalized Lasso in both theory and experiments. A theory of path consistency is presented: equipped with a proper early stopping, Split LBI may achieve model selection consistency under a family of Irrepresentable Conditions which can be weaker than the necessary and sufficient condition for generalized Lasso. Furthermore, some ℓ2 error bounds are also given at the minimax optimal rates. 
The utility and benefit of the algorithm are illustrated by applications on both traditional image denoising and a novel example on partial order ranking.

1 Introduction

In this paper, we consider the recovery from linear noisy measurements of β* ∈ R^p which satisfies the following structural sparsity: the linear transformation γ* := Dβ* for some D ∈ R^{m×p} has most of its elements being zero. For a design matrix X ∈ R^{n×p}, let

    y = Xβ* + ε,  γ* = Dβ*  (S = supp(γ*), s = |S|),                         (1.1)

where ε ∈ R^n has independent identically distributed components, each of which has a sub-Gaussian distribution with parameter σ² (E[exp(tε_i)] ≤ exp(σ²t²/2)). Here γ* is sparse, i.e. s ≪ m. Given (y, X, D), the purpose is to estimate β* as well as γ*, and in particular, to recover the support of γ*.

There is a large literature on this problem. Perhaps the most popular approach is the following ℓ1-penalized convex optimization problem,

    arg min_β ( (1/(2n)) ‖y − Xβ‖₂² + λ‖Dβ‖₁ ).                              (1.2)

Such a problem can be traced back at least to [ROF92] as a total variation regularization for image denoising in applied mathematics; in statistics it was formally proposed by [Tib+05] as the fused Lasso. When D = I it reduces to the well-known Lasso [Tib96], and since different choices of D include many special cases, it is often called the generalized Lasso [TT11] in statistics.

Various algorithms have been studied for solving (1.2) at fixed values of the tuning parameter λ, most of which are based on the Split Bregman method or ADMM using operator splitting ideas (see for example [GO09; YX11; Wah+12; RT14; Zhu15] and references therein). 
To avoid the difficulty of dealing with the structural sparsity in ‖Dβ‖₁, these algorithms exploit an augmented variable γ to enforce sparsity while keeping it close to Dβ.

On the other hand, regularization paths are crucial for model selection, by computing estimators as functions of regularization parameters. For example, [Efr+04] studies the regularization path of the standard Lasso with D = I, the algorithm in [Hoe10] computes the regularization path of the fused Lasso, and the dual path algorithm in [TT11] can deal with the generalized Lasso. Recently, [AT16] discussed various efficient implementations of the algorithm in [TT11], and the related R package genlasso can be found in the CRAN repository. All of these are based on homotopy methods for solving the convex optimization problem (1.2).

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Our departure here, instead of solving (1.2), is to look at an extremely simple yet novel iterative scheme which finds a new regularization path with structural sparsity. We are going to show that it works better than genlasso, in both theory and experiments. 
To see this, define a loss function which splits Dβ and γ,

    ℓ(β, γ) := (1/(2n)) ‖y − Xβ‖₂² + (1/(2ν)) ‖γ − Dβ‖₂²  (ν > 0).          (1.3)

Now consider the following iterative algorithm,

    β_{k+1} = β_k − κα ∇_β ℓ(β_k, γ_k),                                     (1.4a)
    z_{k+1} = z_k − α ∇_γ ℓ(β_k, γ_k),                                      (1.4b)
    γ_{k+1} = κ · prox_{‖·‖₁}(z_{k+1}),                                     (1.4c)

where the initial choice is z_0 = γ_0 = 0 ∈ R^m, β_0 = 0 ∈ R^p, the parameters satisfy κ > 0, α > 0, ν > 0, and the proximal map associated with a convex function h is defined by prox_h(z) = arg min_x (‖z − x‖²/2 + h(x)), which reduces to the shrinkage operator when h is taken to be the ℓ₁-norm, prox_{‖·‖₁}(z) = S(z, 1), where

    S(z, λ) = sign(z) · max(|z| − λ, 0)  (λ ≥ 0).

In fact, without the sparsity enforcement (1.4c), the algorithm is called the Landweber Iteration in inverse problems [YRC07], also known as L2-Boost [BY02] in statistics. When D = I and ν → 0, which enforces γ = Dβ = β, the iteration (1.4) reduces (by dropping (1.4a)) to the popular Linearized Bregman Iteration (LBI) for linear regression or compressed sensing, first proposed in [Yin+08]. This simple iterative scheme returns the whole regularization path at the same cost as computing one Lasso estimator at a fixed regularization parameter using the iterative soft-thresholding algorithm. However, the LBI regularization path can be better than the Lasso regularization path, which is always biased. 
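For concreteness, iteration (1.4) can be written in a few lines of NumPy. The sketch below is ours, not the authors' implementation; the step size α follows the rule α = ν/(κ(1 + νΛ_X² + Λ_D²)) stated later in Section 3.1, and the gradients are those of the quadratic loss (1.3):

```python
import numpy as np

def split_lbi(X, y, D, nu=1.0, kappa=100.0, n_steps=2000):
    """Sketch of the Split LBI iteration (1.4); returns the discrete path {(beta_k, gamma_k)}."""
    n, p = X.shape
    m = D.shape[0]
    # alpha = nu / (kappa * (1 + nu * Lambda_X^2 + Lambda_D^2)), cf. Section 3.1.
    Lx2 = np.linalg.norm(X, 2) ** 2 / n        # Lambda_X^2 = lambda_max(X* X), X* = X^T / n
    Ld2 = np.linalg.norm(D, 2) ** 2            # Lambda_D^2 = largest singular value of D, squared
    alpha = nu / (kappa * (1 + nu * Lx2 + Ld2))
    beta, z, gamma = np.zeros(p), np.zeros(m), np.zeros(m)
    path = []
    for _ in range(n_steps):
        grad_beta = X.T @ (X @ beta - y) / n + D.T @ (D @ beta - gamma) / nu  # grad_beta of (1.3)
        grad_gamma = (gamma - D @ beta) / nu                                   # grad_gamma of (1.3)
        beta = beta - kappa * alpha * grad_beta                                # (1.4a)
        z = z - alpha * grad_gamma                                             # (1.4b)
        gamma = kappa * np.sign(z) * np.maximum(np.abs(z) - 1.0, 0.0)          # (1.4c): kappa * shrinkage
        path.append((beta.copy(), gamma.copy()))
    return path
```

The whole path comes out of one run; the stopping time t = kα plays the role of 1/λ in (1.2).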
In fact, [Osh+16] recently shows that under nearly the same conditions as the standard Lasso, LBI may achieve sign consistency but with a less biased estimator than Lasso, whose limit dynamics reaches the bias-free oracle estimator.

The difference between (1.4) and the standard LBI lies in the partial sparsity control on γ, which splits the structural sparsity on Dβ into a sparse γ and Dβ by controlling their gap ‖γ − Dβ‖₂²/(2ν). Hence algorithm (1.4) is called Split LBI in this paper.

Split LBI generates a sequence (β_k, γ_k)_{k∈N} which indeed defines a discrete regularization path. Furthermore, the path can be more accurate than that of the generalized Lasso, in terms of the Area Under the Curve (AUC) measuring whether the order in which path coordinates become nonzero is consistent with the ground truth sparsity pattern. The following simple experiment illustrates these properties.

Example 1. Consider two problems: the standard Lasso and the 1-D fused Lasso. In both cases, set n = p = 50, and generate X ∈ R^{n×p} with n i.i.d. samples from N(0, I_p), ε ∼ N(0, I_n), y = Xβ* + ε, and β*_j = 2 (if 1 ≤ j ≤ 10), −2 (if 11 ≤ j ≤ 15), and 0 (otherwise). For Lasso we choose D = I, and for the 1-D fused Lasso we choose D = [D₁; D₂] ∈ R^{(p−1+p)×p} such that (D₁β)_j = β_j − β_{j+1} (for 1 ≤ j ≤ p − 1) and D₂ = I_p. The left panel of Figure 1 shows the regularization paths by genlasso ({Dβ_λ}) and by iteration (1.4) (linear interpolation of {γ_k}) with κ = 200 and ν ∈ {1, 5, 10}, respectively. The generalized Lasso path is in fact piecewise linear with respect to λ, while we show it along t = 1/λ for comparison. Note that the iterative paths exhibit a variety of different shapes depending on the choice of ν. 
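The structural-sparsity matrix D = [D₁; D₂] of Example 1 can be assembled directly; a small sketch (the function name is ours):

```python
import numpy as np

def fused_lasso_D(p):
    """D = [D1; D2] from Example 1: first differences (D1 beta)_j = beta_j - beta_{j+1}
    for 1 <= j <= p-1, stacked on the identity D2 = I_p, so D has shape (p-1+p, p)."""
    D1 = np.zeros((p - 1, p))
    for j in range(p - 1):
        D1[j, j], D1[j, j + 1] = 1.0, -1.0
    return np.vstack([D1, np.eye(p)])
```

With this D, ‖Dβ‖₁ penalizes both jumps of β (fusion) and β itself (sparsity), as in the 1-D fused Lasso.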
However, in terms of the order in which those curves enter the nonzero range, the iterative paths exhibit better accuracy than genlasso. Table 1 shows this by the mean AUC of 100 independent experiments in each case, where increasing ν improves the model selection accuracy of the Split LBI paths, beating that of the generalized Lasso.

Figure 1: Left shows {Dβ_λ} (t = 1/λ) by genlasso and {γ_k} (t = kα) by Split LBI (1.4) with ν = 1, 5, 10, for the 1-D fused Lasso. Right is a comparison between our family of Irrepresentable Conditions (IRR(ν)) and IC in [Vai+13], with a log-scale horizontal axis. As ν grows, IRR(ν) can be significantly smaller than IC0 and IC1, so that our model selection condition is easier to meet!

Table 1: Mean AUC (with standard deviation) comparisons where Split LBI (1.4) beats genlasso. Left is for the standard Lasso; right is for the 1-D fused Lasso in Example 1.

                       Standard Lasso                          1-D fused Lasso
            genlasso   Split LBI (ν=1, 5, 10)        genlasso   Split LBI (ν=1, 5, 10)
  mean AUC   .9426     .9845    .9969    .9982       .9705      .9955    .9996    .9998
  (s.d.)    (.0390)   (.0185)  (.0065)  (.0043)     (.0212)    (.0056)  (.0014)  (.0009)

Why does the simple iterative algorithm (1.4) work, even better than the generalized Lasso? In this paper, we aim to answer this by presenting a theory of model selection consistency for (1.4).

Model selection and estimation consistency of the generalized Lasso (1.2) have been studied in previous work. [SSR12] considered the model selection consistency of the edge Lasso, a special D in (1.2) with applications over graphs. [LYY13] provides an upper bound on the estimation error, assuming the design matrix X is a Gaussian random matrix. In particular, [Vai+13] proposes a general condition called the Identifiability Criterion (IC) for sign consistency. 
[LST13] establishes a general framework for model selection consistency of penalized M-estimators, proposing an Irrepresentable Condition which is equivalent to IC from [Vai+13] under the specific setting of (1.2). In fact, both of these conditions are sufficient and necessary for structural sparse recovery by the generalized Lasso (1.2) in a certain sense.

However, as we shall see soon, the benefits of exploiting algorithm (1.4) lie not only in its algorithmic simplicity: it also offers a possibility of theoretical improvement on model selection consistency. Below, a new family of Irrepresentable Conditions depending on ν will be presented for iteration (1.4), under which model selection consistency can be established. Moreover, this family can be weaker than IC as the parameter ν grows, which sheds light on the superb performance of Split LBI observed above. The main contributions of this paper can be summarized as follows: (A) a new iterative regularization path with structural sparsity by (1.4); (B) a theory of path consistency which shows the model selection consistency of (1.4), under conditions that can be weaker than those for the generalized Lasso, together with ℓ₂ error bounds at minimax optimal rates. Further experiments are given with applications on 2-D image reconstruction and partial order estimation.

1.1 Notation

For a matrix Q with m rows (D for example) and J ⊆ {1, 2, ..., m}, let Q_J = Q_{J,·} be the submatrix of Q with rows indexed by J. However, for Q ∈ R^{n×p} (X for example) and J ⊆ {1, 2, ..., p}, let Q_J = Q_{·,J} be the submatrix of Q with columns indexed by J, abusing the notation.

Sometimes we use ⟨a, b⟩ := a^T b to denote the inner product between vectors a and b. P_L denotes the projection matrix onto a linear subspace L. Let L₁ + L₂ := {ξ₁ + ξ₂ : ξ₁ ∈ L₁, ξ₂ ∈ L₂} for subspaces L₁, L₂. 
For a matrix Q, let Q† denote the Moore–Penrose pseudoinverse of Q, and recall that Q† = (Q^T Q)† Q^T. Let λ_min(Q), λ_max(Q) denote the smallest and largest singular values (i.e. eigenvalues if Q is symmetric) of Q. For symmetric matrices P and Q, Q ≻ P (or Q ⪰ P) means that Q − P is positive definite (positive semi-definite, respectively). Let Q* := Q^T/n.

2 Path Consistency of Split LBI

2.1 Basic Assumptions

For the identifiability of β*, we assume that β* and its estimators of interest are restricted to

    L := (ker(X) ∩ ker(D))^⊥ = Im(X^T) + Im(D^T),                           (2.1)

since replacing β* with "the projection of β* onto L" does not change the model.

Note that ℓ(β, γ) is quadratic, so we can define its Hessian matrix, which depends on ν > 0:

    H(ν) := ∇²ℓ(β, γ) ≡ [ X*X + D^T D/ν    −D^T/ν ]
                         [ −D/ν             I_m/ν  ].

We make the following assumptions on H.

Assumption 1 (Restricted Strong Convexity (RSC)). There is a constant λ_H > 0 such that

    (β^T, γ_S^T) · H_{(β,S),(β,S)} · (β; γ_S) ≥ λ_H ‖(β; γ_S)‖₂²  (β ∈ L, γ_S ∈ R^s).   (2.2)

Remark 1. Since the true parameter satisfies supp(γ*) = supp(Dβ*) = S, this is equivalent to saying that the loss ℓ(β, γ) is strongly convex when restricted to the sparse subspace corresponding to the support of γ*.

Assumption 2 (Irrepresentable Condition (IRR)). There is a constant η ∈ (0, 1] such that

    sup_{ρ ∈ [−1,1]^s} ‖H_{S^c,(β,S)} H†_{(β,S),(β,S)} (0_p; ρ)‖_∞ ≤ 1 − η.             (2.3)

Remark 2. IRR here directly generalizes the Irrepresentable Condition from the standard Lasso [ZY06] and other algorithms [Tro04] to the partial Lasso: min_{β,γ} (ℓ(β, γ) + λ‖γ‖₁). Following the standard Lasso, one version of the Irrepresentable Condition would be ‖H_{S^c,(β,S)} H†_{(β,S),(β,S)} ρ*_{(β,S)}‖_∞ ≤ 1 − η, where ρ*_{(β,S)} = (0_p; ρ*_S) is the value of the gradient (subgradient) of the ℓ₁ penalty function ‖·‖₁ at (β*; γ*_S). Here ρ*_β = 0_p, because β is not assumed to be sparse and hence is not penalized. Assumption 2 slightly strengthens this by taking a supremum over ρ, for uniform sparse recovery independent of the particular sign pattern of γ*.

2.2 Equivalent Conditions and a Comparison Theorem

The assumptions above, though natural, are not convenient to compare with those in [Vai+13]. Here we present some equivalent conditions, followed by a comparison theorem showing that IRR can be weaker than IC in [Vai+13], a necessary and sufficient condition for model selection consistency of the generalized Lasso.

First of all, we introduce some notation. Given γ, minimizing ℓ over β solves β = A†(νX*y + D^T γ), where A := νX*X + D^T D. 
Substituting A†(νX*y + D^T γ_k) for β_k in (1.4b), and dropping (1.4a), we have

    z_{k+1} = z_k + α (D A† X* y − Σ γ_k),                                  (2.4a)
    γ_{k+1} = κ · prox_{‖·‖₁}(z_{k+1}),                                     (2.4b)

where

    Σ := (I − D A† D^T)/ν,  A = νX*X + D^T D.                               (2.5)

In other words, Σ is the Schur complement of H_{β,β} in the Hessian matrix H(ν). Comparing (2.4) with the standard LBI (D = I) studied in [Osh+16], we see that Σ here plays a role similar to that of X*X there. To obtain their path consistency results for the standard LBI, [Osh+16] propose "Restricted Strong Convexity" and "Irrepresentable Condition" assumptions on X*X. So in this paper we can impose similar assumptions on Σ (instead of H), which actually prove to be equivalent to Assumptions 1 and 2, and are closely related to the literature.

Precisely, by Lemma 6 in the Supplementary Information, Assumption 1 is equivalent to

Assumption 1′ (Restricted Strong Convexity (RSC)). There is a constant λ_Σ > 0 such that

    Σ_{S,S} ⪰ λ_Σ I.                                                        (2.6)

Remark 3. Lemma 2 in the Supplementary Information says that Σ_{S,S} ≻ 0 ⇔ ker(D_{S^c}) ∩ ker(X) ⊆ ker(D_S), which is also a natural assumption for the uniqueness of β*. Actually, if it fails, then there is some β such that D_{S^c}β = 0 and Xβ = 0 while D_Sβ ≠ 0. Thus for β′* := β* + β, we have y = Xβ′* + ε and supp(Dβ′*) ⊆ supp(Dβ*) = S, while D_Sβ′* ≠ D_Sβ*. 
Therefore one can neither estimate β* nor D_Sβ*, even if the support set is known or has been exactly recovered.

When Σ_{S,S} ≻ 0, Lemma 7 in the Supplementary Information implies that Assumption 2 is equivalent to

Assumption 2′ (Irrepresentable Condition (IRR)). There is a constant η ∈ (0, 1] such that

    ‖Σ_{S^c,S} Σ_{S,S}^{−1}‖_∞ ≤ 1 − η.                                     (2.7)

Remark 4. For standard Lasso problems (D = I), it is easy to derive Σ = X*(I + νXX*)^{−1}X ≈ X*X when ν is small. So Assumption 1′ approximates the usual Restricted Strong Convexity assumption X*_S X_S ⪰ λ_Σ I, and Assumption 2′ approximates the usual Irrepresentable Condition ‖X*_{S^c} X_S (X*_S X_S)^{−1}‖_∞ ≤ 1 − η for standard Lasso problems.

The left hand side of (2.7) depends on the parameter ν. From now on, define

    IRR(ν) := ‖Σ_{S^c,S} Σ_{S,S}^{−1}‖_∞,  IRR(0) := lim_{ν→0} IRR(ν),  IRR(∞) := lim_{ν→+∞} IRR(ν).   (2.8)

Now we are going to compare Assumption 2′ with the assumption in [Vai+13]. Let W be a matrix whose columns form an orthogonal basis of ker(D_{S^c}), and define

    Ω_S := (D†_{S^c})^T (X*X W (W^T X*X W)† W^T − I) D_S^T,
    IC0 := ‖Ω_S‖_∞,  IC1 := min_{u ∈ ker(D^T_{S^c})} ‖Ω_S sign(D_S β*) − u‖_∞.

[Vai+13] proved the sign consistency of the generalized Lasso estimator of (1.2) for a specifically chosen λ, under the assumption IC1 < 1 along with ker(D_{S^c}) ∩ ker(X) = {0}. 
As we shall see later, the same conclusion holds under the assumption IRR(ν) ≤ 1 − η along with Assumption 1′, which is equivalent to ker(D_{S^c}) ∩ ker(X) ⊆ ker(D_S). Which assumption is easier to satisfy? The following theorem answers this; its proof is in the Supplementary Information.

Theorem 1 (Comparison between IRR in Assumption 2′ and IC in [Vai+13]).

1. IC0 ≥ IC1.
2. IRR(0) exists, and IRR(0) = IC0.
3. IRR(∞) exists, and IRR(∞) = 0 if and only if ker(X) ⊆ ker(D_S).

From this comparison theorem, with a design matrix X of full column rank, as ν grows, IRR(ν) < IC1 ≤ IC0; hence Assumption 2′ is weaker than IC. Now recall the setting of Example 1, where ker(X) = {0} generically. In the right panel of Figure 1, the (solid and dashed) horizontal red lines denote IC0 and IC1, and the blue curve denoting IRR(ν) approaches IC0 as ν → 0 and approaches 0 as ν → +∞, which illustrates Theorem 1 (here each of IC0, IC1, IRR(ν) is the mean of 100 values calculated under 100 generated X's). Although IRR(0) = IC0 is slightly larger than IC1, IRR(ν) can be significantly smaller than IC1 if ν is not tiny. On the right side of the vertical line, IRR(ν) drops below 1, indicating that Assumption 2′ is satisfied while the assumption in [Vai+13] fails.

Remark 5. Although Theorem 1 suggests adopting a large ν, ν cannot be arbitrarily large. From Assumption 1′ and the definition of Σ, 1/ν ≥ ‖Σ‖₂ ≥ ‖Σ_{S,S}‖₂ ≥ λ_Σ. So if ν is too large, λ_Σ has to be small, which will deteriorate the estimator in terms of the ℓ₂ error shown next.

2.3 Consistency of Split LBI

We are ready to establish the theorems on path consistency of Split LBI (1.4), under Assumptions 1 and 2. 
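As a numerical sanity check on the comparison above, Σ from (2.5) and IRR(ν) from (2.8) can be computed directly. The sketch below is ours; the random design, the support S, and the function name are illustrative, not the experiment behind Figure 1:

```python
import numpy as np

def irr(X, D, S, nu):
    """IRR(nu) = || Sigma_{S^c,S} Sigma_{S,S}^{-1} ||_inf, with Sigma from (2.5)."""
    n, m = X.shape[0], D.shape[0]
    A = nu * (X.T @ X / n) + D.T @ D            # A = nu X* X + D^T D, with X* = X^T / n
    Sigma = (np.eye(m) - D @ np.linalg.pinv(A) @ D.T) / nu
    Sc = [j for j in range(m) if j not in set(S)]
    M = Sigma[np.ix_(Sc, S)] @ np.linalg.inv(Sigma[np.ix_(S, S)])
    return np.abs(M).sum(axis=1).max()          # operator inf-norm = max absolute row sum
```

On a generic full-column-rank design, irr tends toward IC0 as ν → 0 and toward 0 as ν grows, matching Theorem 1.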
The proofs are based on a careful treatment of the limit dynamics of (1.4) and are collected in the Supplementary Information. Before stating the theorems, we need some definitions and constants.

Let the compact singular value decomposition (compact SVD) of D be

    D = U Λ V^T  (Λ ∈ R^{r×r}, Λ ≻ 0, U ∈ R^{m×r}, V ∈ R^{p×r}),            (2.9)

and let (V, Ṽ) be an orthogonal square matrix. Let the compact SVD of XṼ/√n be

    XṼ/√n = U₁ Λ₁ V₁^T  (Λ₁ ∈ R^{r′×r′}, Λ₁ ≻ 0, U₁ ∈ R^{n×r′}, V₁ ∈ R^{(p−r)×r′}),   (2.10)

and let (V₁, Ṽ₁) be an orthogonal square matrix. Let

    Λ_X = √(λ_max(X*X)),  λ_D = λ_min(Λ),  Λ_D = λ_max(Λ),  λ₁ = λ_min(Λ₁).            (2.11)

We see that Λ_D is the largest singular value of D, λ_D is the smallest nonzero singular value of D, and λ₁² is the smallest nonzero eigenvalue of Ṽ^T X*X Ṽ. If D has full column rank, then r = p, r′ = 0, and Ṽ, U₁, Λ₁, V₁, λ₁ all drop, while Ṽ₁ ∈ R^{(p−r)×(p−r)} is an orthogonal square matrix.

The following theorem says that under Assumptions 1 and 2, Split LBI will automatically evolve within an "oracle" subspace (unknown to us) restricted to the support set of (β*, γ*) before leaving it, and if the signal parameters are strong enough, sign consistency will be reached. Moreover, ℓ₂ error bounds on γ_k and β_k are given.

Theorem 2 (Consistency of Split LBI). Under Assumptions 1 and 2, suppose κ is large enough to satisfy

    κ ≥ (4/η) (1 + 1/λ_D + Λ_X/(λ₁λ_D)) (1 + √(2(1 + νΛ_X² + Λ_D²)/(λ_H ν))) ((1 + Λ_D)‖β*‖₂ + (2σ/λ_H)(Λ_X/λ_D² + (λ_H λ_D² + Λ_X²)/(λ₁λ_D²)) √(log m/n)),   (2.12)

and κα‖H‖₂ < 2. Let

    τ̄ := (η/(8σ)) · (λ_D/Λ_X) · √(n/log m),  K := ⌊τ̄/α⌋,  λ′_H := λ_H (1 − κα‖H‖₂/2) > 0.

Then with probability not less than 1 − 6/m − 3 exp(−4n/5), we have all the following properties.

1. No false positives: the solution has no false positives, i.e. supp(γ_k) ⊆ S, for 0 ≤ kα ≤ τ̄.

2. Sign consistency of γ_k: once the signal is strong enough such that

    γ*_min := (D_S β*)_min ≥ (16σ/(ηλ′_H (1 − 5α/τ̄))) · (Λ_X Λ_D/λ_D²) · √((2 log s + 5 + log(8Λ_D))/n),   (2.13)

then γ_k has sign consistency at K, i.e. sign(γ_K) = sign(Dβ*).

3. ℓ₂ consistency of γ_k:

    ‖γ_K − Dβ*‖₂ ≤ (42σ/(ηλ′_H (1 − α/τ̄))) · (Λ_X/λ_D) · √(s log m/n).

4. ℓ₂ consistency of β_k:

    ‖β_K − β*‖₂ ≤ (42σ/(ηλ′_H (1 − α/τ̄))) · ((λ₁Λ_X(1 + λ_D) + Λ_X²)/(λ₁λ_D²)) · √(s log m/n) + (2σ/λ₁) · √(r′ log m/n) + ν · 2σ · (λ₁Λ_X + Λ_X²)/(λ₁λ_D²).

Although the sign consistency of γ_k can be established here, one usually cannot expect Dβ_k to recover the sparsity pattern of γ*, due to the variable splitting. As shown in the last term of the ℓ₂ error bound on β_k, increasing ν will sacrifice its accuracy. However, one can remedy this by projecting β_k onto a subspace using the support set of γ_k, obtaining a good estimator β̃_k with both sign consistency and ℓ₂ consistency at the minimax optimal rates.

Theorem 3 (Consistency of a revised version of Split LBI). Under Assumptions 1 and 2, suppose κ is large enough to satisfy (2.12), and κα‖H‖₂ < 2. τ̄, K, λ′_H are defined as in Theorem 2. Define

    S_k := supp(γ_k),  P_{S_k} := P_{ker(D_{S_k^c})} = I − D†_{S_k^c} D_{S_k^c},  β̃_k := P_{S_k} β_k.

If S_k^c = ∅, define P_{S_k} = I. Then we have the following properties.

1. Sign consistency of β̃_k: if the γ*_min condition (2.13) holds, then with probability not less than 1 − 8/m − 3 exp(−4n/5), there holds sign(Dβ̃_K) = sign(Dβ*).

2. ℓ₂ consistency of β̃_k: with probability not less than 1 − 8/m − 2r′/m² − 3 exp(−4n/5), we have that for 0 ≤ kα ≤ τ̄,

    ‖β̃_k − β*‖₂ ≤ 10√s/(λ′_H kα) + (2σ/λ′_H) · (Λ_X Λ_D/λ_D³) · √(s log m/n) + (2σ/λ′_H) · (Λ_X/λ_D² + (λ′_H λ_D² + Λ_X²)/(λ₁λ_D²)) · √(r′ log m/n) + (80σ/(ηλ′_H (1 − α/τ̄))) · Λ_X · ‖D†_{S_k^c} D_{S_k^c ∩ S} β*‖₂.

Consequently, if additionally S_K = S, then the last term on the right hand side drops for k = K, and it reaches

    ‖β̃_K − β*‖₂ ≤ (2σ/λ′_H) · (Λ_X (Λ_D + λ_D²)/λ_D³) · √(s log m/n) + (2σ/λ′_H) · (Λ_X/λ_D² + (λ′_H λ_D² + Λ_X²)/(λ₁λ_D²)) · √(r′ log m/n).

Remark 6. Note that r′ ≤ min(n, p − r). In many real applications, r′ is very small. So the dominant ℓ₂ error rate is O(√(s log m/n)), which is minimax optimal [LST13; LYY13].

3 Experiments

3.1 Parameter Setting

The parameter κ should be large enough according to (2.12). Moreover, the step size α should be small enough to ensure the stability of Split LBI. When ν and κ are determined, α can actually be determined by α = ν/(κ(1 + νΛ_X² + Λ_D²)) (see (C.6) in the Supplementary Information).

3.2 Application: Image Denoising

Consider the image denoising problem in [TT11]. The original image is resized to 50 × 50 and reset with only four colors, as in the top left image in Figure 2. Some noise is added by randomly changing some pixels to white, as in the bottom left. 
Let G = (V, E) be the 4-nearest-neighbor grid graph on pixels; then β = (β_R, β_G, β_B) ∈ R^{3|V|}, since there are 3 color channels (RGB). X = I_{3|V|} and D = diag(D_G, D_G, D_G), where D_G ∈ R^{|E|×|V|} is the gradient operator on the graph G, defined by (D_G x)(e_{ij}) = x_i − x_j for e_{ij} ∈ E. Set ν = 180, κ = 100. The regularization path of Split LBI is shown in Figure 2: as t evolves, images on the path gradually select visually salient features before picking up the random noise. Now compare the AUC (Area Under the Curve) of genlasso and the Split LBI algorithm with different ν. For simplicity we show the AUC corresponding to the red color channel. Here ν ∈ {1, 20, 40, 60, ..., 300}. As shown in the right panel of Figure 2, with the increase of ν, Split LBI beats genlasso with higher AUC values.

Figure 2: Left is image denoising results by Split LBI. Right shows that the AUC of Split LBI (blue solid line) increases and exceeds that of genlasso (dashed red line) as ν increases.

3.3 Application: Partial Order Ranking for Basketball Teams

Here we consider a new application on the ranking of p = 12 FIBA basketball teams into partial orders. The teams are listed in Figure 3. We collected n = 134 pairwise comparison game results, mainly from various important championships such as the Olympic Games, the FIBA World Championship, and the FIBA Basketball Championships in 5 continents, from 2006–2014 (8 years is not too long for teams to keep relatively stable levels, while not too short to have enough samples). For each sample indexed by k with corresponding team pair (i_k, j_k), y_k = s_{i_k} − s_{j_k} is the score difference between teams i_k and j_k. We assume a model y_k = β*_{i_k} − β*_{j_k} + ε_k, where β* ∈ R^p measures the strength of these teams. So the design matrix X ∈ R^{n×p} is defined by its k-th row: x_{k,i_k} = 1, x_{k,j_k} = −1, x_{k,l} = 0 (l ≠ i_k, j_k). In sports, teams of similar strength meet more often than those at different levels. Thus we hope to find a coarse-grained partial order ranking by adding structural sparsity on Dβ*, where D = cX (c scales the smallest nonzero singular value of D to be 1).

Figure 3: Partial order ranking for basketball teams. Top left: {β_λ} (t = 1/λ) by genlasso and β̃_k (t = kα) by Split LBI. Top right: grouping result just passing t₅. Bottom: FIBA ranking.

The top left panel of Figure 3 shows {β_λ} by genlasso and β̃_k by Split LBI with ν = 1 and κ = 100. Both paths give the same partial order at early stages, though the Split LBI path looks qualitatively better. For example, the top right panel shows the same partial order after the change point t₅. It is interesting to compare it against the FIBA ranking of September 2014, shown at the bottom. Note that the average basketball level in Europe is higher than that in Asia and Africa, hence China can get more FIBA points than Germany based on its dominant position in Asia, as can Angola in Africa. But their true levels might be lower than Germany's, as indicated in our results. Moreover, America (FIBA points 1040.0) itself forms a group, agreeing with the common sense that it is much better than any other country. Spain, having much higher FIBA ranking points (705.0) than the 3rd team Argentina (455.0), also forms a group alone. 
Indeed, Spain is the only team able to challenge America in recent years, entering both finals against America in 2008 and 2012.

Acknowledgments

The authors were supported in part by the National Basic Research Program of China under grants 2012CB825501 and 2015CB856000, as well as NSFC grants 61071157 and 11421110001.

References

[AT16] Taylor B. Arnold and Ryan J. Tibshirani. "Efficient Implementations of the Generalized Lasso Dual Path Algorithm". In: Journal of Computational and Graphical Statistics 25.1 (2016), pp. 1–27.

[BY02] Peter Bühlmann and Bin Yu. "Boosting with the L2-Loss: Regression and Classification". In: Journal of the American Statistical Association 98 (2002), pp. 324–340.

[Efr+04] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. "Least angle regression". In: The Annals of Statistics 32.2 (2004), pp. 407–499.

[GO09] Tom Goldstein and Stanley Osher. "The Split Bregman Method for L1-Regularized Problems". In: SIAM Journal on Imaging Sciences 2.2 (2009), pp. 323–343.

[Hoe10] Holger Hoefling. "A Path Algorithm for the Fused Lasso Signal Approximator". In: Journal of Computational and Graphical Statistics 19.4 (2010), pp. 984–1006.

[LST13] Jason D. Lee, Yuekai Sun, and Jonathan E. Taylor. "On model selection consistency of penalized M-estimators: a geometric theory". In: Advances in Neural Information Processing Systems (NIPS) 26. 2013, pp. 342–350.

[LYY13] Ji Liu, Lei Yuan, and Jieping Ye. "Guaranteed Sparse Recovery under Linear Transformation". In: Proceedings of The 30th International Conference on Machine Learning. 2013, pp. 91–99.

[Moe12] Michael Moeller. "Multiscale Methods for Polyhedral Regularizations and Applications in High Dimensional Imaging". PhD thesis. Germany: University of Muenster, 2012.

[Osh+16] Stanley Osher, Feng Ruan, Jiechao Xiong, Yuan Yao, and Wotao Yin. "Sparse recovery via differential inclusions". In: Applied and Computational Harmonic Analysis (2016). DOI: 10.1016/j.acha.2016.01.002.

[ROF92] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. "Nonlinear Total Variation Based Noise Removal Algorithms". In: Physica D: Nonlinear Phenomena 60.1-4 (Nov. 1992), pp. 259–268.

[RT14] Aaditya Ramdas and Ryan J. Tibshirani. "Fast and Flexible ADMM Algorithms for Trend Filtering". In: Journal of Computational and Graphical Statistics (2014). DOI: 10.1080/10618600.2015.1054033.

[SSR12] James Sharpnack, Aarti Singh, and Alessandro Rinaldo. "Sparsistency of the edge lasso over graphs". In: International Conference on Artificial Intelligence and Statistics. 2012, pp. 1028–1036.

[Tib+05] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. "Sparsity and smoothness via the fused lasso". In: Journal of the Royal Statistical Society Series B (2005), pp. 91–108.

[Tib96] Robert Tibshirani. "Regression shrinkage and selection via the lasso". In: Journal of the Royal Statistical Society. Series B (Methodological) (1996), pp. 267–288.

[Tro04] Joel A. Tropp. "Greed is good: Algorithmic results for sparse approximation". In: IEEE Trans. Inform. Theory 50.10 (2004), pp. 2231–2242.

[TT11] Ryan J. Tibshirani and Jonathan Taylor. "The solution path of the generalized lasso". In: The Annals of Statistics 39.3 (June 2011), pp. 1335–1371.

[Vai+13] S. Vaiter, G. Peyre, C. Dossal, and J. Fadili. "Robust Sparse Analysis Regularization". In: IEEE Transactions on Information Theory 59.4 (Apr. 2013), pp. 2001–2016.

[Wah+12] Bo Wahlberg, Stephen Boyd, Mariette Annergren, and Yang Wang. "An ADMM Algorithm for a Class of Total Variation Regularized Estimation Problems". In: IFAC Proceedings Volumes. 16th IFAC Symposium on System Identification 45.16 (2012), pp. 83–88.

[Yin+08] Wotao Yin, Stanley Osher, Jerome Darbon, and Donald Goldfarb. "Bregman Iterative Algorithms for Compressed Sensing and Related Problems". In: SIAM Journal on Imaging Sciences 1.1 (2008), pp. 143–168.

[YRC07] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. "On Early Stopping in Gradient Descent Learning". In: Constructive Approximation 26.2 (2007), pp. 289–315.

[YX11] Gui-Bo Ye and Xiaohui Xie. "Split Bregman method for large scale fused Lasso". In: Computational Statistics & Data Analysis 55.4 (2011), pp. 1552–1569.

[Zha06] Fuzhen Zhang. The Schur Complement and Its Applications. Springer Science & Business Media, 2006. 308 pp. ISBN: 978-0-387-24273-6.

[Zhu15] Yunzhang Zhu. "An augmented ADMM algorithm with application to the generalized lasso problem". In: Journal of Computational and Graphical Statistics (2015). DOI: 10.1080/10618600.2015.1114491.

[ZY06] Peng Zhao and Bin Yu. "On Model Selection Consistency of Lasso". In: Journal of Machine Learning Research 7 (2006), pp. 2541–2567.