{"title": "Projection onto A Nonnegative Max-Heap", "book": "Advances in Neural Information Processing Systems", "page_first": 487, "page_last": 495, "abstract": "We consider the problem of computing the Euclidean projection of a vector of length $p$ onto a non-negative max-heap---an ordered tree where the values of the nodes are all nonnegative and the value of any parent node is no less than the value(s) of its child node(s). This Euclidean projection plays a building block role in the optimization problem with a non-negative max-heap constraint. Such a constraint is desirable when the features follow an ordered tree structure, that is, a given feature is selected for the given regression/classification task only if its parent node is selected. In this paper, we show that such Euclidean projection problem admits an analytical solution and we develop a top-down algorithm where the key operation is to find the so-called \\emph{maximal root-tree} of the subtree rooted at each node. A naive approach for finding the maximal root-tree is to enumerate all the possible root-trees, which, however, does not scale well. We reveal several important properties of the maximal root-tree, based on which we design a bottom-up algorithm with merge for efficiently finding the maximal root-tree. The proposed algorithm has a (worst-case) linear time complexity for a sequential list, and $O(p^2)$ for a general tree. We report simulation results showing the effectiveness of the max-heap for regression with an ordered tree structure. 
Empirical results show that the proposed algorithm has an expected linear time complexity for many special cases including a sequential list, a full binary tree, and a tree with depth 1.", "full_text": "Projection onto A Nonnegative Max-Heap\n\nJun Liu\n\nLiang Sun\n\nJieping Ye\n\nArizona State University\nTempe, AZ 85287, USA\n\nArizona State University\nTempe, AZ 85287, USA\n\nArizona State University\nTempe, AZ 85287, USA\n\nj.liu@asu.edu\n\nsun.liang@asu.edu\n\njieping.ye@asu.edu\n\nAbstract\n\nWe consider the problem of computing the Euclidean projection of a vector\nof length p onto a non-negative max-heap\u2014an ordered tree where the val-\nues of the nodes are all nonnegative and the value of any parent node is no\nless than the value(s) of its child node(s). This Euclidean projection plays\na building block role in the optimization problem with a non-negative max-\nheap constraint. Such a constraint is desirable when the features follow\nan ordered tree structure, that is, a given feature is selected for the given\nregression/classi\ufb01cation task only if its parent node is selected. In this pa-\nper, we show that such Euclidean projection problem admits an analytical\nsolution and we develop a top-down algorithm where the key operation is\nto \ufb01nd the so-called maximal root-tree of the subtree rooted at each node.\nA naive approach for \ufb01nding the maximal root-tree is to enumerate all the\npossible root-trees, which, however, does not scale well. We reveal several\nimportant properties of the maximal root-tree, based on which we design a\nbottom-up algorithm with merge for e\ufb03ciently \ufb01nding the maximal root-\ntree. The proposed algorithm has a (worst-case) linear time complexity\nfor a sequential list, and O(p2) for a general tree. We report simulation\nresults showing the e\ufb00ectiveness of the max-heap for regression with an or-\ndered tree structure. 
Empirical results show that the proposed algorithm has an expected linear time complexity for many special cases including a sequential list, a full binary tree, and a tree with depth 1.

1 Introduction

In many regression/classification problems, the features exhibit certain hierarchical or structural relationships, the usage of which can yield an interpretable model with improved regression/classification performance [25]. Recently, there has been increasing interest in structured sparsity, with various approaches for incorporating structures; see [7, 8, 9, 17, 24, 25] and references therein. In this paper, we consider an ordered tree structure: a given feature is selected for the given regression/classification task only if its parent node is selected. To incorporate such an ordered tree structure, we assume that the model parameter x ∈ R^p follows the non-negative max-heap structure¹:

P = {x : x ≥ 0, x_i ≥ x_j, ∀(x_i, x_j) ∈ E^t},   (1)

where T^t = (V^t, E^t) is a target tree with V^t = {x_1, x_2, . . . , x_p} containing all the nodes and E^t all the edges. The constraint set P implies that if x_i is the parent node of a child node x_j, then the value of x_i is no less than the value of x_j. In other words, if a parent node x_i is 0, then any of its child nodes x_j is also 0. Figure 1 illustrates three special tree structures: 1) a full binary tree, 2) a sequential list, and 3) a tree with depth 1.

¹To deal with negative model parameters, one can make use of the technique employed in [24], which solves the scaled version of the least square estimate.

Figure 1: Illustration of a non-negative max-heap depicted in (1). 
Plots (a), (b), and (c) correspond to a full binary tree, a sequential list, and a tree with depth 1, respectively.

The set P defined in (1) induces the so-called "heredity principle" [3, 6, 18, 24], which has been proven effective for high-dimensional variable selection. In a recent study [12], Li et al. conducted a meta-analysis of 113 data sets from published factorial experiments and concluded that an overwhelming majority of these real studies conform to the heredity principle. The ordered tree structure is a special case of the non-negative garrote discussed in [24] when the hierarchical relationship is depicted by a tree. Therefore, the asymptotic properties established in [24] are applicable to the ordered tree structure. Several related approaches that can incorporate the ordered tree structure include the Wedge approach [17] and the hierarchical group Lasso [25]. The Wedge approach incorporates such ordering information by designing a penalty for the model parameter x as Ω(x|P) = inf_{t∈P} (1/2) Σ_{i=1}^p (x_i²/t_i + t_i), with the tree being a sequential list. By imposing the mixed ℓ1-ℓ2 norm on each group formed by the nodes in the subtree of a parent node, the hierarchical group Lasso is able to incorporate such ordering information. The hierarchical group Lasso has been applied to multi-task learning in [11] with a tree structure, and the efficient computation was discussed in [10, 15]. Compared to Wedge and hierarchical group Lasso, the max-heap in (1) incorporates such ordering information in a direct way, and our simulation results show that the max-heap can achieve lower reconstruction error than both approaches.

In estimating the model parameter satisfying the ordered tree structure, one needs to solve the following constrained optimization problem:

min_{x∈P} f(x)   (2)

for some convex function f(·). 
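As an aside, feasibility with respect to the set P in (1) is straightforward to check. The following is a minimal sketch of our own (not part of the paper's method), assuming a hypothetical parent-array encoding of the tree:

```python
def in_heap_set(x, parent, tol=0.0):
    """Feasibility test for the set P in Eq. (1): x >= 0 and
    x[parent[i]] >= x[i] for every non-root node i.

    `parent[i]` is the index of node i's parent; parent[root] == -1.
    This encoding is an illustrative assumption, not the paper's.
    """
    if any(xi < -tol for xi in x):
        return False
    return all(parent[i] < 0 or x[parent[i]] >= x[i] - tol
               for i in range(len(x)))

# A depth-1 tree rooted at node 0 (the shape of Figure 1(c), with 4 nodes):
parent = [-1, 0, 0, 0]
assert in_heap_set([3.0, 2.0, 1.0, 0.0], parent)      # feasible
assert not in_heap_set([1.0, 2.0, 0.0, 0.0], parent)  # a child exceeds the root
```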
The problem (2) can be solved via many approaches including subgradient descent, the cutting plane method, gradient descent, accelerated gradient descent, etc. [19, 20]. In applying these approaches, a key building block is the so-called Euclidean projection of a vector v onto the convex set P:

π_P(v) = argmin_{x∈P} (1/2)‖x − v‖²₂,   (3)

which ensures that the solution belongs to the constraint set P. For some special sets P (e.g., a hyperplane, a halfspace, and a rectangle), the Euclidean projection admits a simple analytical solution; see [2]. In the literature, researchers have developed efficient Euclidean projection algorithms for the ℓ1-ball [5, 14], the ℓ1/ℓ2-ball [1], and polyhedra [4, 22]. When P is induced by a sequential list, a linear time algorithm was recently proposed in [26]. Without the non-negative constraints, problem (3) is the so-called isotonic regression problem [16, 21].

Our major technical contribution in this paper is the efficient computation of (3) for the set P defined in (1). In Section 2, we show that the Euclidean projection admits an analytical solution, and we develop a top-down algorithm where the key operation is to find the so-called maximal root-tree of the subtree rooted at each node. In Section 3, we design a bottom-up algorithm with merge for efficiently finding the maximal root-tree by using its properties. We provide empirical results for the proposed algorithm in Section 4, and conclude this paper in Section 5.

2 Atda: A Top-Down Algorithm

In this section, we develop an algorithm in a top-down manner, called Atda, for solving (3). With the target tree T^t = (V^t, E^t), we construct the input tree T = (V, E) with the input vector v, where V = {v_1, v_2, . . . , v_p} and E = {(v_i, v_j) | (x_i, x_j) ∈ E^t}. For the convenience of presenting our proposed algorithm, we begin with several definitions. 
We also provide some examples for elaborating the definitions in the supplementary file A.1.

Definition 1. For a non-empty tree T = (V, E), we define its root-tree as any non-empty tree T̃ = (Ṽ, Ẽ) that satisfies: 1) Ṽ ⊆ V, 2) Ẽ ⊆ E, and 3) T̃ shares the same root as T.

Definition 2. For a non-empty tree T = (V, E), we define R(T) as the root-tree set containing all its root-trees.

Definition 3. For a non-empty tree T = (V, E), we define

m(T) = max( (Σ_{v_i∈V} v_i) / |V|, 0 ),   (4)

which equals the mean of all the nodes in T if such mean is non-negative, and 0 otherwise.

Definition 4. For a non-empty tree T = (V, E), we define its maximal root-tree as:

Mmax(T) = argmax_{T̃=(Ṽ,Ẽ): T̃∈R(T), m(T̃)=mmax(T)} |Ṽ|,   (5)

where

mmax(T) = max_{T̃∈R(T)} m(T̃)   (6)

is the maximal value of all the root-trees of the tree T. Note that if two root-trees share the same maximal value, (5) selects the one with the largest tree size.

When T̃ = (Ṽ, Ẽ) is a part of a "larger" tree T = (V, E), i.e., Ṽ ⊆ V and Ẽ ⊆ E, we can treat T̃ as a "super-node" of the tree T with value m(T̃). Thus, we have the following definition of a super-tree (note that a super-tree provides a disjoint partition of the given tree):

Definition 5. For a non-empty tree T = (V, E), we define its super-tree as S = (V_S, E_S), which satisfies: 1) each node in V_S = {T_1, T_2, . . . , T_n} is a non-empty tree with T_i = (V_i, E_i), 2) V_i ⊆ V and E_i ⊆ E, 3) V_i ∩ V_j = ∅ for i ≠ j and V = ∪_{i=1}^n V_i, and 4) (T_i, T_j) ∈ E_S if and only if there exists a node in T_j whose parent node is in T_i.

2.1 Proposed Algorithm

We present the pseudo code for solving (3) in Algorithm 1. 
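Definitions 3 and 4 can be made concrete with a small brute-force sketch of our own (illustration only: it enumerates every root-tree, so it does not scale; the child-list tree encoding is an assumption):

```python
from itertools import product

def m(vals, nodes):
    """Definition 3: the mean of the node values if it is non-negative, else 0."""
    s = sum(vals[i] for i in nodes)
    return max(s / len(nodes), 0.0)

def root_trees(children, node=0):
    """Yield every root-tree (Definition 1) of the subtree rooted at `node`
    as a frozenset of node indices: each child subtree is either dropped
    entirely or contributes one of its own root-trees."""
    options = [[frozenset()] + list(root_trees(children, c))
               for c in children[node]]
    for combo in product(*options):
        yield frozenset({node}).union(*combo)

def maximal_root_tree(vals, children):
    """Definition 4: the root-tree maximizing m(.), largest size on ties."""
    return max(root_trees(children), key=lambda t: (m(vals, t), len(t)))

# Sequential list v = (1, 3, -2): the maximal root-tree is the first two
# nodes, with value m = (1 + 3) / 2 = 2.
assert maximal_root_tree([1.0, 3.0, -2.0], [[1], [2], []]) == {0, 1}
```

Note the tie-breaking rule of (5): when every root-tree has value 0, the sort key `(m, size)` picks the whole tree, matching Definition 4.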
The key idea of the proposed\nalgorithm is that, in the i-th call, we \ufb01nd Ti = Mmax(T ), the maximal root-tree of T , set\n\u02dcx corresponding to the nodes of Ti to mi = mmax(T ) = m(Ti), remove Ti from the tree T ,\nand apply Atda to the resulting trees one by one recursively.\n\nAlgorithm 1 A Top-Down Algorithm: Atda\nInput: the tree structure T = (V, E), i\nOutput: \u02dcx \u2208 Rp\n1: Set i = i + 1\n2: Find the maximal root-tree of T , denoted by Ti = (Vi, Ei), and set mi = m(Ti)\n3: if mi > 0 then\n4:\n5:\n\nSet \u02dcxj = mi, \u2200vj \u2208 Vi\nRemove the root-tree Ti from T , denote the resulting trees as \u02dcT1, \u02dcT2, . . . , \u02dcTri , and\napply Atda( \u02dcTj,i), \u2200j = 1, 2, . . . , ri\n\n6: else\n7:\n8: end if\n\nSet \u02dcxj = mi, \u2200vj \u2208 Vi\n\n2.2\n\nIllustration & Justi\ufb01cation\n\nFor a better illustration and justi\ufb01cation of the proposed algorithm, we provide the analysis\nof Atda for a special case\u2014the sequential list\u2014in the supplementary \ufb01le A.2.\n\nLet us analyze Algorithm 1 for the general tree. Figure 2 illustrates solving (3) via Algo-\nrithm 1 for a tree with depth 3. Plot (a) shows a target tree T t, and plot (b) denotes the\ninput tree T . 
The dashed frame of plot (b) shows Mmax(T), the maximal root-tree of T, and we have mmax(T) = 3; thus, we set the corresponding entries of x̃ to 3.

Figure 2: Illustration of Algorithm 1 for solving (3) for a tree with depth 3. Plot (a) shows the target tree T^t, and plots (b-e) illustrate Atda. Specifically, plot (b) denotes the input tree T, with the dashed frame displaying its maximal root-tree; plot (c) depicts the resulting trees after removing the maximal root-tree in plot (b); plot (d) shows the resulting super-tree (we treat each tree enclosed by the dashed frame as a super-node) of the algorithm; plot (e) gives the solution x̃ ∈ R^15; and the edges of plot (f) show the dual variables, from which we can also obtain the optimal solution x̃ (refer to the proof of Theorem 1).

Plot (c) depicts the resulting trees after removing the maximal root-tree in plot (b), and plot (d) shows the maximal root-trees (enclosed by dashed frames) generated by the algorithm. When treating each generated maximal root-tree as a super-node with the value defined in Definition 3, plot (d) is a super-tree of the input tree T. In addition, the super-tree is a max-heap, i.e., the value of the parent node is no less than the values of its child nodes. Plot (e) gives the solution x̃ ∈ R^15. 
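As a complement to the walkthrough, the whole of Algorithm 1 can be sketched in a few lines. This illustrative version of our own finds the maximal root-tree by brute-force enumeration rather than the efficient routine of Section 3, so it is exponential in the worst case and only meant to make the recursion concrete (child-list tree encoding is an assumption):

```python
from itertools import product

def _root_trees(children, r):
    # Enumerate every root-tree of the subtree rooted at r (Definition 1).
    opts = [[frozenset()] + list(_root_trees(children, c)) for c in children[r]]
    for combo in product(*opts):
        yield frozenset({r}).union(*combo)

def _max_root_tree(v, children, r):
    # Definitions 3-4: maximize m(.), break ties by tree size.
    def m(t):
        return max(sum(v[i] for i in t) / len(t), 0.0)
    best = max(_root_trees(children, r), key=lambda t: (m(t), len(t)))
    return best, m(best)

def atda(v, children, root=0, x=None):
    """Algorithm 1 sketch: find the maximal root-tree, assign its value to
    the corresponding entries of x, remove it, and recurse on the remaining
    subtrees (only needed when the value is positive; when it is 0 the
    maximal root-tree already covers the whole remaining tree)."""
    if x is None:
        x = [0.0] * len(v)
    tree, val = _max_root_tree(v, children, root)
    for i in tree:
        x[i] = val
    if val > 0:
        for i in tree:
            for c in children[i]:
                if c not in tree:
                    atda(v, children, c, x)
    return x

# Sequential list v = (1, 3, -2): pool the first two nodes to 2, clip the
# last to 0, giving the projection (2, 2, 0).
assert atda([1.0, 3.0, -2.0], [[1], [2], []]) == [2.0, 2.0, 0.0]
```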
The edges of plot (f) correspond to the values of the dual variables, from which we can also obtain the optimal solution x̃ ∈ R^15. Finally, we can observe that the non-zero entries of x̃ constitute a cut of the original tree.

We verify the correctness of Algorithm 1 for the general tree in the following theorem. We make use of the KKT conditions and a variational inequality [20] in the proof.

Theorem 1. x̃ = Atda(T, 0) provides the unique optimal solution to (3).

Proof: As the objective function of (3) is strictly convex and the constraints are affine, it admits a unique solution. After running Algorithm 1, we obtain the sequences {T_i}_{i=1}^k and {m_i}_{i=1}^k, where k satisfies 1 ≤ k ≤ p. It is easy to verify that the trees T_i, i = 1, 2, . . . , k constitute a disjoint partition of the input tree T. With the sequences {T_i}_{i=1}^k and {m_i}_{i=1}^k, we can construct a super-tree of the input tree T as follows: 1) we treat T_i as a super-node with value m_i, and 2) we put an edge between T_i and T_j if there is an edge between the nodes of T_i and T_j in the input tree T. With Algorithm 1, we can verify that the resulting super-tree has the property that the value of the parent node is no less than the values of its child nodes. Therefore, x̃ = Atda(T, 0) satisfies x̃ ∈ P.

Let x^l and v^l denote the subsets of x and v corresponding to the indices appearing in the subtree T_l, respectively. Denote P^l = {x^l : x^l ≥ 0, x_i ≥ x_j, (v_i, v_j) ∈ E_l}, I_1 = {l : m_l > 0}, and I_2 = {l : m_l = 0}. Our proof is based on the following inequality:

min_{x∈P} (1/2)‖x − v‖²₂ ≥ Σ_{l∈I_1} min_{x^l∈P^l} (1/2)‖x^l − v^l‖²₂ + Σ_{l∈I_2} min_{x^l∈P^l} (1/2)‖x^l − v^l‖²₂,   (7)

which holds as the left hand side has additional inequality constraints compared to the right hand side. 
Our methodology is to show that x̃ = Atda(T, 0) provides the optimal solution to the right hand side of (7), i.e.,

x̃^l = argmin_{x^l∈P^l} (1/2)‖x^l − v^l‖²₂, ∀l ∈ I_1,   (8)

x̃^l = argmin_{x^l∈P^l} (1/2)‖x^l − v^l‖²₂, ∀l ∈ I_2,   (9)

which, together with the facts that (1/2)‖x̃ − v‖²₂ ≥ min_{x∈P} (1/2)‖x − v‖²₂ and x̃ ∈ P, leads to our main argument. Next, we prove (8) by the KKT conditions, and prove (9) by the variational inequality [20].

Firstly, ∀l ∈ I_1, we introduce the dual variable y_ij for the edge (v_i, v_j) ∈ E_l, and y_ii if v_i ∈ L_l, where L_l contains all the leaf nodes of the tree T_l. Denote the root of T_l by v_{r_l}. For all v_i ∈ V_l, v_i ≠ v_{r_l}, we denote its parent node by v_{j_i}, and for the root v_{r_l}, we denote j_{r_l} = r_l. We let

C^l_i = {j | v_j is a child node of v_i in the tree T_l},
R^l_i = {j | v_j is in the subtree of T_l rooted at v_i}.

To prove (8), we verify that the primal variable x̃ = Atda(T, 0) and the dual variable ỹ satisfy the following KKT conditions:

∀(v_i, v_j) ∈ E_l: x̃_i ≥ x̃_j ≥ 0,   (10)
∀(v_i, v_j) ∈ E_l: (x̃_i − x̃_j) ỹ_ij = 0,   (11)
∀v_i ∈ L_l: ỹ_ii x̃_i = 0,   (12)
∀v_i ∈ V_l: x̃_i − v_i − Σ_{j∈C^l_i} ỹ_ij + ỹ_{j_i i} = 0,   (13)
∀(v_i, v_j) ∈ E_l: ỹ_ij ≥ 0,   (14)
∀v_i ∈ L_l: ỹ_ii ≥ 0,   (15)

where ỹ_{j_{r_l} r_l} = 0 (note that ỹ_{j_{r_l} r_l} is a dual variable introduced for the simplicity of presenting (13)), and the dual variable ỹ is set as:

ỹ_ii = 0, ∀v_i ∈ L_l,   (16)
ỹ_{j_i i} = v_i − m_l + Σ_{j∈C^l_i} ỹ_ij, ∀v_i ∈ V_l.   (17)

According to Algorithm 1, x̃_i = m_l > 0, ∀v_i ∈ V_l, l ∈ I_1. 
Thus, we have (10)-(12) and (15). It follows from (17) that (13) holds. According to (16) and (17), we have

ỹ_{j_i i} = Σ_{j∈R^l_i} v_j − |R^l_i| m_l, ∀v_i ∈ V_l,   (18)

where |R^l_i| denotes the number of elements in R^l_i, the subtree of T_l rooted at v_i. From the nature of the maximal root-tree T_l, l ∈ I_1, we have Σ_{j∈R^l_i} v_j ≥ |R^l_i| m_l. Otherwise, if Σ_{j∈R^l_i} v_j < |R^l_i| m_l, we could construct from T_l a new root-tree T̄_l by removing the subtree of T_l rooted at v_i, so that T̄_l achieves a larger value than T_l; this contradicts the fact that T_l, l ∈ I_1, is the maximal root-tree of the working tree T. Therefore, it follows from (18) that (14) holds.

Secondly, we prove (9) by verifying the following optimality condition:

⟨x^l − x̃^l, x̃^l − v^l⟩ ≥ 0, ∀x^l ∈ P^l, l ∈ I_2,   (19)

which is the so-called variational inequality condition for x̃^l being the optimal solution to (9). According to Algorithm 1, if l ∈ I_2, we have x̃_i = 0, ∀v_i ∈ V_l. Thus, (19) is equivalent to

⟨x^l, v^l⟩ ≤ 0, ∀x^l ∈ P^l, l ∈ I_2.   (20)

For a given x^l ∈ P^l, if x_i = 0, ∀v_i ∈ V_l, (20) naturally holds. Next, we consider x^l ≠ 0. Denote by x̄^l_1 the minimal nonzero element of x^l, and by T^1_l = (V^1_l, E^1_l) the tree constructed by removing from T_l the nodes corresponding to the indices in the set {i : x^l_i = 0, v_i ∈ V_l}. It is clear that T^1_l shares the same root as T_l. It follows from Algorithm 1 that Σ_{i:v_i∈V^1_l} v_i ≤ 0. Thus, we have

⟨x^l, v^l⟩ = x̄^l_1 Σ_{i:v_i∈V^1_l} v_i + Σ_{i:v_i∈V^1_l} (x_i − x̄^l_1) v_i ≤ Σ_{i:v_i∈V^1_l} (x_i − x̄^l_1) v_i.

If x^l_i = x̄^l_1, ∀v_i ∈ V^1_l, we arrive at (20). 
Otherwise, we set r = 2, denote by x̄^l_r the minimal nonzero element in the set {x_i − Σ_{j=1}^{r−1} x̄^l_j : v_i ∈ V^{r−1}_l}, and denote by T^r_l = (V^r_l, E^r_l) the subtree of T^{r−1}_l obtained by removing those nodes with indices in the set {i : x^l_i − Σ_{j=1}^{r−1} x̄^l_j = 0, v_i ∈ V^{r−1}_l}. It is clear that T^r_l shares the same root as T^{r−1}_l, and as T_l as well, so that it follows from Algorithm 1 that Σ_{i:v_i∈V^r_l} v_i ≤ 0. Therefore, we have

Σ_{i:v_i∈V^{r−1}_l} (x_i − Σ_{j=1}^{r−1} x̄^l_j) v_i = x̄^l_r Σ_{i:v_i∈V^r_l} v_i + Σ_{i:v_i∈V^r_l} (x_i − Σ_{j=1}^{r} x̄^l_j) v_i ≤ Σ_{i:v_i∈V^r_l} (x_i − Σ_{j=1}^{r} x̄^l_j) v_i.   (21)

Repeating the above process (incrementing r) until V^r_l is empty, we can verify that (20) holds. □

For a better understanding of the proof, we use the edges of Figure 2 (f) to show the dual variables, where the edge connecting v_i and v_j corresponds to the dual variable ỹ_ij, and the edge starting from the leaf node v_i corresponds to the dual variable ỹ_ii. With the dual variables, we can compute x̃ via (13). We note that, for a maximal root-tree with a positive value, the associated dual variables are unique, but for a maximal root-tree with zero value, the associated dual variables may not be unique. For example, in Figure 2 (f), we set ỹ_ii = 1 for i = 12, ỹ_ii = 0 for i = 13, ỹ_ij = 2 for i = 6, j = 12, and ỹ_ij = 2 for i = 6, j = 13. It is easy to check that the dual variables can also be set as follows: ỹ_ii = 0 for i = 12, ỹ_ii = 1 for i = 13, ỹ_ij = 1 for i = 6, j = 12, and ỹ_ij = 3 for i = 6, j = 13.

3 Finding the Maximal Root-Tree

A key operation of Algorithm 1 is to find the maximal root-tree used in Step 2. 
A naive\napproach for \ufb01nding the maximal root-tree of a tree T is to enumerate all possible root-\ntrees in the root-tree set R(T ), and identify the maximal root-tree via (5). We call such\nan approach Anae, which stands for a naive algorithm with enumeration. Although Anae\nis simple to describe, it has a very high time complexity (see the analysis given in supple-\nmentary \ufb01le A.3). To this end, we develop Abuam (A Bottom-Up Algorithm with Merge).\nThe underlying idea is to make use of the special structure of the maximal root-tree de\ufb01ned\nin (5) for avoiding the enumeration of all possible root-trees.\n\nWe begin the discussion with some key properties of the maximal root-tree, and the proof\nis given in the supplementary \ufb01le A.4.\nLemma 1. For a non-empty tree T = (V, E), denote its maximal root-tree as Tmax =\n(Vmax, Emax). Let \u02dcT = ( \u02dcV , \u02dcE) be a root-tree of Tmax. Assume that there are n nodes\nvi1, . . . , vin , which satisfy: 1) vij /\u2208 \u02dcV , 2) vij \u2208 V , and 3) the parent node of vij is in\n\u02dcV . If n \u2265 1, we denote the subtree of T rooted at vij as T j = (V j, Ej), j = 1, 2, . . . 
, n, and their maximal root-trees by T^j_max = (V^j_max, E^j_max), with m̃ = max_{j=1,2,...,n} m(T^j_max). Then, the following statements hold: (1) if n = 0, then Tmax = T̃ = T; (2) if n ≥ 1, m(T̃) = 0, and m̃ = 0, then Tmax = T; (3) if n ≥ 1, m(T̃) > 0, and m(T̃) > m̃, then Tmax = T̃; (4) if n ≥ 1, m(T̃) > 0, and m(T̃) ≤ m̃, then V^j_max ⊆ Vmax, E^j_max ⊆ Emax, and (v_{i_0}, v_{i_j}) ∈ Emax, ∀j : m(T^j_max) = m̃; and (5) if n ≥ 1, m(T̃) = 0, and m̃ > 0, then V^j_max ⊆ Vmax, E^j_max ⊆ Emax, and (v_{i_0}, v_{i_j}) ∈ Emax, ∀j : m(T^j_max) = m̃.

For the convenience of presenting our proposed algorithm, we define the operation "merge" as follows:

Definition 6. Let T = (V, E) be a non-empty tree, and let T_1 = (V^1, E^1) and T_2 = (V^2, E^2) be two trees that satisfy: 1) they are composed of a subset of the nodes and edges of T, i.e., V^1 ⊆ V, V^2 ⊆ V, E^1 ⊆ E, and E^2 ⊆ E; 2) they do not overlap, i.e., V^1 ∩ V^2 = ∅ and E^1 ∩ E^2 = ∅; and 3) in the tree T, v_{i_2}, the root node of T_2, is a child of v_{i_1}, a leaf node of T_1. We define the operation "merge" as T̃ = merge(T_1, T_2, T), where T̃ = (Ṽ, Ẽ) with Ṽ = V^1 ∪ V^2 and Ẽ = E^1 ∪ E^2 ∪ {(v_{i_1}, v_{i_2})}.

Next, we make use of Lemma 1 to efficiently compute the maximal root-tree, and present the pseudo code for Abuam in Algorithm 2. 
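The merge operation of Definition 6 admits a direct sketch. In this illustrative version of our own, trees are encoded as (nodes, edges) pairs of frozensets, and the connecting edge (i1, i2) is passed explicitly instead of the full tree T; both choices are simplifying assumptions:

```python
def merge(t1, t2, connecting_edge):
    """Definition 6 sketch: attach t2, whose root is (in the full tree) a
    child of a leaf of t1, to t1 via the connecting edge."""
    (v1, e1), (v2, e2) = t1, t2
    assert not (v1 & v2), "Definition 6 requires disjoint node sets"
    i1, i2 = connecting_edge
    assert i1 in v1 and i2 in v2, "edge must join a node of t1 to the root of t2"
    return (v1 | v2, e1 | e2 | {connecting_edge})

# Merging the chain {0 -> 1} with the single node {2} along the edge (1, 2):
t = merge((frozenset({0, 1}), frozenset({(0, 1)})),
          (frozenset({2}), frozenset()),
          (1, 2))
assert t == (frozenset({0, 1, 2}), frozenset({(0, 1), (1, 2)}))
```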
We provide the illustration of the proposed algorithm and the analysis of its computational cost in the supplementary files A.5 and A.6, respectively.

Algorithm 2 A Bottom-Up Algorithm with Merge: Abuam
Input: the input tree T = (V, E) rooted at v_{i_0}
Output: the maximal root-tree Tmax = (Vmax, Emax)
1: Set T_0 = (V_0, E_0), where V_0 = {v_{i_0}} and E_0 = ∅
2: if v_{i_0} does not have a child node in T then
3:   Set Tmax = T_0, return
4: end if
5: while 1 do
6:   Set m̃ = 0; denote by v_{i_1}, . . . , v_{i_n} the n nodes that satisfy: 1) v_{i_j} ∉ V_0, 2) v_{i_j} ∈ V, and 3) the parent node of v_{i_j} is in V_0; and denote by T^j = (V^j, E^j), j = 1, 2, . . . , n, the subtree of T rooted at v_{i_j}
7:   if n = 0 then
8:     Set Tmax = T_0 = T, return
9:   end if
10:  for j = 1 to n do
11:    Set T^j_max = Abuam(T^j), and m̃ = max(m(T^j_max), m̃)
12:  end for
13:  if m(T_0) = m̃ = 0 then
14:    Set Tmax = T, return
15:  else if m(T_0) > 0 and m(T_0) > m̃ then
16:    Set Tmax = T_0, return
17:  else
18:    Set T_0 = merge(T_0, T^j_max, T), ∀j : m(T^j_max) = m̃
19:  end if
20: end while

Making use of the fact that T_0 is always a valid root-tree of Tmax, the maximal root-tree of T, we can easily prove the following result using Lemma 1.

Theorem 2. Tmax returned by Algorithm 2 is the maximal root-tree of the input tree T.

4 Numerical Simulations

Effectiveness of the Max-Heap Structure We test the effectiveness of the max-heap structure for linear regression b = Ax, following the same experimental setting as in [17]. Specifically, the elements of A ∈ R^{n×p} are generated i.i.d. from a Gaussian distribution with zero mean, and the columns of A are then normalized to have unit length. The regression vector x has p = 127 nonincreasing elements, where the first 10 elements are set as x*_i = 11 − i, i = 1, 2, . . .
, 10, and the rest are zeros. We compare with the following three approaches: Lasso [23], Group Lasso [25], and Wedge [17]. Lasso makes no use of such ordering, while Wedge incorporates the structure by using an auxiliary ordered variable. For Group Lasso and Max-Heap, we try binary-tree grouping and list-tree grouping, where the associated trees are a full binary tree and a sequential list, respectively. The regression vector is placed on the tree so that the closer a node is to the root, the larger the element placed on it. In Group Lasso, the nodes appearing in the same subtree form a group. For the compared approaches, we use the implementations provided in [17]²; for Max-Heap, we solve (2) with f(x) = (1/2)‖Ax − b‖²₂ + ρ‖x‖₁ for some small ρ = r × ‖Aᵀb‖∞ (we set r = 10⁻⁴ and 10⁻⁸ for the binary-tree grouping and the list-tree grouping, respectively) and apply the accelerated gradient descent approach [19] with our proposed Euclidean projection. We compute the average model error ‖x − x*‖₂ over 50 independent runs, and report the results with varying sample size n in Figure 3 (a) & (b). As expected, GL-binary, MH-binary, Wedge, GL-list and MH-list outperform Lasso, which does not incorporate such ordering information. MH-binary performs better than GL-binary, and MH-list performs better than Wedge and GL-list, due to the direct usage of such ordering information. 
In addition, the list-tree grouping performs better than the binary-tree grouping, as it makes better use of such ordering information.

²http://www.cs.ucl.ac.uk/staff/M.Pontil/software/sparsity.html

[Figure 3: plots (a) and (b) show the model error versus sample size for the binary-tree and list-tree orderings; plots (c) and (d) show the computational time versus p for the sequential list, full binary tree, and tree of depth 1 under Gaussian and uniform distributions for v; plots (e) and (f) show the computational time over random initializations of v for the full binary tree with d = 10, 12, . . . , 20.]

Figure 3: Simulation results. In plots (a) and (b), we show the average model error ‖x − x*‖₂ over 50 independent runs by different approaches with the full binary-tree ordering and the list-tree ordering. 
In plots (c) and (d), we report the computational time (in seconds) of the proposed Atda (averaged over 100 runs) with different randomly initialized inputs v. In plots (e) and (f), we show the computational time of Atda over 100 runs.

Efficiency of the Proposed Projection We test the efficiency of the proposed Atda approach for solving the Euclidean projection onto the non-negative max-heap, equipped with our proposed Abuam approach for finding the maximal root-trees. In the experiments, we make use of the three tree structures depicted in Figure 1, and try two different distributions for randomly and independently generating the entries of the input v ∈ R^p: 1) a Gaussian distribution with zero mean, and 2) the uniform distribution on [0, 1]. In Figure 3 (c) & (d), we report the average computational time (in seconds) over 100 runs under different values of p = 2^{d+1} − 1, where d = 10, 12, . . . , 20. We can observe that the proposed algorithm scales linearly with p. In Figure 3 (e) & (f), we report the computational time of Atda over 100 runs when the ordered tree structure is a full binary tree. The results show that the computational time of the proposed algorithm is relatively stable across different runs, especially for larger d or p. Note that the source code for our proposed algorithm has been included in the SLEP package [13].

5 Conclusion

In this paper, we have developed an efficient algorithm for the computation of the Euclidean projection onto a non-negative max-heap. The proposed algorithm has a (worst-case) linear time complexity for a sequential list, and O(p²) for a general tree. Empirical results show that: 1) the proposed approach deals with the ordering information better than existing approaches, and 2) the proposed algorithm has an expected linear time complexity for the sequential list, the full binary tree, and the tree of depth 1. 
It will be interesting to explore whether the proposed Abuam has a worst-case linear (or linearithmic) time complexity for the binary tree. We plan to apply the proposed algorithms to real-world applications with an ordered tree structure. We also plan to extend our approaches to general hierarchical structures.

Acknowledgments

This work was supported by NSF IIS-0812551, IIS-0953662, MCB-1026710, CCF-1025177, NGA HM1582-08-1-0016, and NSFC 60905035, 61035003.

References

[1] E. Berg, M. Schmidt, M. P. Friedlander, and K. Murphy. Group sparsity via linear-time projection. Tech. Rep. TR-2008-09, Department of Computer Science, University of British Columbia, Vancouver, July 2008.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] N. Choi, W. Li, and J. Zhu. Variable selection with the strong heredity constraint and its oracle property. Journal of the American Statistical Association, 105:354–364, 2010.
[4] Z. Dostál. Box constrained quadratic programming with proportioning and projections. SIAM Journal on Optimization, 7(3):871–887, 1997.
[5] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projection onto the ℓ1-ball for learning in high dimensions. In International Conference on Machine Learning, 2008.
[6] M. Hamada and C. Wu. Analysis of designed experiments with complex aliasing. Journal of Quality Technology, 24:130–137, 1992.
[7] J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In International Conference on Machine Learning, 2009.
[8] L. Jacob, G. Obozinski, and J. Vert. Group lasso with overlap and graph lasso. In International Conference on Machine Learning, 2009.
[9] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523v2, 2009.
[10] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach.
Proximal methods for sparse hierarchical dictionary learning. In International Conference on Machine Learning, 2010.
[11] S. Kim and E. P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In International Conference on Machine Learning, 2010.
[12] X. Li, N. Sundarsanam, and D. Frey. Regularities in data from factorial experiments. Complexity, 11:32–45, 2006.
[13] J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.
[14] J. Liu and J. Ye. Efficient Euclidean projections in linear time. In International Conference on Machine Learning, 2009.
[15] J. Liu and J. Ye. Moreau-Yosida regularization for grouped tree structure learning. In Advances in Neural Information Processing Systems, 2010.
[16] R. Luss, S. Rosset, and M. Shahar. Decomposing isotonic regression for efficiently solving large problems. In Advances in Neural Information Processing Systems, 2010.
[17] C. Micchelli, J. Morales, and M. Pontil. A family of penalty functions for structured sparsity. In Advances in Neural Information Processing Systems 23, pages 1612–1623, 2010.
[18] J. Nelder. The selection of terms in response-surface models—how strong is the weak-heredity principle? The American Statistician, 52:315–318, 1998.
[19] A. Nemirovski. Efficient methods in convex programming. Lecture Notes, 1994.
[20] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
[21] P. M. Pardalos and G. Xue. Algorithms for a class of isotonic regression problems. Algorithmica, 23:211–222, 1999.
[22] S. Shalev-Shwartz and Y. Singer. Efficient learning of label ranking by soft projections onto polyhedra. Journal of Machine Learning Research, 7:1567–1599, 2006.
[23] R. Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[24] M. Yuan, V. R. Joseph, and H. Zou. Structured variable selection and estimation. Annals of Applied Statistics, 3:1738–1757, 2009.
[25] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468–3497, 2009.
[26] L. W. Zhong and J. T. Kwok. Efficient sparse modeling with automatic feature grouping. In International Conference on Machine Learning, 2011.