{"title": "Nonparametric regression and classification with joint sparsity constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 969, "page_last": 976, "abstract": "We propose new families of models and algorithms for high-dimensional nonparametric learning with joint sparsity constraints. Our approach is based on a regularization method that enforces common sparsity patterns across different function components in a nonparametric additive model. The algorithms employ a coordinate descent approach that is based on a functional soft-thresholding operator. The framework yields several new models, including multi-task sparse additive models, multi-response sparse additive models, and sparse additive multi-category logistic regression. The methods are illustrated with experiments on synthetic data and gene microarray data.", "full_text": "Nonparametric Regression and Classi\ufb01cation with\n\nJoint Sparsity Constraints\n\nHan Liu John Lafferty Larry Wasserman\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nAbstract\n\nWe propose new families of models and algorithms for high-dimensional nonpara-\nmetric learning with joint sparsity constraints. Our approach is based on a regular-\nization method that enforces common sparsity patterns across different function\ncomponents in a nonparametric additive model. The algorithms employ a coor-\ndinate descent approach that is based on a functional soft-thresholding operator.\nThe framework yields several new models, including multi-task sparse additive\nmodels, multi-response sparse additive models, and sparse additive multi-category\nlogistic regression. The methods are illustrated with experiments on synthetic data\nand gene microarray data.\n\n1 Introduction\n\nMany learning problems can be naturally formulated in terms of multi-category classi\ufb01cation or\nmulti-task regression. 
In a multi-category classification problem, it is required to discriminate between the different categories using a set of high-dimensional feature vectors; for instance, classifying the type of tumor in a cancer patient from gene expression data. In a multi-task regression problem, it is of interest to form several regression estimators for related data sets that share common types of covariates; for instance, predicting test scores across different school districts. In other areas, such as multi-channel signal processing, it is of interest to simultaneously decompose multiple signals in terms of a large common overcomplete dictionary, which is a multi-response regression problem. In each case, while the details of the estimators vary from instance to instance, across categories, or tasks, they may share a common sparsity pattern of relevant variables selected from a high-dimensional space. How to find this common sparsity pattern is an interesting learning task.

In the parametric setting, progress has recently been made on such problems using regularization based on the sum of supremum norms (Turlach et al., 2005; Tropp et al., 2006; Zhang, 2006). For example, consider the K-task linear regression problem y_i^(k) = β_0^(k) + Σ_{j=1}^p β_j^(k) x_ij^(k) + ε_i^(k), where the superscript k indexes the tasks, and the subscript i = 1, ..., n_k indexes the instances within a task. Using quadratic loss, Zhang (2006) suggests the following estimator

    β̂ = argmin_β { Σ_{k=1}^K [ (1/(2n_k)) Σ_{i=1}^{n_k} ( y_i^(k) − β_0^(k) − Σ_{j=1}^p β_j^(k) x_ij^(k) )² ] + λ Σ_{j=1}^p max_k |β_j^(k)| }     (1)

where max_k |β_j^(k)| = ‖β_j‖_∞ is the sup-norm of the vector β_j ≡ (β_j^(1), ..., β_j^(K))^T of coefficients for the jth feature across different tasks. The sum of sup-norms regularization has the effect of "grouping" the elements in β_j such that they can be shrunk towards zero simultaneously. The problems of multi-response (or multivariate) regression and multi-category classification can be viewed as special cases of the multi-task regression problem where tasks share the same design matrix. Turlach et al. (2005) and Fornasier and Rauhut (2008) propose the same sum of sup-norms regularization as in (1) for such problems in the linear model setting. In related work, Zhang et al. (2008) propose the sup-norm support vector machine, demonstrating its effectiveness on gene data.

In this paper we develop new methods for nonparametric estimation for such multi-task and multi-category regression and classification problems. Rather than fitting a linear model, we instead estimate smooth functions of the data, and formulate a regularization framework that encourages joint functional sparsity, where the component functions can be different across tasks while sharing a common sparsity pattern. Building on a recently proposed method called sparse additive models, or "SpAM" (Ravikumar et al., 2007), we propose a convex regularization functional that can be viewed as a nonparametric analog of the sum of sup-norms regularization for linear models.
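To make the penalty in (1) concrete, the following NumPy sketch (the function name and data are ours, purely illustrative) evaluates the sum-of-sup-norms term λ Σ_j max_k |β_j^(k)| for a K × p coefficient matrix; a feature escapes the penalty only if its coefficients are zero across all tasks, which is what drives joint sparsity.

```python
import numpy as np

def sum_of_sup_norms(beta, lam):
    """Evaluate the penalty lam * sum_j max_k |beta[k, j]|.

    beta: (K, p) array of per-task linear coefficients; row k holds
    the coefficients beta^(k) for task k.  Illustrative helper only.
    """
    return lam * np.abs(beta).max(axis=0).sum()

# Two tasks, three features; feature 2 is zero in *both* tasks, so it is
# the only feature that contributes nothing to the penalty.
beta = np.array([[1.0, 0.0, -2.0],
                 [0.5, 0.0,  3.0]])
print(sum_of_sup_norms(beta, lam=1.0))  # per-feature maxima 1.0, 0.0, 3.0 -> 4.0
```

Because the penalty charges each feature only for its largest coefficient across tasks, the remaining task coefficients for an already-selected feature come "for free," encouraging the tasks to reuse the same variables.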
Based on this regularization functional, we develop new models for nonparametric multi-task regression and classification, including multi-task sparse additive models (MT-SpAM), multi-response sparse additive models (MR-SpAM), and sparse multi-category additive logistic regression (SMALR). The contributions of this work include (1) an efficient iterative algorithm based on a functional soft-thresholding operator derived from subdifferential calculus, leading to the multi-task and multi-response SpAM procedures, (2) a penalized local scoring algorithm that corresponds to fitting a sequence of multi-response SpAM estimates for sparse multi-category additive logistic regression, and (3) the successful application of this methodology to multi-category tumor classification and biomarker discovery from gene microarray data.

2 Nonparametric Models for Joint Functional Sparsity

We begin by introducing some notation. If X has distribution P_X, and f is a function of x, its L_2(P_X) norm is denoted by ‖f‖² = ∫_X f²(x) dP_X = E(f²). If v = (v_1, ..., v_n)^T is a vector, define ‖v‖_n² = (1/n) Σ_{j=1}^n v_j² and ‖v‖_∞ = max_j |v_j|. For a p-dimensional random vector (X_1, ..., X_p), let H_j denote the Hilbert subspace L_2(P_{X_j}) of P_{X_j}-measurable functions f_j(x_j) of the single scalar variable X_j with zero mean, i.e., E[f_j(X_j)] = 0. The inner product on this space is defined as ⟨f_j, g_j⟩ = E[f_j(X_j) g_j(X_j)]. In this paper, we mainly study multivariate functions f(x_1, ..., x_p) that have an additive form, i.e., f(x_1, ..., x_p) = α + Σ_j f_j(x_j), with f_j ∈ H_j for j = 1, ..., p. With H ≡ {1} ⊕ H_1 ⊕ H_2 ⊕ ... ⊕ H_p denoting the direct sum Hilbert space, we have that f ∈ H.

2.1 Multi-task/Multi-response Sparse Additive Models

In a K-task regression problem, we have observations {(x_i^(k), y_i^(k)), i = 1, ..., n_k, k = 1, ..., K}, where x_i^(k) = (x_i1^(k), ..., x_ip^(k))^T is a p-dimensional covariate vector, the superscript k indexes tasks and i indexes the i.i.d. samples for each task. In the following, for notational simplicity, we assume that n_1 = ... = n_K = n. We also assume different tasks are comparable and each Y^(k) and X_j^(k) has been standardized, i.e., has mean zero and variance one. This is not really a restriction of the model, since a straightforward weighting scheme can be adopted to extend our approach to handle noncomparable tasks. We assume the true model is E(Y^(k) | X^(k) = x^(k)) = f^(k)(x^(k)) ≡ Σ_{j=1}^p f_j^(k)(x_j^(k)) for k = 1, ..., K, where, for simplicity, we take all intercepts α^(k) to be zero. Let Q_{f^(k)}(x, y) = (y − f^(k)(x))² denote the quadratic loss. To encourage common sparsity patterns across different function components, we define the regularization functional Φ_K(f) by

    Φ_K(f) = Σ_{j=1}^p max_{k=1,...,K} ‖f_j^(k)‖.     (2)

The regularization functional Φ_K(f) naturally combines the idea of the sum of sup-norms penalty for parametric joint sparsity and the regularization idea of SpAM for nonparametric functional sparsity; if K = 1, then Φ_1(f) is just the regularization term introduced for (single-task) sparse additive models by Ravikumar et al. (2007). If each f_j^(k) is a linear function, then Φ_K(f) reduces to the sum of sup-norms regularization term as in (1). We shall employ Φ_K(f) to induce joint functional sparsity in nonparametric multi-task inference.

Using this regularization functional, the multi-task sparse additive model (MT-SpAM) is formulated as a penalized M-estimator, given by the following optimization problem

    f̂^(1), ..., f̂^(K) = argmin_{f^(1),...,f^(K)} { Σ_{k=1}^K (1/(2n)) Σ_{i=1}^n Q_{f^(k)}(x_i^(k), y_i^(k)) + λ Φ_K(f) }     (3)

where f_j^(k) ∈ H_j^(k) for j = 1, ..., p and k = 1, ..., K, and λ > 0 is a regularization parameter. The multi-response sparse additive model (MR-SpAM) has exactly the same formulation as in (3), except that a common design matrix is used across the K different tasks.

2.2 Sparse Multi-Category Additive Logistic Regression

In a K-category classification problem, we are given n examples (x_1, y_1), ..., (x_n, y_n) where x_i = (x_i1, ..., x_ip)^T is a p-dimensional predictor vector and y_i = (y_i^(1), ..., y_i^(K−1))^T is a (K − 1)-dimensional response vector in which at most one element can be one, with all the others being zero. Here, we adopt the common "1-of-K" labeling convention, where y_i^(k) = 1 if x_i has category k and y_i^(k) = 0 otherwise; if all elements of y_i are zero, then x_i is assigned the K-th category. The multi-category additive logistic regression model is

    P(Y^(k) = 1 | X = x) = exp(f^(k)(x)) / (1 + Σ_{k'=1}^{K−1} exp(f^(k')(x)))     (4)

where f^(k)(x) = α^(k) + Σ_{j=1}^p f_j^(k)(x_j) has an additive form. We define f = (f^(1), ..., f^(K−1)) to be a discriminant function and p_f^(k)(x) = P(Y^(k) = 1 | X = x) to be the conditional probability of category k given X = x. The logistic regression classifier h_f(·) induced by f, which is a mapping from the sample space to the category labels, is simply given by h_f(x) = argmax_{k=1,...,K} p_f^(k)(x). If a variable X_j is irrelevant, then all of the component functions f_j^(k) are identically zero, for each k = 1, 2, ..., K − 1.
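As a quick illustration of the class probabilities in (4) (a sketch with our own naming, not code from the paper): given the K − 1 discriminant values f^(1)(x), ..., f^(K−1)(x), with the K-th reference category implicitly carrying discriminant zero, the probabilities of all K categories are

```python
import numpy as np

def category_probs(f):
    """Class probabilities under the multi-category logistic model (4).

    f: array of the K-1 discriminant values f^(1)(x), ..., f^(K-1)(x);
    the K-th (reference) category implicitly has discriminant 0.
    Returns a length-K probability vector summing to one.
    """
    e = np.exp(f)
    denom = 1.0 + e.sum()
    return np.append(e / denom, 1.0 / denom)  # last entry = reference category K

p = category_probs(np.array([0.0, 0.0]))  # K = 3 with all discriminants zero
print(p)  # uniform: each category has probability 1/3
```

The induced classifier h_f(x) is then simply the index of the largest entry of `category_probs(f)`.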
This motivates the use of the regularization functional Φ_{K−1}(f) to zero out entire vectors f_j = (f_j^(1), ..., f_j^(K−1)). Denoting

    ℓ_f(x, y) = Σ_{k=1}^{K−1} y^(k) f^(k)(x) − log( 1 + Σ_{k'=1}^{K−1} exp f^(k')(x) )

as the multinomial log-loss, the sparse multi-category additive logistic regression estimator (SMALR) is thus formulated as the solution to the optimization problem

    f̂^(1), ..., f̂^(K−1) = argmin_{f^(1),...,f^(K−1)} { −(1/n) Σ_{i=1}^n ℓ_f(x_i, y_i) + λ Φ_{K−1}(f) }     (5)

where f_j^(k) ∈ H_j^(k) for j = 1, ..., p and k = 1, ..., K − 1.

3 Simultaneous Sparse Backfitting

We use a blockwise coordinate descent algorithm to minimize the functional defined in (3). We first formulate the population version of the problem by replacing sample averages by their expectations. We then derive stationary conditions for the optimum and obtain a population version algorithm for computing the solution by a series of soft-thresholded univariate conditional expectations. Finally, a finite sample version of the algorithm can be derived by plugging in nonparametric smoothers for these conditional expectations.

For the jth block of component functions f_j^(1), ..., f_j^(K), let R_j^(k) = Y^(k) − Σ_{l≠j} f_l^(k)(X_l^(k)) denote the partial residuals. Assuming all but the functions in the jth block to be fixed, the optimization problem is reduced to

    f̂_j^(1), ..., f̂_j^(K) = argmin_{f_j^(1),...,f_j^(K)} { Σ_{k=1}^K (1/2) E[ ( R_j^(k) − f_j^(k)(X_j^(k)) )² ] + λ max_{k=1,...,K} ‖f_j^(k)‖ }.     (6)

The following result characterizes the solution to (6).

Theorem 1. Let P_j^(k) = E( R_j^(k) | X_j^(k) ) and s_j^(k) = ‖P_j^(k)‖, and order the indices according to s_j^(k_1) ≥ s_j^(k_2) ≥ ... ≥ s_j^(k_K). Then the solution to (6) is given by

    f_j^(k_i) = P_j^(k_i)   for i > m*,
    f_j^(k_i) = [ (1/m*)( Σ_{i'=1}^{m*} s_j^(k_{i'}) − λ ) ]_+ · P_j^(k_i) / s_j^(k_i)   for i ≤ m*,     (7)

where m* = argmax_m (1/m)( Σ_{i'=1}^m s_j^(k_{i'}) − λ ) and [·]_+ denotes the positive part.

Therefore, the optimization problem in (6) is solved by a soft-thresholding operator, given in equation (7), which we shall denote as

    ( f_j^(1), ..., f_j^(K) ) = Soft_λ^(∞)[ R_j^(1), ..., R_j^(K) ].     (8)

While the proof of this result is lengthy, we sketch the key steps below, which are a functional extension of the subdifferential calculus approach of Fornasier and Rauhut (2008) in the linear setting. First, we formulate an optimality condition in terms of the Gâteaux derivative as follows.

Lemma 2. The functions f_j^(k) are solutions to (6) if and only if f_j^(k) − P_j^(k) + λ u_k v_k = 0 (almost surely), for k = 1, ..., K, where u_k are scalars and v_k are measurable functions of X_j^(k), with (u_1, ..., u_K)^T ∈ ∂‖·‖_∞ evaluated at (‖f_j^(1)‖, ..., ‖f_j^(K)‖)^T and v_k ∈ ∂‖f_j^(k)‖, k = 1, ..., K.

Here the former denotes the subdifferential of the convex functional ‖·‖_∞ evaluated at (‖f_j^(1)‖, ..., ‖f_j^(K)‖)^T; it lies in a K-dimensional Euclidean space. The latter denotes the subdifferential of ‖f_j^(k)‖, which is a set of functions. Next, the following proposition from Rockafellar and Wets (1998) is used to characterize the subdifferential of sup-norms.

Lemma 3. The subdifferential of ‖·‖_∞ on R^K is

    ∂‖·‖_∞ |_x = B_1(1)   if x = 0,
    ∂‖·‖_∞ |_x = conv{ sign(x_k) e_k : |x_k| = ‖x‖_∞ }   otherwise,

where B_1(1) denotes the ℓ_1 ball of radius one, conv(A) denotes the convex hull of set A, and e_k is the k-th canonical unit vector in R^K.

Using Lemma 2 and Lemma 3, the proof of Theorem 1 proceeds by considering three cases for the sup-norm subdifferential evaluated at (‖f_j^(1)‖, ..., ‖f_j^(K)‖)^T: (1) ‖f_j^(k)‖ = 0 for k = 1, ..., K; (2) there exists a unique k such that ‖f_j^(k)‖ = max_{m=1,...,K} ‖f_j^(m)‖ ≠ 0; (3) there exist at least two k ≠ k' such that ‖f_j^(k)‖ = ‖f_j^(k')‖ = max_{k''=1,...,K} ‖f_j^(k'')‖ ≠ 0. The derivations for cases (1) and (2) are relatively straightforward, but for case (3) we prove the following.

Lemma 4. The sup-norm is attained precisely at m > 1 entries if and only if m is the largest number such that s_j^(k_m) ≥ (1/(m−1))( Σ_{i'=1}^{m−1} s_j^(k_{i'}) − λ ).

The proof of Theorem 1 then follows from the above lemmas and some calculus. Based on this result, the data version of the soft-thresholding operator is obtained by replacing the conditional expectation P_j^(k) = E( R_j^(k) | X_j^(k) ) by S_j^(k) R_j^(k), where S_j^(k) is a nonparametric smoother for variable X_j^(k), e.g., a local linear or spline smoother; see Figure 1. The resulting simultaneous sparse backfitting algorithm for multi-task and multi-response sparse additive models (MT-SpAM and MR-SpAM) is shown in Figure 2. The algorithm for the multi-response case (MR-SpAM) has S_j^(1) = ... = S_j^(K) since there is only a common design matrix.

SOFT-THRESHOLDING OPERATOR Soft_λ^(∞)[ R_j^(1), ..., R_j^(K); S_j^(1), ..., S_j^(K) ]: DATA VERSION

Input: Smoothing matrices S_j^(k), residuals R_j^(k) for k = 1, ..., K, regularization parameter λ.
  (1) Estimate P_j^(k) = E[ R_j^(k) | X_j^(k) ] by smoothing: P̂_j^(k) = S_j^(k) R_j^(k);
  (2) Estimate norms: ŝ_j^(k) = ‖P̂_j^(k)‖_n and order the indices according to ŝ_j^(k_1) ≥ ŝ_j^(k_2) ≥ ... ≥ ŝ_j^(k_K);
  (3) Find m* = argmax_m (1/m)( Σ_{i'=1}^m ŝ_j^(k_{i'}) − λ ) and calculate
        f̂_j^(k_i) = P̂_j^(k_i)   for i > m*,
        f̂_j^(k_i) = [ (1/m*)( Σ_{i'=1}^{m*} ŝ_j^(k_{i'}) − λ ) ]_+ · P̂_j^(k_i) / ŝ_j^(k_i)   for i ≤ m*;
  (4) Center: f̂_j^(k) ← f̂_j^(k) − mean(f̂_j^(k)) for k = 1, ..., K.
Output: Functions f̂_j^(k) for k = 1, ..., K.

Figure 1: Data version of the soft-thresholding operator.

MULTI-TASK AND MULTI-RESPONSE SPAM

Input: Data (x_i^(k), y_i^(k)), i = 1, ..., n, k = 1, ..., K and regularization parameter λ.
Initialize: Set f̂_j^(k) = 0 and compute smoothers S_j^(k) for j = 1, ..., p and k = 1, ..., K;
Iterate until convergence:
  For each j = 1, ..., p:
    (1) Compute residuals: R_j^(k) = y^(k) − Σ_{l≠j} f̂_l^(k) for k = 1, ..., K;
    (2) Threshold: ( f̂_j^(1), ..., f̂_j^(K) ) ← Soft_λ^(∞)[ R_j^(1), ..., R_j^(K); S_j^(1), ..., S_j^(K) ].
Output: Functions f̂^(k) for k = 1, ..., K.

Figure 2: The simultaneous sparse backfitting algorithm for MT-SpAM or MR-SpAM. For the multi-response case, the same smoothing matrices are used for each k.
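The thresholding step of Figure 1 can be sketched in a few lines of NumPy (our illustrative translation, with the smoothing step already applied and the centering step omitted): the operator shrinks the largest m* component norms to a common value, leaves the remaining components untouched, and zeroes the entire block when λ dominates all the norms.

```python
import numpy as np

def soft_threshold(P_hat, lam):
    """Functional soft-thresholding operator of Theorem 1 / Figure 1 (a sketch).

    P_hat: (K, n) array; row k is the smoothed residual P_hat_j^(k) evaluated
    at the n sample points (i.e., S_j^(k) R_j^(k) has already been applied).
    Returns the (K, n) array of thresholded components f_j^(k).
    """
    s = np.sqrt((P_hat ** 2).mean(axis=1))        # empirical norms s_j^(k)
    order = np.argsort(-s)                        # s^(k1) >= s^(k2) >= ...
    crit = (np.cumsum(s[order]) - lam) / np.arange(1, len(s) + 1)
    m_star = int(np.argmax(crit)) + 1             # m* = argmax_m (1/m)(sum s - lam)
    common = max(crit[m_star - 1], 0.0)           # shared post-shrinkage norm, [.]_+
    f = P_hat.copy()                              # components ranked below m* pass through
    for k in order[:m_star]:                      # top m* norms shrunk to the common value
        f[k] = common * P_hat[k] / s[k] if s[k] > 0 else 0.0
    return f
```

For example, with two tasks whose smoothed residuals have empirical norms 2 and 1 and λ = 0.5, only the larger component is shrunk (to norm 1.5) while the smaller passes through unchanged; when λ is so large that even (Σ_k s^(k) − λ)/K is nonpositive, every component is zeroed, which is how entire variables are removed jointly.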
3.1 Penalized Local Scoring Algorithm for SMALR

We now derive a penalized local scoring algorithm for sparse multi-category additive logistic regression (SMALR), which can be viewed as a variant of Newton's method in function space. At each iteration, a quadratic approximation to the loss is used as a surrogate functional, with the regularization term added to induce joint functional sparsity. However, a technical difficulty is that the approximate quadratic problem in each iteration is weighted by a non-diagonal matrix in function space, so a trivial extension of the algorithm in (Ravikumar et al., 2007) for sparse binary nonparametric logistic regression does not apply. To tackle this problem, we use an auxiliary function to lower bound the log-loss, as in (Krishnapuram et al., 2005).

The population version of the log-loss is L(f) = E[ℓ_f(X, Y)] with f = (f^(1), ..., f^(K−1)). A second-order Lagrange form Taylor expansion of L(f) at f̂ is then

    L(f) = L(f̂) + E[ ∇L(f̂)^T (f − f̂) ] + (1/2) E[ (f − f̂)^T H(f̃) (f − f̂) ]     (9)

for some function f̃, where the gradient is ∇L(f̂) = Y − p_f̂(X) with p_f̂(X) = ( p_f̂(Y^(1) = 1 | X), ..., p_f̂(Y^(K−1) = 1 | X) )^T, and the Hessian is H(f̃) = −diag( p_f̃(X) ) + p_f̃(X) p_f̃(X)^T. Defining B = −(1/4) I_{K−1}, it is straightforward to show that B ⪯ H(f̃), i.e., H(f̃) − B is positive-definite.
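As a sanity check of this curvature bound in the simplest special case (K = 2, a single discriminant; the snippet is our own illustration, not the paper's code): the binary log-likelihood L(f) = yf − log(1 + e^f) has second derivative −σ(f)(1 − σ(f)) ≥ −1/4, so the quadratic surrogate built with curvature −1/4 lower-bounds L everywhere, which is what makes each local scoring step a valid minorization.

```python
import numpy as np

# Binary (K = 2) check that the quadratic surrogate with curvature B = -1/4
# minorizes the log-likelihood L(f) = y*f - log(1 + exp(f)); illustration only.
y = 1.0
L = lambda f: y * f - np.log1p(np.exp(f))
f_hat = 0.3
grad = y - 1.0 / (1.0 + np.exp(-f_hat))      # gradient of L at f_hat: y - p
for f in np.linspace(-5.0, 5.0, 101):
    surrogate = L(f_hat) + grad * (f - f_hat) - 0.125 * (f - f_hat) ** 2
    assert surrogate <= L(f) + 1e-12          # quadratic lower bound holds
print("quadratic minorization verified on a grid")
```

Maximizing the surrogate instead of L itself is what turns each iteration into a penalized least squares problem on transformed responses.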
Therefore, we have that

    L(f) ≥ L(f̂) + E[ ∇L(f̂)^T (f − f̂) ] + (1/2) E[ (f − f̂)^T B (f − f̂) ].     (10)

SMALR: SPARSE MULTI-CATEGORY ADDITIVE LOGISTIC REGRESSION

Input: Data (x_i, y_i), i = 1, ..., n and regularization parameter λ.
Initialize: f̂_j^(k) = 0 and α̂^(k) = log( Σ_{i=1}^n y_i^(k) / ( n − Σ_{i=1}^n Σ_{k'=1}^{K−1} y_i^(k') ) ), k = 1, ..., K − 1.
Iterate until convergence:
  (1) Compute p̂^(k)(x_i) ≡ P(Y^(k) = 1 | X = x_i) as in (4) for k = 1, ..., K − 1;
  (2) Calculate the transformed responses Z_i^(k) = 4( y_i^(k) − p̂^(k)(x_i) ) + α̂^(k) + Σ_{j=1}^p f̂_j^(k)(x_ij) for k = 1, ..., K − 1 and i = 1, ..., n;
  (3) Call the subroutine ( f̂^(1), ..., f̂^(K−1) ) ← MR-SpAM( (x_i, Z_i^(k))_{i=1}^n, √2 λ );
  (4) Adjust the intercepts: α̂^(k) ← (1/n) Σ_{i=1}^n Z_i^(k).
Output: Functions f̂^(k) and intercepts α̂^(k) for k = 1, ..., K − 1.

Figure 3: The penalized local scoring algorithm for SMALR.

The following lemma results from straightforward calculation.

Lemma 5. The solution f that maximizes the right hand side of (10) is equivalent to the solution that minimizes (1/2) E[ ‖Z − Af‖² ], where A = (−B)^{1/2} and Z = A^{−1}( Y − p_f̂ ) + A f̂.

Recalling that f^(k) = α^(k) + Σ_{j=1}^p f_j^(k), equation (9) and Lemma 5 then justify the use of the auxiliary functional

    (1/2) Σ_{k=1}^{K−1} E[ ( Z'^(k) − α^(k) − Σ_{j=1}^p f_j^(k)(X_j) )² ] + λ' Φ_{K−1}(f)     (11)

where Z'^(k) = 4( Y^(k) − P_f̂(Y^(k) = 1 | X) ) + α̂^(k) + Σ_{j=1}^p f̂_j^(k)(X_j) and λ' = √2 λ. This is precisely in the form of a multi-response SpAM optimization problem in equation (3). The resulting
The resulting\nalgorithm, in the \ufb01nite sample case, is shown in Figure 3.\n\n(Xj) and \u03bb\n\np\nj=1\n\nj\n\n\u2032(cid:8)K\u22121(f)\n\nj=1 f (k)(Xj)\n\n+b\u03b1(k) +\n\n+ \u03bb\n\nbf (k)\n\n\u221a\n\n\u2032 =\n\n(11)\n\n\u2032(k) \u2212\u2211\n)\nY (k) \u2212 Pbf (Y (k) = 1| X)\n\n1\n2\n\nk=1\n\nE\n\nZ\n\np\n\n(\n\n)2\n\u2211\n\n4 Experiments\n\nIn this section, we \ufb01rst use simulated data to investigate the performance of the MT-SpAM simulta-\nneous sparse back\ufb01tting algorithm. We then apply SMALR to a tumor classi\ufb01cation and biomarker\nidenti\ufb01cation problem. In all experiments, the data are rescaled to lie in the p-dimensional cube\n[0, 1]p. We use local linear smoothing with a Gaussian kernel. To choose the regularization param-\neter \u03bb, we simply use J-fold cross-validation or the GCV score from (Ravikumar et al., 2007) ex-\n))/(n2K 2\u2212(nK)df(\u03bb))2\ntended to the multi-task setting: GCV(\u03bb) =\n) is the effective degrees\nwhere df(\u03bb) =\nof freedom for the univariate local linear smoother on the jth variable.\n\n\u2211\n\u2211\n)\n\u2225n \u0338= 0\n\nQbf (k)(x(k)\n, y(k)\nj = trace(S (k)\n\n\u2225bf (k)\n\nj=1 \u03bd(k)\nj I\n\n, and \u03bd(k)\n\n\u2211\n\n\u2211\n\n(\n\nK\nk=1\n\nK\nk=1\n\nn\ni=1\n\np\n\nj\n\nj\n\ni\n\ni\n\n4.1 Synthetic Data\n\n\u2211\n\nj\n\nj\n\n4\n\n(x(k)\n\ni =\n\nj=1 f (k)\n\nij ) + \u03f5(k)\n\n, k = 1, 2, 3, where \u03f5(k)\n\nWe generated n = 100 observations from a 10-dimensional three-task additive model with four\n\u223c N (0, 1); the com-\nrelevant variables: y(k)\nponent functions f (k)\nare plotted in Figure 4. The 10-dimensional covariates are generated as\nX (k)\nj = (W (k)\n10 and U (k) are i.i.d. sampled\nfrom Uniform(\u22122.5, 2.5). Thus, the correlation between Xj and Xj\u2032 is t2/(1 + t2) for j \u0338= j\nThe results of applying MT-SpAM with the bandwidths h = (0.08, . . . , 0.08) and regularization\nparameter \u03bb = 0.25 are summarized in Figure 4. 
            Variable selection accuracy        Estimation accuracy: MSE (sd)
  Model     t=0    t=1    t=2    t=3           t=0          t=1          t=2          t=3
  MR-SpAM   89     80     47     37            7.43 (0.71)  5.82 (0.60)  3.83 (0.37)  3.07 (0.30)
  MARS      0      0      0      0             8.66 (0.78)  7.52 (0.61)  5.36 (0.40)  4.64 (0.35)

Figure 4: (Top) Estimated vs. true functions from MT-SpAM; (Middle) Regularization paths using MT-SpAM; (Bottom) Quantitative comparison between MR-SpAM and MARS.

The upper 12 figures show the 12 relevant component functions for the three tasks; the estimated function components are plotted as solid black lines and the true function components are plotted using dashed red lines. For all the other variables (from dimension 5 to dimension 10), both the true and estimated components are zero. The middle three figures show regularization paths as the parameter λ varies; each curve is a plot of the maximum empirical L1 norm of the component functions for each variable, with the red vertical line representing the selected model using the GCV score. As the correlation increases (t increases), the separation between the relevant dimensions and the irrelevant dimensions becomes smaller. Using the same setup but with one common design matrix, we also compare the quantitative performance of MR-SpAM with MARS (Friedman, 1991), which is a popular method for multi-response additive regression. Using 100 simulations, the table illustrates the number of times the models are correctly identified and the mean squared error, with the standard deviation in the parentheses.
(The MARS\nsimulations are carried out in R, using the default options of the mars function in the mda library.)\n\n4.2 Gene Microarray Data\n\nHere we apply the sparse multi-category additive logistic regression model to a microarray dataset\nfor small round blue cell tumors (SRBCT) (Khan et al., 2001). The data consist of expression\npro\ufb01les of 2,308 genes (Khan et al., 2001) with tumors classi\ufb01ed into 4 categories: neuroblastoma\n(NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL), and the Ewing family of tumors\n(EWS). The dataset includes a training set of size 63 and a test set of size 20. These data have\nbeen analyzed by different groups. The main purpose is to identify important biomarkers, which\nare a small set of genes that can accurately predict the type of tumor of a patient. To achieve 100%\naccuracy on the test data, Khan et al. (2001) use an arti\ufb01cial neural network approach to identify 96\ngenes. Tibshirani et al. (2002) identify a set of only 43 genes, using a method called nearest shrunken\ncentroids. Zhang et al. (2008) identify 53 genes using the sup-norm support vector machine.\nIn our experiment, SMALR achieves 100% prediction accuracy on the test data with only 20 genes,\nwhich is a much smaller set of predictors than identi\ufb01ed in the previous approaches. We follow\nthe same procedure as in (Zhang et al., 2008), and use a very simple screening step based on the\nmarginal correlation to \ufb01rst reduce the number of genes to 500. The SMALR model is then trained\nusing a plugin bandwidth h0 = 0.08, and the regularization parameter \u03bb is tuned using 4-fold cross\nvalidation. The results are tabulated in Figure 5. In the left \ufb01gure, we show a \u201cheat map\u201d of the\nselected variables on the training set. The rows represent the selected genes, with their cDNA chip\nimage id. 
The patients are grouped into four categories according to the corresponding tumors, as illustrated in the vertical groupings. The genes are ordered by hierarchical clustering of their expression profiles. The heatmap clearly shows four block structures for the four tumor categories. This suggests visually that the 20 genes selected are highly informative of the tumor type. In the middle of Figure 5, we plot the fitted discriminant functions of different genes, with their image ids listed on the plot. The values k = 1, 2, 3 under each subfigure indicate which discriminant function the plot represents. We see that the fitted functions are nonlinear. The right subfigure illustrates the total number of misclassified samples using 4-fold cross validation; the counts at the λ values 0.3 and 0.4 are both zero, and for the purpose of a sparser biomarker set we choose λ = 0.4. Interestingly, only 10 of the 20 identified genes from our method are among the 43 genes selected using the shrunken centroids approach of Tibshirani et al. (2002); 16 of them are among the 96 genes selected by the neural network approach of Khan et al. (2001).

Figure 5: SMALR results on gene data: heat map (left), marginal fits (center), and CV score (right).

This non-overlap may merit further investigation.

5 Discussion and Acknowledgements

We have presented new approaches to fitting sparse nonparametric multi-task regression models and sparse multi-category classification models. Due to space constraints, we have not discussed results on the statistical properties of these methods, such as oracle inequalities and risk consistency; these theoretical results will be reported elsewhere. This research was supported in part by NSF grant CCF-0625879.

References

FORNASIER, M. and RAUHUT, H. (2008). Recovery algorithms for vector valued data with joint sparsity constraints. SIAM Journal of Numerical Analysis 46 577–613.

FRIEDMAN, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics 19 1–67.

KHAN, J., WEI, J. S., RINGNER, M., SAAL, L. H., LADANYI, M., WESTERMANN, F., BERTHOLD, F., SCHWAB, M., ANTONESCU, C. R., PETERSON, C. and MELTZER, P. S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7 673–679.

KRISHNAPURAM, B., CARIN, L., FIGUEIREDO, M. and HARTEMINK, A. (2005). Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 957–968.

RAVIKUMAR, P., LIU, H., LAFFERTY, J. and WASSERMAN, L. (2007). SpAM: Sparse additive models. In Advances in Neural Information Processing Systems 20. MIT Press.

ROCKAFELLAR, R. T. and WETS, R. J.-B. (1998). Variational Analysis. Springer-Verlag Inc.

TIBSHIRANI, R., HASTIE, T., NARASIMHAN, B. and CHU, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U.S.A. 99 6567–6572.

TROPP, J., GILBERT, A. C. and STRAUSS, M. J. (2006). Algorithms for simultaneous sparse approximation. Part II: Convex relaxation.
Signal Processing 86 572–588.

TURLACH, B., VENABLES, W. N. and WRIGHT, S. J. (2005). Simultaneous variable selection. Technometrics 47 349–363.

ZHANG, H. H., LIU, Y., WU, Y. and ZHU, J. (2008). Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics 2 149–167.

ZHANG, J. (2006). A probabilistic framework for multitask learning. Tech. Rep. CMU-LTI-06-006, Ph.D. thesis, Carnegie Mellon University.
", "award": [], "sourceid": 329, "authors": [{"given_name": "Han", "family_name": "Liu", "institution": null}, {"given_name": "Larry", "family_name": "Wasserman", "institution": null}, {"given_name": "John", "family_name": "Lafferty", "institution": null}]}