{"title": "Group Additive Structure Identification for Kernel Nonparametric Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 4907, "page_last": 4916, "abstract": "The additive model is one of the most popularly used models for high dimensional nonparametric regression analysis. However, its main drawback is that it neglects possible interactions between predictor variables. In this paper, we reexamine the group additive model proposed in the literature, and rigorously define the intrinsic group additive structure for the relationship between the response variable $Y$ and the predictor vector $\\vect{X}$, and further develop an effective structure-penalized kernel method for simultaneous identification of the intrinsic group additive structure and nonparametric function estimation. The method utilizes a novel complexity measure we derive for group additive structures. We show that the proposed method is consistent in identifying the intrinsic group additive structure. Simulation study and real data applications demonstrate the effectiveness of the proposed method as a general tool for high dimensional nonparametric regression.", "full_text": "Group Additive Structure Identi\ufb01cation for Kernel\n\nNonparametric Regression\n\nPan Chao\n\nDepartment of Statistics\n\nPurdue University\n\nWest Lafayette, IN 47906\npanchao25@gmail.com\n\nMichael Zhu\n\nDepartment of Statistics, Purdue University\n\nWest Lafayette, IN 47906\n\nCenter for Statistical Science\n\nDepartment of Industrial Engineering\nTsinghua University, Beijing, China\n\nyuzhu@purdue.edu\n\nAbstract\n\nThe additive model is one of the most popularly used models for high dimensional\nnonparametric regression analysis. However, its main drawback is that it neglects\npossible interactions between predictor variables. 
In this paper, we reexamine the group additive model proposed in the literature, rigorously define the intrinsic group additive structure for the relationship between the response variable $Y$ and the predictor vector $X$, and further develop an effective structure-penalized kernel method for simultaneous identification of the intrinsic group additive structure and nonparametric function estimation. The method utilizes a novel complexity measure we derive for group additive structures. We show that the proposed method is consistent in identifying the intrinsic group additive structure. A simulation study and real data applications demonstrate the effectiveness of the proposed method as a general tool for high dimensional nonparametric regression.\n\n1 Introduction\n\nRegression analysis is widely used to study the relationship between a response variable $Y$ and a vector of predictor variables $X$. Linear and logistic regression are arguably the two most widely used regression tools in practice, and both postulate explicit parametric models on $f(X) = E[Y|X]$ as a function of $X$. When no parametric model can be imposed, nonparametric regression analysis can be performed instead. On one hand, nonparametric regression analysis is flexible and not susceptible to model mis-specification; on the other hand, it suffers from a number of well-known drawbacks, especially in high dimensional settings. Firstly, the asymptotic error rate of nonparametric regression deteriorates quickly as the dimension of $X$ increases. [16] shows that, under some regularity conditions, the optimal asymptotic error rate for estimating a $d$-times differentiable function is $O(n^{-d/(2d+p)})$, where $p$ is the dimensionality of $X$.\n
Secondly, the resulting fitted nonparametric function is often complicated and difficult to interpret.\nTo overcome the drawbacks of high dimensional nonparametric regression, one popular approach is to impose the additive structure [5] on $f(X)$, that is, to assume that $f(X) = f_1(X_1) + \\cdots + f_p(X_p)$, where $f_1, \\ldots, f_p$ are $p$ unspecified univariate functions. Thanks to the additive structure, the nonparametric estimation of $f$, or equivalently of the individual $f_i$ for $1 \\le i \\le p$, becomes efficient and does not suffer from the curse of dimensionality. Furthermore, the interpretability of the resulting model is much improved.\nThe key drawback of the additive model is that it does not allow for interactions between the predictor variables. To address this limitation, functional ANOVA models were proposed to accommodate higher order interactions; see [4] and [13]. For example, by neglecting interactions of order higher than 2, the functional ANOVA model can be written as $f(X) = \\sum_{i=1}^{p} f_i(X_i) + \\sum_{1 \\le i,j \\le p} f_{ij}(X_i, X_j)$, with some marginal constraints. Another higher order interaction model, $f(X) = \\sum_{d=1}^{D} \\sum_{1 \\le i_1,\\ldots,i_d \\le p} f_j(X_{i_1}, \\ldots, X_{i_d})$, is proposed by [6]. This model considers all interactions of order up to $D$ and is estimated by Kernel Ridge Regression (KRR) [10] with the elementary symmetric polynomial (ESP) kernel. Both models allow for possible interactions between any two or more predictor variables. This can lead to a serious problem: the number of nonparametric functions that need to be estimated increases quickly as the number of predictor variables increases.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThere exists another direction to generalize the additive model.\n
When proposing the Optimal Kernel Group Transformation (OKGT) method for nonparametric regression, [11] considers the additive structure of predictor variables in groups instead of individual predictor variables. Let $G := \\{u_j\\}_{j=1}^{d}$ be an index partition of the predictor variables, that is, $u_j \\cap u_k = \\emptyset$ if $j \\ne k$ and $\\cup_{j=1}^{d} u_j = \\{1, \\ldots, p\\}$. Let $X_{u_j} = \\{X_k; k \\in u_j\\}$ for $j = 1, \\ldots, d$. Then $\\{X_1, \\ldots, X_p\\} = X_{u_1} \\cup \\cdots \\cup X_{u_d}$. For any function $f(X)$, if there exists an index partition $G = \\{u_1, \\ldots, u_d\\}$ such that\n\n$$f(X) = f_{u_1}(X_{u_1}) + \\cdots + f_{u_d}(X_{u_d}), \\qquad (1)$$\n\nwhere $f_{u_1}(X_{u_1}), \\ldots, f_{u_d}(X_{u_d})$ are $d$ unspecified nonparametric functions, then it is said that $f(X)$ admits the group additive structure $G$. We also refer to (1) as a group additive model for $f(X)$. It is clear that the usual additive model is a special case with $G = \\{(1), \\ldots, (p)\\}$.\nSuppose $X_{j_1}$ and $X_{j_2}$ are two predictor variables. Intuitively, if $X_{j_1}$ and $X_{j_2}$ interact with each other, then they must appear in the same group in a reasonable group additive structure of $f(X)$. On the other hand, if $X_{j_1}$ and $X_{j_2}$ belong to two different groups, then they do not interact with each other. Therefore, in terms of accommodating interactions, the group additive model can be considered as lying in the middle between the original additive model and the functional ANOVA or higher order interaction models. When the group sizes are small, for example all less than or equal to 3, the group additive model can maintain the estimation efficiency and interpretability of the original additive model while avoiding the problem of a high order model discussed earlier.\nHowever, two important issues are not addressed in [11]. First, the group additive structure may not be unique, which leads to a nonidentifiability problem for the group additive model. (See the discussion in Section 2.1).\n
Second, [11] has not proposed a systematic approach to identify the group additive structure. In this paper, we intend to resolve these two issues. To address the first issue, we rigorously define the intrinsic group additive structure for any square integrable function, which in some sense is the minimal group additive structure among all correct group additive structures.\nTo address the second issue, we propose a general approach to simultaneously identifying the intrinsic group additive structure and estimating the nonparametric functions using kernel methods and Reproducing Kernel Hilbert Spaces (RKHSs). For a given group additive structure $G = \\{u_1, \\ldots, u_d\\}$, we first define the corresponding direct sum RKHS as $\\mathcal{H}_G = \\mathcal{H}_{u_1} \\oplus \\cdots \\oplus \\mathcal{H}_{u_d}$, where $\\mathcal{H}_{u_j}$ is the usual RKHS for the variables in $u_j$ only, for $j = 1, \\ldots, d$. Based on the results on the capacity measure of RKHSs in the literature, we derive a tractable capacity measure of the direct sum RKHS $\\mathcal{H}_G$, which is further used as the complexity measure of $G$. Then, the identification of the intrinsic group additive structure and the estimation of the nonparametric functions can be performed through the following minimization problem\n\n$$\\hat{f}, \\hat{G} = \\arg\\min_{f \\in \\mathcal{H}_G,\\, G} \\frac{1}{n} \\sum_{i=1}^{n} (y_i - f(x_i))^2 + \\lambda \\|f\\|^2_{\\mathcal{H}_G} + \\mu C(G).$$\n\nWe show that when the novel complexity measure of group additive structure $C(G)$ is used, the minimizer $\\hat{G}$ is consistent for the intrinsic group additive structure. We further develop two algorithms, one using exhaustive search and the other a stepwise approach, for identifying true additive group structures under the small $p$ and large $p$ scenarios, respectively. An extensive simulation study and real data applications show that our proposed method can successfully recover the true additive group structures in a variety of model settings.\nThere exists a connection between our proposed group additive model and graphical models ([2], [7]). This is especially true when a sparse block structure is imposed [9]. However, a key difference exists. Consider the following example: $Y = \\sin(X_1 + X_2^2 + X_3) + \\cos(X_4 + X_5 + X_6^2) + \\epsilon$. A graphical model typically considers the conditional dependence (CD) structure among all of the variables including $X_1, \\ldots, X_6$ and $Y$, which is more complex than the group additive (GA) structure $\\{(X_1, X_2, X_3), (X_4, X_5, X_6)\\}$. The CD structure, once known, can be further examined to infer the GA structure. In this paper, however, we propose methods that directly target the GA structure instead of the more complex CD structure.\nThe rest of the paper is organized as follows. In Section 2, we rigorously formulate the problem of Group Additive Structure Identification (GASI) for nonparametric regression and propose the structural penalty method to solve the problem. In Section 3, we prove the selection consistency of the method. We report the experimental results based on simulation studies and a real data application in Sections 4 and 5. Section 6 concludes this paper with a discussion.\n\n2 Method\n\n2.1 Group Additive Structures\n\nIn the Introduction, we discussed that the group additive structure for $f(X)$ may not be unique. Here is an example. Consider the model $Y = 2 + 3X_1 + 1/(1 + X_2^2 + X_3^2) + \\arcsin((X_4 + X_5)/2) + \\epsilon$, where $\\epsilon$ is a zero-mean error independent of $X$. According to our definition, this model admits the group additive structure $G_0 = \\{(1), (2,3), (4,5)\\}$. Let $G_1 = \\{(1,2,3), (4,5)\\}$ and $G_2 = \\{(1,4,5), (2,3)\\}$. The model can also be said to admit $G_1$ and $G_2$. However, there exists a major difference between $G_0$, $G_1$ and $G_2$. While the groups in $G_0$ cannot be further divided into subgroups, both $G_1$ and $G_2$ contain groups that can be further split. We define the following partial order between group structures to characterize the difference and their relationship.\nDefinition 1. Let $G$ and $G'$ be two group additive structures. If for every group $u \\in G$ there is a group $v \\in G'$ such that $u \\subseteq v$, then $G$ is called a sub group additive structure of $G'$. This relation is denoted as $G \\le G'$. Equivalently, $G'$ is a super group additive structure of $G$, denoted as $G' \\ge G$.\nIn the previous example, $G_0$ is a sub group additive structure of both $G_1$ and $G_2$. However, the order between $G_1$ and $G_2$ is not defined. Let $\\mathcal{X} := [0,1]^p$ be the $p$-dimensional unit cube for all the predictor variables $X$ with distribution $P_X$. For a group of predictor variables $u$, we define the space of square integrable functions as $L^2_u(\\mathcal{X}) := \\{g \\in L^2_{P_X}(\\mathcal{X}) \\mid g(X) = f_u(X_u)\\}$, that is, $L^2_u$ contains the functions that only depend on the variables in group $u$. Then the group additive model $f(X) = \\sum_{j=1}^{d} f_{u_j}(X_{u_j})$ is a member of the direct sum function space defined as $L^2_G(\\mathcal{X}) := \\oplus_{u \\in G} L^2_u(\\mathcal{X})$. Let $|u|$ be the cardinality of the group $u$. If $u$ is the only group in a group additive structure and $|u| = p$, then $L^2_u = L^2_G$ and $f$ is a fully non-parametric function.\nThe following proposition shows that the order of two different group additive structures is preserved by their corresponding square integrable function spaces.\nProposition 1. Let $G_1$ and $G_2$ be two group additive structures. If $G_1 \\le G_2$, then $L^2_{G_1} \\subseteq L^2_{G_2}$. Furthermore, if $X_1, \\ldots, X_p$ are independent and $G_1 \\ne G_2$, then $L^2_{G_1} \\subset L^2_{G_2}$.\nDefinition 2. Let $f(X)$ be a square integrable function. For a group additive structure $G$, if there is a function $f_G \\in L^2_G$ such that $f_G = f$, then $G$ is called an amiable group additive structure for $f$.\nIn the example discussed at the beginning of the subsection, $G_0$, $G_1$ and $G_2$ are all amiable group structures. So amiable group structures may not be unique.\nProposition 2. Suppose $G$ is an amiable group additive structure for $f$. If there is a second group additive structure $G'$ such that $G \\le G'$, then $G'$ is also amiable for $f$.\nWe denote the collection of all amiable group structures for $f(X)$ as $\\mathcal{G}_a$, which is partially ordered and complete. Therefore, there exists a minimal group additive structure in $\\mathcal{G}_a$, which is the most concise group additive structure for the target function. We state this result as a theorem.\nTheorem 1. Let $\\mathcal{G}_a$ be the set of amiable group additive structures for $f$. There is a unique minimal group additive structure $G^* \\in \\mathcal{G}_a$ such that $G^* \\le G$ for all $G \\in \\mathcal{G}_a$, where the order is given by Definition 1. $G^*$ is called the intrinsic group additive structure for $f$.\nFor statistical modeling, $G^*$ achieves the greatest dimension reduction for the relationship between $Y$ and $X$. It induces the smallest function space which includes the model. In general, the intrinsic group structure can substantially mitigate the curse of dimensionality while improving both the efficiency and the interpretability of high dimensional nonparametric regression.\n\n2.2 Kernel Method with Known Intrinsic Group Additive Structure\n\nSuppose the intrinsic group additive structure for $f(X) = E[Y|X]$ is known to be $G^* = \\{u_j\\}_{j=1}^{d}$, that is, $f(X) = f_{u_1}(X_{u_1}) + \\cdots + f_{u_d}(X_{u_d})$. Then, we will use the kernel method to estimate the functions $f_{u_1}, f_{u_2}, \\ldots, f_{u_d}$. Let $(K_{u_j}, \\mathcal{H}_{u_j})$ be the kernel and its corresponding RKHS for the $j$-th group $u_j$. The kernel method then solves\n\n$$\\hat{f}_{\\lambda,G^*} = \\arg\\min_{f_{G^*} \\in \\mathcal{H}_{G^*}} \\left\\{ \\frac{1}{n} \\sum_{i=1}^{n} (y_i - f_{G^*}(x_i))^2 + \\lambda \\|f_{G^*}\\|^2_{\\mathcal{H}_{G^*}} \\right\\}, \\qquad (2)$$\n\nwhere $\\mathcal{H}_{G^*} := \\{f = \\sum_{j=1}^{d} f_{u_j} \\mid f_{u_j} \\in \\mathcal{H}_{u_j}\\}$. The subscripts of $\\hat{f}$ are used to explicitly indicate its dependence on the group additive structure $G^*$ and the tuning parameter $\\lambda$.\nIn general, an RKHS is usually smaller than the $L^2$ space defined on the same input domain. So, it is not always true that $\\hat{f}_{\\lambda,G^*}$ achieves $f$. However, one can choose to use universal kernels $K_{u_j}$ so that their corresponding RKHSs are dense in the $L^2$ spaces (see [3], [15]). Using universal kernels allows $\\hat{f}_{\\lambda,G^*}$ to not only achieve unbiasedness but also recover the group additive structure of $f(X)$. This is the fundamental reason for the consistency property of our proposed method to identify the intrinsic group additive structure. Two examples of universal kernels are the Gaussian and Laplace kernels.\n\n2.3 Identification of Unknown Intrinsic Group Additive Structure\n\n2.3.1 Penalization on Group Additive Structures\n\nThe success of the kernel method hinges on knowing the intrinsic group additive structure $G^*$. In practice, however, $G^*$ is seldom known, and it may be of primary interest to identify $G^*$ while estimating a group additive model. Recall that in Subsection 2.1, we have shown that $G^*$ exists and is unique. The other group additive structures belong to two categories, amiable and non-amiable. Let us first consider an arbitrary non-amiable group additive structure $G \\in \\mathcal{G} \\setminus \\mathcal{G}_a$. If $G$ is used in the place of $G^*$ in (2), the solution $\\hat{f}_{\\lambda,G}$, as an estimator of $f$, will have a systematic bias because the $L^2$ distance between any function $f_G \\in \\mathcal{H}_G$ and the true function $f$ will be bounded below.\n
In other words, using a non-amiable structure will result in poor fitting of the model.\nNext we consider an arbitrary amiable group additive structure $G \\in \\mathcal{G}_a$ to be used in (2). Recall that because $G$ is amiable, we have $f_{G^*} = f_G$ almost surely (in population) for the true function $f(X)$. The bias of the resulting fitted function $\\hat{f}_{\\lambda,G}$ will vanish as the sample size increases. Although their asymptotic rates are in general different, under a fixed sample size $n$, simply using goodness of fit will not be able to distinguish $G$ from $G^*$. The key difference between $G^*$ and $G$ is their structural complexity, that is, $G^*$ is the smallest among all amiable structures (i.e. $G^* \\le G$, $\\forall G \\in \\mathcal{G}_a$). Suppose a proper complexity measure of a group additive structure $G$ can be defined (to be addressed in the next section) and is denoted as $C(G)$. We can then incorporate $C(G)$ into (2) as an additional penalty term and change the kernel method to the following structure-penalized kernel method.\n\n$$\\hat{f}_{\\lambda,\\mu}, \\hat{G} = \\arg\\min_{f_G \\in \\mathcal{H}_G,\\, G} \\left\\{ \\frac{1}{n} \\sum_{i=1}^{n} (y_i - f_G(x_i))^2 + \\lambda \\|f_G\\|^2_{\\mathcal{H}_G} + \\mu C(G) \\right\\}. \\qquad (3)$$\n\nIt is clear that the only difference between (2) and (3) is the term $\\mu C(G)$. As discussed above, the intrinsic group additive structure $G^*$ can simultaneously achieve a good fit, as measured by the first two terms in (3), and a small structural complexity penalty, as measured by the last term. Therefore, by properly choosing the tuning parameters, we expect that $\\hat{G}$ is consistent in that the probability of $\\hat{G} = G^*$ increases to one as $n$ increases (see the Theory Section below). In the next section, we derive a tractable complexity measure for a group additive structure.\n\n2.3.2 Complexity Measure of Group Additive Structure\n\nIt is tempting to propose an intuitive complexity measure $C(\\cdot)$ for a group additive structure such that $C(G_1) \\le C(G_2)$ whenever $G_1 \\le G_2$. The intuition however breaks down, or at least becomes less clear, when the order between $G_1$ and $G_2$ cannot be defined. From Proposition 1, it is known that when $G_1 < G_2$, we have $L^2_{G_1} \\subset L^2_{G_2}$. It is not difficult to show that when $G_1 < G_2$, we also have $\\mathcal{H}_{K,G_1} \\subset \\mathcal{H}_{K,G_2}$. This observation motivates us to define the structural complexity measure of $G$ through the capacity measure of its corresponding RKHS $\\mathcal{H}_G$.\nThere exist a number of different capacity measures for RKHSs in the literature, including entropy [18], VC dimensions [17], Rademacher complexity [1], and covering numbers ([14], [18]). After investigating and comparing different measures, we use the covering number to design a practically convenient complexity measure for group additive structures.\nIt is known that an RKHS $\\mathcal{H}_K$ can be embedded in the continuous function space $C(\\mathcal{X})$ (see [12], [18]), with the inclusion mapping denoted as $I_K : \\mathcal{H}_K \\to C(\\mathcal{X})$. Let $\\mathcal{H}_{K,r} = \\{h \\in \\mathcal{H}_K : \\|h\\|_{\\mathcal{H}_K} \\le r\\}$ be an $r$-ball in $\\mathcal{H}_K$ and $\\overline{I(\\mathcal{H}_{K,r})}$ be the closure of $I(\\mathcal{H}_{K,r}) \\subseteq C(\\mathcal{X})$. One way to measure the capacity of $\\mathcal{H}_K$ is through the covering number of $\\overline{I(\\mathcal{H}_{K,r})}$ in $C(\\mathcal{X})$, denoted as $\\mathcal{N}(\\epsilon, \\overline{I(\\mathcal{H}_{K,r})}, d_\\infty)$, which is the smallest cardinality of a subset $S$ of $C(\\mathcal{X})$ such that $\\overline{I(\\mathcal{H}_{K,r})} \\subset \\cup_{s \\in S} \\{t \\in C(\\mathcal{X}) : d_\\infty(t, s) \\le \\epsilon\\}$. Here $\\epsilon$ is any small positive value and $d_\\infty$ is the sup-norm.\nThe exact formula for $\\mathcal{N}(\\epsilon, \\overline{I(\\mathcal{H}_{K,r})}, d_\\infty)$ is in general not available. Under certain conditions, various upper bounds have been obtained in the literature.\n
One such upper bound is presented below.\nWhen $K$ is a convolution kernel, i.e. $K(x, t) = k(x - t)$, and the Fourier transform of $k$ decays exponentially, it is shown in [18] that\n\n$$\\ln \\mathcal{N}\\left(\\epsilon, \\overline{I(\\mathcal{H}_{K,r})}, d_\\infty\\right) \\le C_{k,p} \\left(\\ln \\frac{r}{\\epsilon}\\right)^{p+1}, \\qquad (4)$$\n\nwhere $C_{k,p}$ is a constant depending on the kernel function $k$ and the input dimension $p$. In particular, when $K$ is a Gaussian kernel, [18] has obtained more elaborate upper bounds.\nThe upper bound in (4) depends on $r$ and $\\epsilon$ through $\\ln(r/\\epsilon)$. When $\\epsilon \\to 0$ with $r$ being fixed (e.g. $r = 1$ when a unit ball is considered), $(\\ln(r/\\epsilon))^{p+1}$ dominates the upper bound. According to [8], the growth rate of $\\mathcal{N}(\\epsilon, I_K)$ or its logarithm can be viewed as a capacity measure of the RKHS. So we use $(\\ln(r/\\epsilon))^{p+1}$ as the capacity measure, which can be reparameterized as $\\alpha^{p+1}$ with $\\alpha = \\ln(r/\\epsilon)$. Let $C(\\mathcal{H}_K)$ denote the capacity measure of $\\mathcal{H}_K$, which is defined as $C(\\mathcal{H}_K) = (\\ln(r/\\epsilon))^{p+1} = \\alpha(\\epsilon)^{p+1}$. Here $\\epsilon$ is the radius of a covering ball, which is the unit of measurement we use to quantify the capacity. The capacities of two RKHSs with different input dimensions are easier to distinguish when $\\epsilon$ is small. This gives an interpretation of $\\alpha$.\nWe have defined a capacity measure for a general RKHS. In Problem (3), the model space $\\mathcal{H}_G$ is a direct sum of a number of RKHSs. Let $G = \\{u_1, \\ldots, u_d\\}$; let $\\mathcal{H}_G, \\mathcal{H}_{u_1}, \\ldots, \\mathcal{H}_{u_d}$ be the RKHSs corresponding to $G, u_1, \\ldots, u_d$, respectively; let $I_G, I_{u_1}, \\ldots, I_{u_d}$ be the inclusion mappings of $\\mathcal{H}_G, \\mathcal{H}_{u_1}, \\ldots, \\mathcal{H}_{u_d}$ into $C(\\mathcal{X})$. Then, we have the following proposition.\nProposition 3. Let $G$ be a group additive structure and $\\mathcal{H}_G$ be the induced direct sum RKHS defined in (3).\n
Then, we have the following inequality relating the covering number of $\\mathcal{H}_G$ and the covering numbers of $\\mathcal{H}_{u_j}$\n\n$$\\ln \\mathcal{N}(\\epsilon, I_G, d_\\infty) \\le \\sum_{j=1}^{d} \\ln \\mathcal{N}\\left(\\frac{\\epsilon}{|G|}, I_{u_j}, d_\\infty\\right), \\qquad (5)$$\n\nwhere $|G|$ denotes the number of groups in $G$.\nBy applying Proposition 3 and using the parameterized upper bound, we have $\\ln \\mathcal{N}(\\epsilon, I_G, d_\\infty) = O\\left(\\sum_{u \\in G} \\alpha(\\epsilon)^{|u|+1}\\right)$. The rate can be used as the explicit expression of the complexity measure for group additive structures, that is, $C(G) = \\sum_{j=1}^{d} \\alpha(\\epsilon)^{|u_j|+1}$. Recall that there is another tuning parameter $\\mu$ which controls the effect of the complexity of the group structure on the penalized risk. By absorbing into $\\mu$ the common factor $\\alpha$ that arises from the 1 in the exponent, we can further simplify the expression of the penalty. Thus, we have the following explicit formulation for GASI\n\n$$\\hat{f}_{\\lambda,\\mu}, \\hat{G} = \\arg\\min_{f_G \\in \\mathcal{H}_G,\\, G} \\left\\{ \\frac{1}{n} \\sum_{i=1}^{n} (y_i - f_G(x_i))^2 + \\lambda \\|f_G\\|^2_{\\mathcal{H}_G} + \\mu \\sum_{j=1}^{d} \\alpha^{|u_j|} \\right\\}. \\qquad (6)$$\n\n2.4 Estimation\n\nWe assume that the value of $\\lambda$ is given. In practice, $\\lambda$ can be tuned separately.\n
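\nTo make the penalty in (6) concrete, the following worked arithmetic evaluates $\\sum_{j} \\alpha^{|u_j|}$ for the five-variable example of Section 2.1. The values $\\alpha = 2$ and $\\alpha = 3$ are chosen here only for illustration and are not taken from the experiments; each line lists the penalty at $\\alpha = 2$ and then at $\\alpha = 3$.\n\n$$\\begin{aligned} G_0 = \\{(1),(2,3),(4,5)\\}: &\\quad \\alpha + 2\\alpha^2 = 10 \\text{ and } 21, \\\\ G_1 = \\{(1,2,3),(4,5)\\}: &\\quad \\alpha^3 + \\alpha^2 = 12 \\text{ and } 36, \\\\ G_2 = \\{(1,4,5),(2,3)\\}: &\\quad \\alpha^3 + \\alpha^2 = 12 \\text{ and } 36, \\\\ \\{(1,2,3,4,5)\\}: &\\quad \\alpha^5 = 32 \\text{ and } 243, \\\\ \\{(1),(2),(3),(4),(5)\\}: &\\quad 5\\alpha = 10 \\text{ and } 15. \\end{aligned}$$\n\nAmong the structures listed that are amiable for this model ($G_0$, $G_1$, $G_2$ and the single-group structure), $G_0$ carries the smallest penalty. The fully additive structure has a penalty that is no larger than that of $G_0$, but it is not amiable for this model, so it has to be ruled out by the goodness-of-fit terms in (6) rather than by the penalty.\n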
If the values of $\\mu$ and $\\alpha$ are also given, Problem (6) can be solved by following a two-step procedure.\nFirst, when the group structure $G$ is known, $f_u$ can be estimated by solving the following problem\n\n$$\\hat{R}^{\\lambda}_{G} = \\min_{f_G \\in \\mathcal{H}_G} \\left\\{ \\frac{1}{n} \\sum_{i=1}^{n} (y_i - f_G(x_i))^2 + \\lambda \\|f_G\\|^2_{\\mathcal{H}_G} \\right\\}. \\qquad (7)$$\n\nSecond, the optimal group structure is chosen to achieve both small fitting error and small complexity, i.e.\n\n$$\\hat{G} = \\arg\\min_{G \\in \\mathcal{G}} \\left\\{ \\hat{R}^{\\lambda}_{G} + \\mu \\sum_{j=1}^{d} \\alpha^{|u_j|} \\right\\}. \\qquad (8)$$\n\nThe two-step procedure above is expected to identify the intrinsic group structure, that is, $\\hat{G} = G^*$. Recall that a group structure belongs to one of three categories: intrinsic, amiable, or non-amiable. If $G$ is non-amiable, then $\\hat{R}^{\\lambda}_{G}$ is expected to be large, because $G$ is a wrong structure which will result in a biased estimate. If $G$ is amiable, though $\\hat{R}^{\\lambda}_{G}$ is expected to be small, the complexity penalty of $G$ is larger than that of $G^*$. As a consequence, only $G^*$ can simultaneously achieve a small $\\hat{R}^{\\lambda}_{G^*}$ and a relatively small complexity. Therefore, when the sample size is large enough, we expect $\\hat{G} = G^*$ with high probability. If the values of $\\mu$ and $\\alpha$ are not given, a separate validation set can be used to select the tuning parameters.\nThe two-step estimation is summarized in Algorithm 1. When a model contains a large number of predictor variables, such an exhaustive search suffers from high computational cost.\n
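\nFor a sense of scale, the candidate set $\\mathcal{G}$ consists of all partitions of $\\{1, \\ldots, p\\}$, and the number of such partitions is the Bell number $B_p$. This is a standard combinatorial fact added here for illustration; it is not stated in the original text. The Bell numbers satisfy the recurrence below, which gives the listed values.\n\n$$B_0 = 1, \\qquad B_{n+1} = \\sum_{k=0}^{n} \\binom{n}{k} B_k, \\qquad B_6 = 203, \\quad B_{10} = 115975, \\quad B_{13} = 27644437.$$\n\nWith six predictor variables, as in the simulation models of Section 4, an exhaustive search therefore compares 203 candidate structures. With 13 predictor variables, as in the Boston Housing data of Section 5, it would have to compare more than 27 million.\n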
In order to apply GASI to a large model, we propose a backward stepwise algorithm, which is given in Algorithm 2.\n\nAlgorithm 1: Exhaustive Search w/ Validation\n1: Split data into training ($T$) and validation ($V$) sets.\n2: for $(\\mu, \\alpha)$ in grid do\n3:   for $G \\in \\mathcal{G}$ do\n4:     $\\hat{R}_G, \\hat{f}_G \\leftarrow$ solve (7) using $G$;\n5:     Calculate the sum in (8), denoted by $\\hat{R}^{pen,\\mu,\\alpha}_{G}$;\n6:   end for\n7:   $\\hat{G}^{\\mu,\\alpha} \\leftarrow \\arg\\min_{G \\in \\mathcal{G}} \\hat{R}^{pen,\\mu,\\alpha}_{G}$;\n8:   $\\hat{y}_V \\leftarrow \\hat{f}_{\\hat{G}^{\\mu,\\alpha}}(x_V)$;\n9:   $e^2_{\\hat{G}^{\\mu,\\alpha}} \\leftarrow \\|y_V - \\hat{y}_V\\|^2$;\n10: end for\n11: $\\mu^*, \\alpha^* \\leftarrow \\arg\\min_{\\mu,\\alpha} e^2_{\\hat{G}^{\\mu,\\alpha}}$;\n12: $G^* \\leftarrow \\hat{G}^{\\mu^*,\\alpha^*}$;\n\nAlgorithm 2: Basic Backward Stepwise\n1: Start with the group structure $\\{(1, \\ldots, p)\\}$;\n2: Solve (6) and obtain its minimum value $\\hat{R}^{pen}_{G}$;\n3: for each predictor variable $j$ do\n4:   $G' \\leftarrow$ either split $j$ as a new group or add to an existing group;\n5:   Solve (6) and obtain its minimum value $\\hat{R}^{pen}_{G'}$;\n6:   if $\\hat{R}^{pen}_{G'} < \\hat{R}^{pen}_{G}$ then\n7:     Keep $G'$ as the new group structure;\n8:   end if\n9: end for\n10: return $G'$;\n\n3 Theory\n\nIn this section, we prove that the estimated group additive structure $\\hat{G}$ as a solution to (6) is consistent, that is, the probability $P(\\hat{G} = G^*)$ goes to 1 as the sample size $n$ goes to infinity. The proof and supporting lemmas are included in the supplementary material.\nLet $R(f) = E[(Y - f(X))^2]$ denote the population risk of a function $f \\in \\mathcal{H}_G$, and let $\\hat{R}(f) = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - f(x_i))^2$ be the empirical risk.\n
First, we show that for any amiable structure $G \\in \\mathcal{G}_a$, its minimized empirical risk $\\hat{R}(\\hat{f}_G)$ converges in probability to the optimal population risk $R(f^*_{G^*})$ achieved by the intrinsic group additive structure. Here $\\hat{f}_G$ denotes the minimizer of Problem (7) with the given $G$, and $f^*_{G^*}$ denotes the minimizer of the population risk when the intrinsic group structure is used. The result is given below as Proposition 4.\nProposition 4. Let $G^*$ be the intrinsic group additive structure, $G \\in \\mathcal{G}_a$ a given amiable group structure, and $\\mathcal{H}_{G^*}$ and $\\mathcal{H}_G$ the respective direct sum RKHSs. If $\\hat{f}^{\\lambda}_{G} \\in \\mathcal{H}_G$ is the optimal solution of Problem (7), then for any $\\epsilon > 0$, we have\n\n$$P\\left(|\\hat{R}(\\hat{f}_G) - R(f^*_{G^*})| > \\epsilon\\right) \\le 12n \\cdot \\exp\\left\\{ \\sum_{u \\in G} \\ln \\mathcal{N}\\left(\\frac{\\epsilon}{12|G|}, \\mathcal{H}_u, d_\\infty\\right) - \\frac{\\epsilon^2 n}{144} \\right\\} + 12n \\cdot \\exp\\left\\{ \\sum_{u \\in G} \\ln \\mathcal{N}\\left(\\frac{\\epsilon}{12|G|}, \\mathcal{H}_u, d_\\infty\\right) - n \\left(\\frac{\\epsilon}{24} - \\frac{\\lambda_n \\|f^*_{G^*}\\|^2}{12}\\right)^2 \\right\\}. \\qquad (9)$$\n\nNote that $\\lambda_n$ in (9) must be chosen such that $\\epsilon/24 - \\lambda_n \\|f^*_{G^*}\\|^2/12$ is positive. For a fixed $p$, the number of amiable group additive structures is finite. Using a Bonferroni type of technique, we can in fact obtain a uniform upper bound for all of the amiable group additive structures in $\\mathcal{G}_a$.\nTheorem 2. Let $\\mathcal{G}_a$ be the set of all amiable group structures. For any $\\epsilon > 0$ and $n > 2/\\epsilon^2$, we have\n\n$$P\\left\\{ \\sup_{G \\in \\mathcal{G}_a} |\\hat{R}_g(\\hat{f}^{\\lambda}_{G}) - R_g(f^*_{G^*})| > \\epsilon \\right\\} \\le 12n|\\mathcal{G}_a| \\cdot \\left[ \\exp\\left\\{ \\max_{G \\in \\mathcal{G}_a} \\ln \\mathcal{N}\\left(\\frac{\\epsilon}{12}, \\mathcal{H}_G, d_\\infty\\right) - \\frac{\\epsilon^2 n}{144} \\right\\} + \\exp\\left\\{ \\max_{G \\in \\mathcal{G}_a} \\ln \\mathcal{N}\\left(\\frac{\\epsilon}{12}, \\mathcal{H}_G, d_\\infty\\right) - n \\left(\\frac{\\epsilon}{24} - \\frac{\\lambda_n \\|f^*_{G^*}\\|^2}{12}\\right)^2 \\right\\} \\right]. \\qquad (10)$$\n\nNext we consider a non-amiable group additive structure $G \\in \\mathcal{G} \\setminus \\mathcal{G}_a$. It turns out that $\\hat{R}(\\hat{f}_G)$ fails to converge to $R(f^*_{G^*})$, and $|\\hat{R}(\\hat{f}_G) - R(f^*_{G^*})|$ converges to a positive constant. Furthermore, because the number of non-amiable group additive structures is finite, we can show that $|\\hat{R}(\\hat{f}_G) - R(f^*_{G^*})|$ is uniformly bounded away from zero with probability going to 1. We state the results below.\nTheorem 3. (i) For a non-amiable group structure $G \\in \\mathcal{G} \\setminus \\mathcal{G}_a$, there exists a constant $C > 0$ such that $|\\hat{R}_g(\\hat{f}^{\\lambda}_{G}) - R_g(f^*_{G^*})|$ converges to $C$ in probability. (ii) There exists a constant $\\tilde{C}$ such that $P(|\\hat{R}_g(\\hat{f}^{\\lambda}_{G}) - R_g(f^*_{G^*})| > \\tilde{C} \\text{ for all } G \\in \\mathcal{G} \\setminus \\mathcal{G}_a)$ goes to 1 as $n$ goes to infinity.\nBy combining Theorem 2 and Theorem 3, we can prove consistency for our GASI method.\nTheorem 4. Let $\\lambda_n * n \\to 0$. By choosing a proper tuning parameter $\\mu > 0$ for the structural penalty, the estimated group structure $\\hat{G}$ is consistent for the intrinsic group additive structure $G^*$, that is, $P(\\hat{G} = G^*)$ goes to one as the sample size $n$ goes to infinity.\n\n4 Simulation\n\nIn this section, we evaluate the performance of GASI using synthetic data. Table 1 lists the five models we are using. Observations of $X$ are simulated independently from $N(0, 1)$ in M1, $\\mathrm{Unif}(-1, 1)$ in M2 and M3, and $\\mathrm{Unif}(0, 2)$ in M4 and M5. The noise $\\epsilon$ is i.i.d. $N(0, 0.01^2)$. The grid values of $\\mu$ are equally spaced in [1e-10, 1/64] on a log-scale and each $\\alpha$ is an integer in [1, 10].\n\nID | Model | Intrinsic Group Structure\nM1 | $y = 2x_1 + x_2^2 + x_3^3 + \\sin(\\pi x_4) + \\log(x_5 + 5) + |x_6| + \\epsilon$ | $\\{(1), (2), (3), (4), (5), (6)\\}$\nM2 | $y = \\frac{1}{1 + x_1^2} + \\arcsin\\left(\\frac{x_2 + x_3}{2}\\right) + \\arctan\\left((x_4 + x_5 + x_6)^3\\right) + \\epsilon$ | $\\{(1), (2,3), (4,5,6)\\}$\nM3 | $y = \\arcsin\\left(\\frac{x_1 + x_3}{2}\\right) + \\frac{1}{1 + x_2^2} + \\arctan\\left((x_4 + x_5 + x_6)^3\\right) + \\epsilon$ | $\\{(1,3), (2), (4,5,6)\\}$\nM4 | $y = x_1 \\cdot x_2 + \\sin((x_3 + x_4) \\cdot \\pi) + \\log(x_5 \\cdot x_6 + 10) + \\epsilon$ | $\\{(1,2), (3,4), (5,6)\\}$\nM5 | $y = \\exp\\left\\{\\sqrt{x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2 + x_6^2}\\right\\} + \\epsilon$ | $\\{(1,2,3,4,5,6)\\}$\n\nTable 1: Selected models for the simulation study using the exhaustive search method and the corresponding additive group structures.\n\nWe first show that GASI has the ability to identify the intrinsic group additive structure. The two-step procedure is carried out for each $(\\mu, \\alpha)$ pair multiple times. If, for each model, there are $(\\mu, \\alpha)$ pairs for which the true group structure is often identified, then GASI has the power to identify true group structures. We also apply Algorithm 1, which uses an additional validation set to select the parameters. We simulate 100 different samples for each model.\n
The frequency of the true group structure being\nidentified is calculated for each (\u00b5, \u03b1).\n\n        Exhaustive, no tuning         Exhaustive, with tuning       Stepwise\nModel   Max freq.  \u00b5           \u03b1      Max freq.  \u00b5           \u03b1      Max freq.  \u00b5           \u03b1\nM1      100        1.2500e-06  10     59         1.2500e-06  4      99         1.5625e-02  10\nM2      97         1.2500e-06  8      89         1.2500e-06  7      70         1.3975e-04  9\nM3      97         1.2500e-06  9      89         1.2500e-06  7      65         1.3975e-04  8\nM4      100        1.2500e-06  7      99         1.2500e-06  4      1          1.3975e-04  8\nM5      100        1.2500e-06  1      100        1.2500e-06  1      100        1.2500e-06  1\n\nTable 2: Maximum frequencies with which the intrinsic group additive structures are identified for the five models\nusing the exhaustive search algorithm without parameter tuning (left panel), with parameter tuning (middle panel),\nand the stepwise algorithm (right panel). If different pairs share the same max frequency, a pair is randomly chosen.\n\nFigure 1: Estimated transformation functions for selected groups. Top-left: group (1, 6), top-right:\ngroup (3), bottom-left: group (5, 8), bottom-right: group (10, 12).\n\nIn Table 2, we report the maximum frequency and the corresponding (\u00b5, \u03b1) for each model. The\ncomplete results are included in the supplementary material. It can be seen from the left panel that\nthe intrinsic group additive structures can be successfully identified. When the parameters are tuned,\nthe middle panel shows that the performance on Model 1 deteriorates. This might be caused by the\nestimation method (KRR to solve Problem (7)) used in the algorithm. It could also be affected by \u03bb.\nWhen the number of predictor variables increases, we use a backward stepwise algorithm. We apply\nAlgorithm 2 to the same models. The results are reported in the right panel of Table 2. The true\ngroup structures are identified most of the time for Models 1, 2, 3, and 5. The result for Model 4 is not\nsatisfactory. Since the stepwise algorithm is greedy, it is possible that the true group structures were never\nvisited. 
Further research is needed to develop better algorithms.\n\n5 Real Data\n\nIn this section, we report the results of applying GASI to the Boston Housing data (another real\ndata application is reported in the supplementary material). The data include 13 predictor variables\nused to predict the median house value. The sample size is 506. Our goal is to identify a probable\ngroup additive structure for the predictor variables. The backward algorithm is used and the tuning\nparameters \u00b5 and \u03b1 are selected via 10-fold CV. The group structure that achieves the lowest average\nvalidation error is {(1, 6), (2, 11), (3), (4, 9), (5, 8), (7, 13), (10, 12)}, which is used for further\ninvestigation. The nonparametric functions for each group are then estimated using the whole\ndata set. Because each group contains no more than two variables, the estimated functions can be\nvisualized. Figure 1 shows the selected results.\nIt is interesting to see some patterns emerging in the plots. For example, the top-left plot shows the\nfunction of the average number of rooms per dwelling and per capita crime rate by town. We can see\nthat the house value increases with more rooms and decreases as the crime rate increases. However, when\nthe crime rate is low, smaller houses (4 or 5 rooms) seem to be preferred. The top-right plot\nshows that there is a change point in how house value is related to the size of non-retail\nbusiness in the area. The value initially drops when the percentage of non-retail business is small,\nthen increases at around 8%. The increase in value might be due to the high demand for housing\nfrom the employees of those businesses.\n\n6 Discussion\n\nWe use the group additive model for nonparametric regression and propose an approach based on an RKHS\ncomplexity penalty for identifying the intrinsic group additive structure. There are two main directions\nfor future research. 
First, our penalty function is based on the covering number of RKHSs. It is of\ninterest to know whether there exist other, more effective penalty functions. Second, it is of great interest\nto further improve the proposed method and apply it in general high dimensional nonparametric\nregression.\n\nReferences\n[1] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463\u2013482, 2003.\n\n[2] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.\n\n[3] C. Carmeli, E. De Vito, A. Toigo, and V. Umanit\u00e1. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 8(01):19\u201361, 2010.\n\n[4] C. Gu. Smoothing spline ANOVA models, volume 297. Springer Science & Business Media, 2013.\n\n[5] T. Hastie and R. Tibshirani. Generalized additive models. Statistical Science, pages 297\u2013310, 1986.\n\n[6] K. Kandasamy and Y. Yu. Additive approximations in high dimensional nonparametric regression via the salsa. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML\u201916, pages 69\u201378. JMLR.org, 2016.\n\n[7] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009.\n\n[8] T. K\u00fchn. Covering numbers of Gaussian reproducing kernel Hilbert spaces. Journal of Complexity, 27(5):489\u2013499, 2011.\n\n[9] B. M. Marlin and K. P. Murphy. Sparse Gaussian graphical models with unknown block structure. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML \u201909, pages 705\u2013712, New York, NY, USA, 2009. ACM.\n\n[10] K. P. Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.\n\n[11] C. Pan, Q. Huang, and M. Zhu. 
Optimal kernel group transformation for exploratory regression analysis and graphics. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 905\u2013914. ACM, 2015.\n\n[12] T. Poggio and C. Shelton. On the mathematical foundations of learning. American Mathematical Society, 39(1):1\u201349, 2002.\n\n[13] J. O. Ramsay and B. W. Silverman. Applied functional data analysis: methods and case studies, volume 77. Citeseer, 2002.\n\n[14] A. J. Smola and B. Sch\u00f6lkopf. Learning with kernels. Citeseer, 1998.\n\n[15] I. Steinwart and A. Christmann. Support vector machines. Springer Science & Business Media, 2008.\n\n[16] C. J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, pages 1040\u20131053, 1982.\n\n[17] V. Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013.\n\n[18] D.-X. Zhou. The covering number in learning theory. Journal of Complexity, 18(3):739\u2013767, 2002.", "award": [], "sourceid": 2539, "authors": [{"given_name": "Chao", "family_name": "Pan", "institution": "Purdue University"}, {"given_name": "Michael", "family_name": "Zhu", "institution": "Purdue University"}]}