{"title": "Directed Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 889, "page_last": 897, "abstract": "When used to guide decisions, linear regression analysis typically involves estimation of regression coefficients via ordinary least squares and their subsequent use to make decisions. When there are multiple response variables and features do not perfectly capture their relationships, it is beneficial to account for the decision objective when computing regression coefficients. Empirical optimization does so but sacrifices performance when features are well-chosen or training data are insufficient. We propose directed regression, an efficient algorithm that combines merits of ordinary least squares and empirical optimization. We demonstrate through a computational study that directed regression can generate significant performance gains over either alternative. We also develop a theory that motivates the algorithm.", "full_text": "Directed Regression\n\nYi-hao Kao\n\nStanford University\nStanford, CA 94305\n\nBenjamin Van Roy\nStanford University\nStanford, CA 94305\n\nXiang Yan\n\nStanford University\nStanford, CA 94305\n\nyihaokao@stanford.edu\n\nbvr@stanford.edu\n\nxyan@stanford.edu\n\nAbstract\n\nWhen used to guide decisions, linear regression analysis typically involves esti-\nmation of regression coef\ufb01cients via ordinary least squares and their subsequent\nuse to make decisions. When there are multiple response variables and features\ndo not perfectly capture their relationships, it is bene\ufb01cial to account for the de-\ncision objective when computing regression coef\ufb01cients. Empirical optimization\ndoes so but sacri\ufb01ces performance when features are well-chosen or training data\nare insuf\ufb01cient. We propose directed regression, an ef\ufb01cient algorithm that com-\nbines merits of ordinary least squares and empirical optimization. 
We demonstrate\nthrough a computational study that directed regression can generate signi\ufb01cant\nperformance gains over either alternative. We also develop a theory that motivates\nthe algorithm.\n\n1 Introduction\n\nWhen used to guide decision-making, linear regression analysis typically treats estimation of re-\ngression coef\ufb01cients separately from their use to make decisions. In particular, estimation is carried\nout via ordinary least squares (OLS) without consideration of the decision objective. The regression\ncoef\ufb01cients are then used to optimize decisions.\nWhen there are multiple response variables and features do not perfectly capture their relationships,\nit is bene\ufb01cial to account for the decision objective when computing regression coef\ufb01cients. Im-\nperfections in feature selection are common since it is dif\ufb01cult to identify the right features and the\nnumber of features is typically restricted in order to avoid over-\ufb01tting.\nEmpirical optimization (EO) is an alternative to OLS which selects coef\ufb01cients that minimize em-\npirical loss in the training data. Though it accounts for the decision objective when computing\nregression coef\ufb01cients, EO sacri\ufb01ces performance when features are well-chosen or training data is\ninsuf\ufb01cient.\nIn this paper, we propose a new algorithm \u2013 directed regression (DR) \u2013 which is a hybrid between\nOLS and EO. DR selects coef\ufb01cients that are a convex combination of those that would be se-\nlected by OLS and those by EO. The weights of OLS and EO coef\ufb01cients are optimized via cross-\nvalidation.\nWe study DR for the case of decision problems with quadratic objective functions. 
The algorithm takes as input a training set of data pairs, each consisting of feature vectors and response variables, together with a quadratic loss function that depends on decision variables and response variables. Regression coefficients are computed for subsequent use in decision-making. Each future decision depends on newly sampled feature vectors and is made prior to observing response variables, with the goal of minimizing expected loss.

We present computational results demonstrating that DR can substantially outperform both OLS and EO. These results are for synthetic problems with regression models that include subsets of relevant features. In some cases, OLS and EO deliver comparable performance while DR reduces expected loss by about 20%. In none of the cases considered does either OLS or EO outperform DR.

We also develop a theory that motivates DR. This theory is based on a model in which selected features do not perfectly capture relationships among response variables. We prove that, for this model, the optimal vector of coefficients is a convex combination of those that would be generated by OLS and EO.

2 Linear Regression for Decision-Making

Suppose we are given a set of training data pairs O = {(x^(1), y^(1)), . . . , (x^(N), y^(N))}. Each nth data pair is comprised of feature vectors x_1^(n), . . . , x_K^(n) ∈ R^M and a vector y^(n) ∈ R^M of response variables. We would like to compute regression coefficients r ∈ R^K so that, given a data pair (x, y), the linear combination Σ_k r_k x_k of feature vectors estimates the expectation of y conditioned on x. We restrict attention to cases where M > 1, with special interest in problems where M is large, because it is in such situations that DR offers the largest performance gains.

We consider a setting where the regression model is used to guide future decisions. 
In particular, after computing regression coefficients, each time we observe feature vectors x_1, . . . , x_K we will have to select a decision u ∈ R^L before observing the response vector y. The choice incurs a loss

ℓ(u, y) = u⊤ G1 u + u⊤ G2 y,

where the matrices G1 ∈ R^{L×L} and G2 ∈ R^{L×M} are known, and the former is positive definite and symmetric. We aim to minimize expected loss, assuming that the conditional expectation of y given x is Σ_{k=1}^K r_k x_k. As such, given x and r, we select a decision

u_r(x) = argmin_u ℓ(u, Σ_{k=1}^K r_k x_k) = -(1/2) G1^{-1} G2 Σ_{k=1}^K r_k x_k.

The question is how best to compute the regression coefficients r for this purpose. To motivate the setting we have described, we offer a hypothetical application.

Example 1. Consider an Internet banner ad campaign that targets M classes of customers. An average revenue of y_m is received per customer of class m that the campaign reaches. This quantity is random and influenced by K observable factors x_{1m}, . . . , x_{Km}. These factors may be correlated across customer classes; for example, they could capture customer preferences as they relate to ad content or how current economic conditions affect customers. For each mth class, the cost of reaching the u_m-th customer increases with u_m because ads are first targeted at customers that can be reached at lower cost. This cost is quadratic, so that we pay γ_m u_m^2 to reach u_m customers, where γ_m is a known constant.

The application we have described fits our general problem context. It is natural to predict the response vector y using a linear combination Σ_k r_k x_k of factors with the regression coefficients r_k computed based on past observations O = {(x^(1), y^(1)), . . . , (x^(N), y^(N))}. 
The goal is to maximize expected revenue less advertising costs. This gives rise to a loss function that is quadratic in u and y:

ℓ(u, y) = Σ_{m=1}^M (γ_m u_m^2 − u_m y_m).

One might ask why not construct M separate linear regression models, one for each response variable, each with a separate set of K coefficients. The reason is that this gives rise to MK coefficients; when M is large and data is limited, this could lead to over-fitting. Models of the sort we consider, where regression coefficients are shared across multiple response variables, are sometimes referred to as general linear models and have seen a wide range of applications [7, 8]. It is well-known that the quality of results is highly sensitive to the choice of features, even more so than for models involving a single response variable [7].

3 Algorithms

Ordinary least squares (OLS) is a conventional approach to computing regression coefficients. This would produce a coefficient vector

rOLS = argmin_{r ∈ R^K} Σ_{n=1}^N ‖ y^(n) − Σ_{k=1}^K r_k x_k^(n) ‖^2.   (1)

Note that OLS does not take the decision objective into account when computing regression coefficients. Empirical optimization (EO), as studied for example in [2, 6], offers an alternative that does so. This approach minimizes empirical loss on the training data:

rEO = argmin_{r ∈ R^K} Σ_{n=1}^N ℓ(u_r(x^(n)), y^(n)).   (2)

Note that EO does not explicitly aim to estimate the conditional expectation of the response vector. Instead it focuses on the decision loss that would be incurred with the training data. 
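As a concrete illustration, here is a minimal numpy sketch of both estimators. The helper names (fit_ols, fit_eo, decision) and the convention of stacking the K feature vectors as columns of an M×K matrix are ours, not notation from the paper; this is a sketch, not the authors' implementation.

```python
import numpy as np

def decision(r, Xn, G1, G2):
    # u_r(x) = -1/2 G1^{-1} G2 (sum_k r_k x_k); the K feature vectors are the
    # columns of Xn (M x K), so sum_k r_k x_k = Xn @ r.
    return -0.5 * np.linalg.solve(G1, G2 @ (Xn @ r))

def fit_ols(Xs, ys):
    # Equation (1): stack all N samples into one least-squares problem.
    A = np.vstack(Xs)            # shape (N*M, K)
    b = np.concatenate(ys)       # shape (N*M,)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def fit_eo(Xs, ys, G1, G2):
    # Equation (2): substituting u_r into the quadratic loss gives a convex
    # quadratic in r; setting its gradient to zero yields normal equations
    # with the weight matrix W = G2^T G1^{-1} G2.
    W = G2.T @ np.linalg.solve(G1, G2)
    K = Xs[0].shape[1]
    lhs, rhs = np.zeros((K, K)), np.zeros(K)
    for Xn, yn in zip(Xs, ys):
        lhs += Xn.T @ W @ Xn
        rhs += Xn.T @ W @ yn
    return np.linalg.solve(lhs, rhs)
```

When the responses are exactly linear in the observed features, both estimators recover the same coefficients; they differ precisely when features are imperfect, which is the regime DR targets.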
Both rOLS and rEO can be computed efficiently by minimizing convex quadratic functions.

As we will see in our computational and theoretical analyses, OLS and EO can be viewed as two extremes, each offering room for improvement. In this paper, we propose an alternative algorithm – directed regression (DR) – which produces a convex combination rDR = (1 − λ) rOLS + λ rEO of coefficients computed by OLS and EO. The term directed is chosen to indicate that DR is influenced by the decision objective though, unlike EO, it does not simply minimize empirical loss. The parameter λ ∈ [0, 1] is computed via cross-validation, with an objective of minimizing average loss on validation data. Average loss is a convex quadratic function of λ, and therefore can be easily minimized over λ ∈ [0, 1].

DR is designed to generate decisions that are more robust to imperfections in feature selection than OLS. As such, DR addresses issues similar to those that have motivated work in data-driven robust optimization, as surveyed in [3]. Our focus on making good decisions despite modeling inaccuracies also complements recent work that studies how models deployed in practice can generate effective decisions despite their failure to pass basic statistical tests [4].

4 Computational Results

In this section, we present results from applying OLS, EO, and DR to synthetic data. To generate a data set, we first sample parameters of a generative model as follows:

1. Sample P matrices C1, . . . , CP ∈ R^{M×Q}, with each entry of each matrix drawn independently from N(0, 1).
2. Sample a vector r̃ ∈ R^P from N(0, I).
3. Sample Ga ∈ R^{L×L} and Gb ∈ R^{L×M}, with each entry of each matrix drawn from N(0, 1). Let G1 = Ga⊤ Ga and G2 = Ga⊤ Gb.

Given generative model parameters C1, . . .
, CP and r̃, we sample each training data pair (x^(n), y^(n)) as follows:

1. Sample a vector φ^(n) ∈ R^Q from N(0, I) and a vector w^(n) ∈ R^M from N(0, σ_w^2 I).
2. Let y^(n) = Σ_{i=1}^P r̃_i C_i φ^(n) + w^(n).
3. For each k = 1, 2, . . . , K, let x_k^(n) = C_k φ^(n).

The vector φ^(n) can be viewed as a sample from an underlying information space. The matrices C1, . . . , CP extract feature vectors from φ^(n). Note that, though response variables depend on P feature vectors, only K ≤ P are used in the regression model.

Given generative model parameters and a coefficient vector r ∈ R^K, it is easy to evaluate the expected loss ℓ(r) = E_{x,y}[ℓ(u_r(x), y)]. It is also easy to evaluate the minimal expected loss ℓ* = min_r E_{x,y}[ℓ(u_r(x), y)]. We will assess each algorithm in terms of the excess loss ℓ(r) − ℓ* delivered by the coefficient vector r that the algorithm computes. Excess loss is nonnegative, and this allows us to make comparisons in percentage terms.

Figure 1: (a) Excess losses delivered by OLS, EO, and DR, for different numbers N of training samples. (b) Excess losses delivered by OLS, EO, and DR, using different numbers K of the 60 features.

We carried out two sets of experiments to compare the performance of OLS, EO, and DR. In the first set, we let M = 15, L = 15, P = 60, Q = 20, σ_w = 5, and K = 50. For each N ∈ {10, 15, 20, 30, 50}, we ran 100 trials, each with an independently sampled generative model and training data set. In each trial, each algorithm computes a coefficient vector given the training data and loss function. With DR, λ is selected via leave-one-out cross-validation when N ≤ 20, and via 5-fold cross-validation when N > 20. Figure 1(a) plots excess losses averaged over trials. 
Note that the excess loss incurred by DR is never larger than that of OLS or EO. Further, when N = 20, the excess losses of OLS and EO are both around 20% larger than that of DR. For small N, OLS is as effective as DR, while EO becomes as effective as DR as N grows large.

In the second set of experiments, we use the same parameter values as in the first set, except we fix N = 20 and consider use of K ∈ {45, 50, 55, 58, 60} feature vectors. Again, we ran 100 trials for each K, applying the three algorithms as in the first set of experiments. Figure 1(b) plots excess losses averaged over trials. Note that when K = 55, DR delivers excess loss around 20% less than EO and OLS. When K = P = 60, there are no missing features and OLS matches the performance of DR.

Figure 2 plots the values of λ selected by cross-validation, each averaged over the 100 trials, as a function of N and K. As the number of training samples N grows, so does λ, indicating that DR is weighted more heavily toward EO. As the number of feature vectors K grows, λ diminishes, indicating that DR is weighted more heavily toward OLS.

5 Theoretical Analysis

In this section, we formulate a generative model for the training data and future observations. For this model, optimal coefficients are convex combinations of rOLS and rEO. As such, our model and analysis motivate the use of DR.

5.1 Model

We now describe a generative model that samples the training data set, as well as "missing features," and a representative future observation. We then formulate an optimization problem where the objective is to minimize expected loss on the future observation conditioned on the training data and missing features. It may seem strange to condition on missing features since in practice they are unavailable when computing regression coefficients. 
However, we will later establish that optimal coefficients are convex combinations of rOLS and rEO, each of which can be computed without observing missing features. Since directed regression searches over these convex combinations, it should approximate what would be generated by a hypothetical algorithm that observes missing features.

Figure 2: (a) The average values of selected λ, for different numbers N of training samples. (b) The average values of selected λ, using different numbers K of the 60 features.

We will assume that each feature, whether observed or missing, is a linear function of an "information vector" drawn from R^Q. Specifically, the N training data samples depend on information vectors φ^(1), . . . , φ^(N) ∈ R^Q. A linear function mapping an information vector to a feature vector can be represented by a matrix in R^{M×Q}, and to describe our generative model, it is useful to define an inner product for such matrices. In particular, we define the inner product between matrices A and B by

⟨A, B⟩ = (1/N) Σ_{n=1}^N (A φ^(n))⊤ (B φ^(n)).

Our generative model takes several parameters as input. First, there are the number of samples N, the number of response variables M, and the number of feature vectors K. Second, a parameter µ_Q specifies the expected dimension of the information vector. Finally, there are standard deviations σ_r, σ_⊥, and σ_w of observed feature coefficients, missing feature coefficients, and noise, respectively. Given parameters N, M, K, µ_Q, σ_r, σ_⊥, and σ_w, the generative model produces data as follows:

1. Sample Q from the geometric distribution with mean µ_Q.
2. Sample φ^(1), . . .
, φ^(N) ∈ R^Q from N(0, I_Q).
3. Sample C1, . . . , CK and D1, . . . , DJ ∈ R^{M×Q} with each entry i.i.d. from N(0, 1), where K + J = MQ.
4. Apply the Gram-Schmidt algorithm with respect to the inner product defined above to generate an orthonormal basis C̃1, . . . , C̃K, D̃1, . . . , D̃J from the sequence C1, . . . , CK, D1, . . . , DJ.
5. Sample r* ∈ R^K from N(0, σ_r^2 I_K) and r⊥* ∈ R^J from N(0, σ_⊥^2 I_J).
6. For n = 1, . . . , N, sample w^(n) ∈ R^M from N(0, σ_w^2 I_M), and let

x^(n) = [ C1 φ^(n) · · · CK φ^(n) ],   (3)
z^(n) = [ D̃1 φ^(n) · · · D̃J φ^(n) ],   (4)
y^(n) = Σ_{k=1}^K r*_k x_k^(n) + Σ_{j=1}^J r⊥*_j z_j^(n) + w^(n).   (5)

7. Sample φ̃ uniformly from {φ^(1), . . . , φ^(N)} and w̃ ∈ R^M from N(0, σ_w^2 I_M). Generate x̃, z̃, and ỹ by the same functions in (3), (4), and (5).

The samples z^(1), . . . , z^(N), z̃ represent missing features. The Gram-Schmidt procedure ensures two properties. First, since ⟨Ck, D̃j⟩ = 0, missing features are uncorrelated with observed features. If this were not the case, observed features would provide information about missing features. Second, since D̃1, . . . , D̃J are orthonormal, the distribution of missing features is invariant to rotations in the J-dimensional subspace from which they are drawn. 
In other words, all directions in that space are equally likely.

We define an augmented training set O = {(x^(1), z^(1), y^(1)), . . . , (x^(N), z^(N), y^(N))} and consider selecting regression coefficients r̂ ∈ R^K that solve

min_{r ∈ R^K} E[ℓ(u_r(x̃), ỹ) | O].

Note that the probability distribution here is implicitly defined by our generative model, and as such, r̂ may depend on N, M, K, µ_Q, σ_r, σ_⊥, σ_w, and O.

5.2 Optimal Solutions

Our primary interest is in cases where prior knowledge about the coefficients r* is weak and does not significantly influence r̂. As such, we will from here on restrict attention to the case where σ_r is asymptotically large. Hence, r̂ will no longer depend on σ_r.

It is helpful to consider two special cases. One is where σ_⊥ = 0 and the other is where σ_⊥ is asymptotically large. We will refer to r̂ in these extreme cases as r̂_0 and r̂_∞. The following theorem establishes that these extremes are delivered by OLS and EO.

Theorem 1. For all N, M, K, µ_Q, σ_w, and O,

r̂_0 = argmin_{r ∈ R^K} Σ_{n=1}^N ‖ y^(n) − Σ_{k=1}^K r_k x_k^(n) ‖^2

and

r̂_∞ = argmin_{r ∈ R^K} Σ_{n=1}^N ℓ(u_r(x^(n)), y^(n)).

Note that σ_⊥ represents the degree of bias in a regression model that assumes there are no missing features. Hence, the above theorem indicates that OLS is optimal when there is no bias, while EO is optimal as the bias becomes asymptotically large. It is also worth noting that the coefficient vectors r̂_0 and r̂_∞ can be computed without observing the missing features, though r̂ is defined by an expectation that is conditioned on their realizations. 
Further, computation of r̂_0 and r̂_∞ does not require knowledge of Q or σ_w.

Our next theorem establishes that the coefficient vector r̂ is always a convex combination of r̂_0 and r̂_∞.

Theorem 2. For all N, M, K, µ_Q, σ_w, σ_⊥, and O,

r̂ = (1 − λ) r̂_0 + λ r̂_∞,

where λ = 1 / (1 + σ_w^2 / (N σ_⊥^2)).

Our two theorems together imply that, with an appropriately selected λ ∈ [0, 1], (1 − λ) rOLS + λ rEO = r̂. This suggests that directed regression, which optimizes λ via cross-validation to generate a coefficient vector rDR = (1 − λ) rOLS + λ rEO, should approximate r̂ well without observing the missing features or requiring knowledge of Q, σ_⊥, or σ_w.

5.3 Interpretation

To develop intuition for our results, we consider an idealized situation where the coefficients r* and r⊥* are provided to us by an oracle. Then the optimal coefficient vector would be

r_O = argmin_{r ∈ R^K} E[ℓ(u_r(x̃), ỹ) | O, r*, r⊥*].

It can be shown that rOLS is a biased estimator of r_O, while rEO is an unbiased one. However, the variance of rOLS is smaller than that of rEO. The optimal tradeoff is indeed captured by the value of λ provided in Theorem 2. In particular, as the number of training samples N increases, variance diminishes and λ approaches 1, placing increasing weight on EO. On the other hand, as the number of observed features K increases, model bias decreases and λ approaches 0, placing increasing weight on OLS. 
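To make the dependence of the Theorem 2 weight on the problem parameters concrete, here is a small sketch; the function names and the particular numeric values of σ_w, σ_⊥, and N are illustrative placeholders of ours, not values from the paper.

```python
def theorem2_weight(N, sigma_w, sigma_perp):
    # lambda = 1 / (1 + sigma_w^2 / (N * sigma_perp^2)).  More training data
    # or more model bias (larger sigma_perp) moves the weight toward EO;
    # more observation noise (larger sigma_w) moves it toward OLS.
    return 1.0 / (1.0 + sigma_w**2 / (N * sigma_perp**2))

def directed_regression(r_ols, r_eo, lam):
    # r_DR = (1 - lambda) * r_OLS + lambda * r_EO, elementwise.
    return [(1.0 - lam) * a + lam * b for a, b in zip(r_ols, r_eo)]
```

For example, with σ_w = 5 and σ_⊥ = 1, the weight grows from about 0.29 at N = 10 to about 0.67 at N = 50, increasing with N as in Figure 2(a).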
Our experimental results demonstrate that the value of \u03bb selected by\ncross-validation exhibits the same behavior.\n\n6 Extensions\n\nThough we only treated linear models and quadratic objective functions, our work suggests that\nthere can be signi\ufb01cant gains in broader problem settings from a tighter coupling between machine\nlearning and decision-making. In particular, machine learning algorithms should factor decision\nobjectives into the learning process.\nIt will be interesting to explore how to do this with other\nclasses of models and objectives.\nOne might argue that feature mis-speci\ufb01cation is not a critical issue in light of effective methods\nfor subset selection. In particular, rather than selecting a few features and facing the consequences\nof model bias, one might select an enormous set of features and apply a method like the lasso [10]\nto identify a small subset. Our view is that even this enormous set will result in model biases that\nmight be ameliorated by generalizations of DR. There is also the concern that data requirements\ngrow with the size of the large feature set, albeit slowly. Understanding how to synthesize DR with\nsubset selection methods is an interesting direction for future research.\nAnother issue that should be explored is the effectiveness of cross-validation in optimizing \u03bb. In\nparticular, it would be helpful to understand how the estimate relates to the ideal value of \u03bb identi\ufb01ed\nby Theorem 2. More general work on the selection of convex combinations of models (e.g., [1, 5])\nmay lend insights to our setting.\nLet us close by mentioning that the ideas behind DR ought to play a role in reinforcement learning\n(RL) as presented in [9]. RL algorithms learn from experience to predict a sum of future rewards\nas a function of a state, typically by \ufb01tting a linear combination of features of the state. This so-\ncalled approximate value function is then used to guide sequential decision-making. 
The problem we addressed in this paper can be viewed as a single-period version of RL, in the sense that each decision incurs an immediate cost but bears no further consequences. It would be interesting to extend our idea to the multi-period case.

Acknowledgments

We thank James Robins for helpful comments and suggestions. The first author is supported by a Stanford Graduate Fellowship. This research was supported in part by the National Science Foundation through grant CMMI-0653876.

Appendix

Proof of Theorem 1. For each n, let x^(n) = [ x_1^(n) · · · x_K^(n) ], z^(n) = [ z_1^(n) · · · z_J^(n) ]. Let X = [ x^(1)⊤ · · · x^(N)⊤ ]⊤, Z = [ z^(1)⊤ · · · z^(N)⊤ ]⊤, Y = [ y^(1)⊤ · · · y^(N)⊤ ]⊤, r̄ = E[r* | O], and r̄⊥ = E[r⊥* | O]. For any matrix V, let V† denote (V⊤V)^{-1} V⊤. Recall that ⟨Ck, D̃j⟩ = 0 for all k, j implies that each column of X is orthogonal to each column of Z. Because r*, r⊥*, and O are jointly Gaussian, as σ_r → ∞ we have

[ r̄ ; r̄⊥ ] = argmin_{(r, r⊥)} (1 / (2 σ_w^2)) Σ_{n=1}^N ‖ y^(n) − Σ_{k=1}^K r_k x_k^(n) − Σ_{j=1}^J r⊥_j z_j^(n) ‖^2 + (1 / (2 σ_⊥^2)) Σ_{j=1}^J r⊥_j^2

= argmin_{(r, r⊥)} ‖ [ Y/σ_w ; 0 ] − [ X/σ_w  Z/σ_w ; 0  I_J/σ_⊥ ] [ r ; r⊥ ] ‖^2

= [ (X⊤X)^{-1} X⊤ Y ; (Z⊤Z + (σ_w^2/σ_⊥^2) I)^{-1} Z⊤ Y ].

Let a^(n) = G1^{-1/2} G2 x^(n), b^(n) = G1^{-1/2} G2 z^(n), A = [ a^(1)⊤ · · · a^(N)⊤ ]⊤, and B = [ b^(1)⊤ · · · b^(N)⊤ ]⊤. We have

r̂ = argmin_r E[ℓ(u_r(x̃), ỹ) | O]
= argmin_r (1/N) Σ_{n=1}^N E_ỹ[ℓ(u_r(x̃), ỹ) | x̃ = x^(n), O]
= argmin_r Σ_{n=1}^N u_r(x^(n))⊤ G1 u_r(x^(n)) + u_r(x^(n))⊤ G2 E[ỹ | x̃ = x^(n), O]
= argmin_r Σ_{n=1}^N (1/4) r⊤ a^(n)⊤ a^(n) r − (1/2) r⊤ a^(n)⊤ (a^(n) r̄ + b^(n) r̄⊥)
= r̄ + A† B r̄⊥ = X† Y + A† B (Z⊤Z + (σ_w^2/σ_⊥^2) I)^{-1} Z⊤ Y.   (6)

Taking σ_⊥ → 0 and σ_⊥ → ∞ yields

r̂_0 = X† Y,   (7)
r̂_∞ = X† Y + A† B Z† Y.   (8)

The first part of the theorem then follows because

r̂_0 = X† Y = argmin_r ‖ Y − X r ‖^2 = argmin_r Σ_{n=1}^N ‖ y^(n) − Σ_{k=1}^K r_k x_k^(n) ‖^2.

We now prove the second part. Note that

argmin_r Σ_{n=1}^N ℓ(u_r(x^(n)), y^(n)) = argmin_r Σ_{n=1}^N u_r(x^(n))⊤ G1 u_r(x^(n)) + u_r(x^(n))⊤ G2 y^(n)
= argmin_r r⊤ A⊤ A r − 2 r⊤ Σ_{n=1}^N h^(n)⊤ y^(n) = (A⊤A)^{-1} H⊤ Y,

where h^(n) = G2⊤ G1^{-1} G2 x^(n) and H = [ h^(1)⊤ · · · h^(N)⊤ ]⊤, so that the kth column of H is h_k = [ (G2⊤ G1^{-1} G2 Ck φ^(1))⊤ · · · (G2⊤ G1^{-1} G2 Ck φ^(N))⊤ ]⊤. Each kth column of H is in span{col X, col Z} because G2⊤ G1^{-1} G2 Ck ∈ R^{M×Q} = span{C1, · · · , CK, D̃1, · · · , D̃J}. Since the residual Y′ = Y − X X† Y − Z Z† Y upon projecting Y onto span{col X, col Z} is orthogonal to the subspace, we have h_k⊤ Y′ = 0 for all k and hence H⊤ Y′ = 0. This implies H⊤ Y = H⊤ X X† Y + H⊤ Z Z† Y. Further, since a^(n)⊤ a^(n) = h^(n)⊤ x^(n) and a^(n)⊤ b^(n) = h^(n)⊤ z^(n) for all n, we have

r̂_∞ = X† Y + A† B Z† Y = (A⊤A)^{-1} ( A⊤A X† Y + A⊤B Z† Y )
= (A⊤A)^{-1} ( H⊤ X X† Y + H⊤ Z Z† Y ) = (A⊤A)^{-1} H⊤ Y.

Proof of Theorem 2. Because ⟨D̃i, D̃j⟩ = 1{i = j}, we have Z⊤Z = N I. Plugging this into (6) and comparing the resultant expression with (7) and (8) yields the desired result.

References

[1] J.-Y. Audibert. Aggregated estimators and empirical complexity for least square regression. Annales de l'Institut Henri Poincare Probability and Statistics, 40(6):685–736, 2004.

[2] P. L. Bartlett and S. Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311–334, 2006.

[3] D. Bertsimas and A. Thiele. 
Robust and data-driven optimization: Modern decision-making under uncertainty. In Tutorials on Operations Research. INFORMS, 2006.

[4] O. Besbes, R. Philips, and A. Zeevi. Testing the validity of a demand model: An operations perspective. 2007.

[5] F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation for Gaussian regression. The Annals of Statistics, 35(4):1674–1697, 2007.

[6] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.

[7] K. Kim and N. Timm. Univariate and Multivariate General Linear Models: Theory and Applications with SAS. Chapman & Hall/CRC, 2006.

[8] K. E. Muller and P. W. Stewart. Linear Model Theory: Univariate, Multivariate, and Mixed Models. Wiley, 2006.

[9] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[10] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
", "award": [], "sourceid": 369, "authors": [{"given_name": "Yi-hao", "family_name": "Kao", "institution": null}, {"given_name": "Benjamin", "family_name": "Roy", "institution": null}, {"given_name": "Xiang", "family_name": "Yan", "institution": null}]}