{"title": "Selecting Optimal Decisions via Distributionally Robust Nearest-Neighbor Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 749, "page_last": 759, "abstract": "This paper develops a prediction-based prescriptive model for optimal decision\nmaking that (i) predicts the outcome under each action using a robust\nnonlinear model, and (ii) adopts a randomized prescriptive policy determined\nby the predicted outcomes. The predictive model combines a new regularized\nregression technique, which was developed using Distributionally Robust\nOptimization (DRO) with an ambiguity set constructed from the Wasserstein\nmetric, with the K-Nearest Neighbors (K-NN) regression, which helps to\ncapture the nonlinearity embedded in the data. We show theoretical results\nthat guarantee the out-of-sample performance of the predictive model, and\nprove the optimality of the randomized policy in terms of the expected true\nfuture outcome. We demonstrate the proposed methodology on a hypertension\ndataset, showing that our prescribed treatment leads to a larger reduction in\nthe systolic blood pressure compared to a series of alternatives. A clinically\nmeaningful threshold level used to activate the randomized policy is also\nderived under a sub-Gaussian assumption on the predicted outcome.", "full_text": "Selecting Optimal Decisions via Distributionally\n\nRobust Nearest-Neighbor Regression\n\nDivision of Systems Engineering\n\nDepartment of Electrical and Computer Engineering\n\nRuidi Chen\n\nBoston University\nBoston, MA 02215\nrchen15@bu.edu\n\nIoannis Ch. 
Paschalidis \u2217\nDivision of Systems Engineering\nand Department of Biomedical Engineering\nBoston University\nBoston, MA 02215\nyannisp@bu.edu\n\nAbstract\n\nThis paper develops a prediction-based prescriptive model for optimal decision making that (i) predicts the outcome under each action using a robust nonlinear model, and (ii) adopts a randomized prescriptive policy determined by the predicted outcomes. The predictive model combines a new regularized regression technique, which was developed using Distributionally Robust Optimization (DRO) with an ambiguity set constructed from the Wasserstein metric, with the K-Nearest Neighbors (K-NN) regression, which helps to capture the nonlinearity embedded in the data. We show theoretical results that guarantee the out-of-sample performance of the predictive model, and prove the optimality of the randomized policy in terms of the expected true future outcome. We demonstrate the proposed methodology on a hypertension dataset, showing that our prescribed treatment leads to a larger reduction in the systolic blood pressure compared to a series of alternatives. A clinically meaningful threshold level used to activate the randomized policy is also derived under a sub-Gaussian assumption on the predicted outcome.\n\n1 Introduction\n\nSuppose we are given a discrete set of available actions [M] ≜ {1, . . . , M}, and our goal is to choose m ∈ [M] such that the future outcome y is optimized. We are interested in finding the optimal decision with the aid of auxiliary data x ∈ R^p that are concurrently observed, and correlated with the uncertain outcome y. A main challenge with learning from observational data lies in the lack of counterfactual information. 
One solution is to estimate/predict the effects of counterfactual policies by learning an action-dependent predictive model that groups the training samples based on their actions, and fits a model in each group between the outcome y and the feature x. The predictions from this composite model can be used to determine the optimal action to take. The performance of the prescribed decision hinges on the quality of the predictive model. We have observed that (i) there are often \u201coutliers\u201d in the data, especially in the medical applications motivating this work, caused by recording errors, missing values, and factors not captured in the data, and (ii) the underlying relationship we try to learn is usually nonlinear and its parametric form is not known a priori. To deal with these issues, a nonparametric robust learning procedure is needed.\nMotivated by the observation that individuals with similar features x would have similar outcomes y if they were to take the same action, we propose a predictive model that makes predictions based on the outcomes of similar individuals/neighbors in each group of the training set. It is a nonlinear and nonparametric estimator which constructs locally linear (constant) curves based on the similarity between individuals. To find reasonable neighbors, we need to accurately identify the set of features that are predictive of the outcome. We propose a regularized regression procedure for this task in consideration of the outliers that could potentially bias the estimation.\n\n\u2217http://sites.bu.edu/paschalidis\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada. 
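The neighbors-based predictor just described can be made concrete with a short sketch. This is an illustration only: the helper name `knn_predict` is ours, and the weighting scheme (squared regression coefficients on each feature) is the one formalized in Sec. 2.

```python
import numpy as np

def knn_predict(x, X_train, y_train, beta_hat, K):
    """Predict the outcome for x as the average outcome of its K nearest
    training neighbors, where the distance weights each feature by the
    squared regression coefficient (more predictive features count more)."""
    w = beta_hat ** 2                             # diagonal feature weights
    d2 = ((X_train - x) ** 2 * w).sum(axis=1)     # weighted squared distances
    nearest = np.argsort(d2)[:K]                  # indices of the K closest samples
    return y_train[nearest].mean()                # uniform average of their outcomes

# Toy usage: the outcome depends only on the first feature, and the
# weights make the neighbor search ignore the two irrelevant features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0]
beta = np.array([2.0, 0.0, 0.0])
pred = knn_predict(np.array([1.0, 5.0, -5.0]), X, y, beta, K=10)
```

With the irrelevant coordinates zero-weighted, the prediction is close to the true value 2.0 even though the query point is far from the data in those coordinates.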
Our prescriptive methodology is established on the basis of a regression-informed K-Nearest Neighbors (K-NN) model [2] that evaluates the importance of features through regularized regression, and estimates the outcome by averaging over the neighbors identified by a regression-coefficients-weighted distance metric.\nThe regularized regression has its root in Distributionally Robust Linear Regression (DRLR) with a Wasserstein metric-based uncertainty set [13]. The K-NN model builds locally linear (and globally nonlinear) predictions using information from neighbors, accounting for the nonlinearity that is not captured by DRLR. Furthermore, it is easy to estimate and efficient to solve. Our framework uses both parametric (DRLR) and nonparametric (K-NN) predictive models, producing robust predictions immunized against outliers and capturing the underlying nonlinearity in the data. It is more information-efficient and more interpretable than the vanilla K-NN. We then develop a randomized prescriptive policy that chooses each action m, whose predicted outcome is ŷ_m(x), with probability e^{−ξ ŷ_m(x)} / Σ_{j=1}^{M} e^{−ξ ŷ_j(x)}, for some pre-specified positive constant ξ. As we will see, this randomized strategy leads to a nearly optimal future outcome by an appropriate choice of ξ.\nIn recent years there has been an emerging interest in combining ideas from machine learning with operations research to develop a framework that uses data to prescribe optimal decisions [4, 16, 10, 22]. Current research has focused on applying machine learning methodologies to predict the counterfactuals, based on which optimal decisions can be made. Local learning methods such as K-NN [2], LOESS (LOcally Estimated Scatterplot Smoothing) [15], CART (Classification And Regression Trees) [12], and Random Forests [11] have been studied [4, 7, 8, 17, 9]. 
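The randomized policy introduced above is a softmax over the negative scaled predicted outcomes. A minimal numerical sketch (the function name is ours):

```python
import numpy as np

def policy_probs(y_hat, xi):
    """Probability of action m: exp(-xi * y_hat[m]) / sum_j exp(-xi * y_hat[j]).
    Shifting by the max before exponentiating is for numerical stability
    only; it does not change the ratio."""
    z = -xi * np.asarray(y_hat, dtype=float)
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

# Lower predicted outcome -> higher selection probability;
# larger xi makes the policy greedier toward the best prediction.
probs = policy_probs([120.0, 130.0, 125.0], xi=0.1)
```

Here the action with predicted outcome 120 gets the largest probability, while the other two retain nonzero probability, which is the exploration behavior the policy is designed for.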
Extensions to continuous and multi-dimensional decision spaces with observational data were considered in [5]. To prevent overfitting, [6] proposed two robust prescriptive methods based on Nadaraya-Watson and nearest-neighbors learning. Deviating from such a predict-optimize paradigm, [3] presented a new bandit algorithm based on the LASSO to learn a model of decision rewards conditional on individual-specific covariates.\nOur method constructs a locally linear estimator of the future outcome through learning a robust metric in the feature space. Different from classical metric learning works (e.g., [20]), we solve a downstream decision-making problem by utilizing the information filtered by the learned metric. [20] focuses on the computational aspect of solving the metric regression problem. By contrast, we focus on developing a novel method for the optimal decision-making problem rather than on improving algorithmic efficiency. Moreover, [20] studies only the regression problem, whereas we consider a richer framework that combines regression with a randomized prescriptive policy.\nOur problem is closely related to contextual bandits [14, 1, 23, 25], where an agent learns a sequence of decisions conditional on the contexts, with the aim of maximizing its cumulative reward. Contextual bandits have recently found applications in learning personalized treatment for chronic diseases from mobile health data [24, 26, 27]. However, in this work, we learn the interaction between the context and rewards in each action group across similar individuals, not over the history of the same individual as in contextual bandits. Contextual bandits are most suitable for learning sequential strategies through repeated interactions with the environment, which requires a substantial amount of historical data for exploring the reward function and exploiting the promising actions. 
By contrast, our method does not require the availability of historical data, but instead learns the payoff function from similar individuals. This can be viewed as a different type of exploration: when little information can be acquired about the past states of an individual, investigating the behavior of similar subjects may be beneficial. This is essential for learning from Electronic Health Records (EHRs) available in the hospital, which do not include frequent patient data. For instance, we may observe a very sparse treatment history for some patients, and the lag between patient visits is usually large.\nOur method is similar to the K-NN regression with an Ordinary Least Squares (OLS)-weighted metric used in [7] to learn the optimal treatment for type-2 diabetic patients. The key differences are that: (i) we adopt a robustified regression procedure that is immunized against outliers and is thus more stable and reliable; (ii) we propose a randomized prescriptive policy that adds robustness to the methodology, whereas [7] deterministically prescribed the treatment with the best predicted outcome; (iii) we establish theoretical guarantees on the quality of the predictions and the prescribed actions; and (iv) the prescriptive rule in [7] was activated when the improvement of the recommended treatment over the standard of care exceeded a certain threshold, whereas our method looks at the improvement over the previous regimen. This distinction makes our algorithm applicable in scenarios where the standard of care is unknown or ambiguous. Further, we derive a closed-form expression for the threshold level, which greatly improves the computational efficiency compared to [7], where the threshold was selected by cross-validation.\nThe remainder of the paper is organized as follows. In Sec. 2, we introduce the DRLR+K-NN model and present the performance guarantees on its predictive power. Sec. 
3 develops the randomized prescriptive policy and proves its optimality in terms of the expected true outcome. Experimental results using real medical (EHR) data are presented in Sec. 4. We conclude the paper in Sec. 5.\n\n2 DRLR Informed K-Nearest Neighbors\n\nGiven a feature vector x ∈ R^p, and a set of M available actions [M], our goal is to predict the future outcome y_m(x) under each possible action m ∈ [M]. Assume the following relationship between the features and the outcome:\n\ny_m = x_m'β*_m + h_m(x_m) + ε_m,\n\nwhere prime denotes transpose; (x_m, y_m) represents the feature-outcome pair of an individual taking action m; β*_m is the coefficient vector that captures the linear trend; h_m(·) is a Lipschitz continuous nonlinear function (whose form is unknown) describing the nonlinear fluctuation in y_m; and ε_m is a noise term with zero mean and standard deviation η_m that expresses the intrinsic randomness of y_m and is assumed to be independent of x_m.\nSuppose that for each m ∈ [M], we observe N_m training samples (x_mi, y_mi), i = 1, . . . , N_m, that take action m. To estimate β*_m, we adopt the robust formulation developed in [13]. A robust model could lead to improved out-of-sample performance, and accommodate the potential nonlinearity that is not explicitly revealed by the linear coefficients β*_m, thus resulting in a more accurate assessment of the features. The DRLR model developed in [13] minimizes the worst-case absolute loss within a distributional ambiguity set, defined using the Wasserstein metric [18, 19], that contains all possible perturbations of the data distribution. The robustness is achieved through hedging against this family of distributions. 
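As shown in Eq. (2) below, this worst-case formulation reduces to the empirical absolute loss plus a norm penalty on (−β_m, 1). A minimal sketch of that reformulation, assuming a small dense problem where scipy's general-purpose derivative-free Powell search is adequate (an LP/SOCP or subgradient solver would be used at scale):

```python
import numpy as np
from scipy.optimize import minimize

def drlr_fit(X, y, r):
    """Solve min_beta (1/N) * sum_i |y_i - x_i' beta| + r * ||(-beta, 1)||_2,
    the penalized reformulation of the Wasserstein-DRO absolute-loss regression."""
    def objective(beta):
        residuals = y - X @ beta
        # ||(-beta, 1)||_2 = sqrt(1 + ||beta||^2)
        return np.abs(residuals).mean() + r * np.sqrt(1.0 + beta @ beta)
    beta0 = np.zeros(X.shape[1])
    # Powell is derivative-free, which suits the non-smooth |.| loss.
    res = minimize(objective, beta0, method="Powell")
    return res.x

# Toy usage: recover a linear trend under light noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = X @ np.array([1.5, -0.5]) + 0.1 * rng.normal(size=300)
beta_hat = drlr_fit(X, y, r=0.01)
```

The penalty term shrinks the estimate slightly toward zero; with a small ambiguity radius r the fit stays close to the data-generating coefficients.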
The learning problem is formulated as:\n\ninf_{β_m} sup_{Q_m ∈ Ω_m} E^{Q_m}[ |y_m − x_m'β_m| ],   (1)\n\nwhere Q_m is the probability distribution of (x_m, y_m), belonging to some set Ω_m defined as:\n\nΩ_m ≜ {Q_m ∈ M(Z_m) : W_1(Q_m, P̂_{N_m}) ≤ r_m},\n\nwhere Z_m is the set of all possible values for (x_m, y_m); M(Z_m) is the space of probability distributions supported on Z_m; P̂_{N_m} is the uniform empirical distribution on the N_m observed samples (x_mi, y_mi), i = 1, . . . , N_m; r_m is a pre-specified parameter indicating the amount of ambiguity allowed; and W_1(Q_m, P̂_{N_m}) is the order-1 Wasserstein distance between Q_m and P̂_{N_m}, defined as:\n\nW_1(Q_m, P̂_{N_m}) = sup_{f ∈ L} { ∫_{Z_m} f(z_m) Q_m(dz_m) − ∫_{Z_m} f(z_m) P̂_{N_m}(dz_m) },\n\nwhere z_m = (x_m, y_m), and L is the space of all Lipschitz continuous functions satisfying |f(z_m1) − f(z_m2)| ≤ ‖z_m1 − z_m2‖_2 for all z_m1, z_m2 ∈ Z_m.\nWith N_m independently and identically distributed samples (x_mi, y_mi), i = 1, . . . , N_m, [13] has shown that problem (1) can be reformulated as:\n\ninf_{β_m} (1/N_m) Σ_{i=1}^{N_m} |y_mi − x_mi'β_m| + r_m ‖(−β_m, 1)‖_2.   (2)\n\nSolving Eq. (2) gives us a robust estimator of the linear regression coefficient β*_m, which we denote by β̂_m. The elements of β̂_m measure the relative significance of the predictors in determining the outcome y_m. We feed this estimator into the nonlinear nonparametric K-NN regression model, by considering the following β̂_m-weighted metric:\n\n‖x − x_mi‖_{Ŵ_m} = sqrt( (x − x_mi)' Ŵ_m (x − x_mi) ),   (3)\n\nwhere Ŵ_m is a diagonal matrix with elements (β̂_m1)², . . . 
, (β̂_mp)², with β̂_mi the i-th element of β̂_m.\nFor a new test sample x, within each group m, we find its K_m nearest neighbors using the weighted distance function (3). The predicted future outcome for x under action m, denoted by ŷ_m(x), is computed as the average response among the K_m nearest neighbors, i.e.,\n\nŷ_m(x) = (1/K_m) Σ_{i=1}^{K_m} y_m(i),   (4)\n\nwhere y_m(i) is the response of the i-th closest sample to x within group m. Eq. (4) computes a K-NN estimate of the future outcome using the regression-coefficients-weighted distance function, which can be viewed as a locally smoothed estimator in the neighborhood of x. The intuition for using the β̂_m-weighted metric is to amplify the weight of features that are most predictive of y_m and downweight the unimportant ones. As a result, the selected samples are close to x in the most relevant features, and their corresponding response values should serve as a good approximation.\nNotice that Eq. (4) treats all neighbors equally by using the same weight. An alternative is to take a distance-weighted average of the responses of the neighbors; we have tried this strategy numerically on our medical datasets in Section 4, but found that its effect is not significantly different from taking a uniform average of the responses. We also note that the following theoretical analysis can be easily adapted to the weighted average response prediction.\nWe next show that Eq. (4) provides a good prediction in the sense of Mean Squared Error (MSE). The bias-variance decomposition implies the following:\n\nMSE( ŷ_m(x) | x, x_mi, i = 1, . . . , N_m ) ≜ E[ (ŷ_m(x) − y_m(x))² | x, x_mi, i = 1, . . . , N_m ]\n= E[ ( x'β*_m + h_m(x) + ε_m − (1/K_m) Σ_{i=1}^{K_m} ( x_m(i)'β*_m + h_m(x_m(i)) + ε_m(i) ) )² | x, x_mi, ∀i ]\n= ( (1/K_m) Σ_{i=1}^{K_m} ( (x − x_m(i))'β*_m + h_m(x) − h_m(x_m(i)) ) )² + η²_m/K_m + η²_m,   (5)\n\nwhere y_m(x) is the true future outcome on x if action m is prescribed, and x_m(i), ε_m(i) are the feature vector and the noise term corresponding to the i-th closest sample to x within group m, respectively. The last equality comes from the fact that the noise term is independent of the features.\nFor each m ∈ [M], we aim to provide a probabilistic bound for (5) w.r.t. the measure of the N_m training samples. By examining the first term on the last line of (5), we see that for the MSE to be small, the following three conditions need to be satisfied: (i) ‖β*_m − β̂_m‖_2 is small; (ii) ‖x − x_m(i)‖_{Ŵ_m} is small for i = 1, . . . , K_m; and (iii) h_m(x) − h_m(x_m(i)) is small for i = 1, . . . , K_m. Below we state
Below we state\nthe assumptions that are needed to establish the result.\nAssumption A (cid:107)(xm, ym)(cid:107)2 \u2264 Rm a.s..\nAssumption B (cid:107)(\u2212\u03b2m, 1)(cid:107)2 \u2264 \u00afBm.\nAssumption C For some set A(\u03b2\nm, 1)(cid:107)2} \u2229 Sp+1 and\nm) = cone{v| (cid:107)(\u2212\u03b2\n\u2217\n\u2217\nsome positive scalar \u03b1m, where Sp+1 is the unit sphere in the (p + 1)-dimensional Euclidean space,\n\nm, 1) + v(cid:107)2 \u2264 (cid:107)(\u2212\u03b2\n\u2217\n\nv(cid:48)ZmZ(cid:48)\n\nmv \u2265 \u03b1m,\n\ninf\n\nv\u2208A(\u03b2\u2217\n\nm)\n\nwhere Zm = [zm1 \u00b7\u00b7\u00b7 zmNm ] is the matrix with columns zm1, . . . , zmNm, with zmi = (xmi, ymi).\nAssumption D (xm, ym) is a centered sub-Gaussian random vector, i.e., it has zero mean and\nsatis\ufb01es the following condition:\n\n|||(xm, ym)|||\u03c82\n\n= sup\nu\u2208Sp+1\n\n|||(xm, ym)(cid:48)u|||\u03c82\n\n\u2264 \u00b5m.\n\n4\n\n\fAssumption E The covariance matrix of (xm, ym) has bounded positive eigenvalues. Set \u0393m =\nE[(xm, ym)(xm, ym)(cid:48)]; then,\n\n0 < \u03bbm0 (cid:44) \u03bbmin(\u0393m) \u2264 \u03bbmax(\u0393m) (cid:44) \u03bbm1 < \u221e.\n\nTo see the validity of the above assumptions, notice that with standardized data, Assumptions A and B\nare easily satis\ufb01ed. Assumptions C, D, and E bound the variance of (x, y) in terms of the eigenvalues\nof its covariance matrix and its sub-Gaussian norm. (If the variance in the data is prohibitively high,\nthe samples would contain little information to learn from.) Due to limited space, we defer the\nintermediate results that bound (cid:107)\u03b2\nto the supplementary. But those\nresults will be used as the foundation to derive the bound on the MSE of \u02c6ym(x).\n\nTheorem 2.1 Suppose we are given Nm i.i.d. copies of (xm, ym), denoted by (xmi, ymi), i =\n1, . . . 
, N_m, where x_m has independent, centered coordinates, and\n\ncov(x_m) = diag(σ²_m1, . . . , σ²_mp).\n\nGiven a fixed predictor x = (x_1, . . . , x_p), and some scalar w̄_m, assume that:\n\n1. h_m(·) is Lipschitz continuous with a Lipschitz constant L_m on the metric spaces (X_m, ‖·‖_2) and (Y_m, |·|), where X_m, Y_m are the domain and codomain of h_m(·), respectively.\n2. w̄²_m > B̄²_m Σ_{j=1}^{p} (σ²_mj + x²_j), where B̄_m is specified in Assumption B.\n3. |(x_mij − x_j)² − (σ²_mj + x²_j)| ≤ T_m, for all i, j, where x_mij is the j-th component of x_mi.\n4. The coordinates of any feasible solution to (2) have absolute values greater than or equal to some positive number b_m (dense estimators).\n\nUnder Assumptions A, B, C, D, E, when N_m ≥ n_m, with probability at least δ_m − I_{1−p_m0}(N_m − K_m + 1, K_m) w.r.t. the measure of samples,\n\nE[ (ŷ_m(x) − y_m(x))² | x, x_mi, i = 1, . . . , N_m ] ≤ ( w̄_m τ_m / b_m + √p w̄_m + L_m w̄_m / B̄_m )² + η²_m/K_m + η²_m,   (6)\n\nand for any a ≥ ( w̄_m τ_m / b_m + √p w̄_m + L_m w̄_m / B̄_m )² + η²_m/K_m + η²_m,\n\nP( (ŷ_m(x) − y_m(x))² ≥ a | x, x_mi, i = 1, . . . , N_m ) ≤ [ ( w̄_m τ_m / b_m + √p w̄_m + L_m w̄_m / B̄_m )² + η²_m/K_m + η²_m ] / a,   (7)\n\nwhere I_{1−p_m0}(·, ·) is the regularized incomplete beta function, and\n\np_m0 = 1 − exp( −(σ²_m/T²_m) g( T_m ( w̄²_m/B̄²_m − Σ_j (σ²_mj + x²_j) ) / σ²_m ) ),\n\nwith\n\nσ_m = sqrt( Σ_{j=1}^{p} var( (x_mij − x_j)² ) ),   g(u) = (1 + u) log(1 + u) − u.\n\nThe notations n_m, δ_m, and τ_m come from a simplified version of Theorem 3.11 in [13], which states that when the sample size N_m ≥ n_m, with probability at least δ_m,\n\n‖β*_m − β̂_m‖_2 ≤ τ_m.\n\nThe parameters n_m, δ_m, τ_m are related to the sub-Gaussian norm of (x_m, y_m), the eigenvalues of the covariance matrix of (x_m, y_m), and the geometric structure of the true regression coefficient β*_m.\nRemark 2.1 The expectation in (6) and the probability in (7) are w.r.t. the measure of the noise ε_m. Thm. 2.1 essentially says that for any given x, with high probability (w.r.t. the measure of samples), the prediction is close to the true future outcome. The prediction bias is determined by the accuracy of the linear coefficient estimate, the similarity between the individual in query and its K nearest neighbors, the dimensionality of the data, and the smoothness of the regression hypothesis.\nRemark 2.2 The dependence on b_m in the upper bound provided by (6) is due to the fact that Ŵ_m has diagonal elements β̂²_mj, j = 1, . . . , p, which are assumed to be at least b²_m. If we multiply Ŵ_m by a very large number, the neighbor selection criterion is not affected, since the relative significance of the predictors stays unchanged, but the b_m appearing in (6) would be replaced by a very large number, diminishing the effect of the first term in the parenthesis, at the price of increasing B̄_m and w̄_m, which in turn has an effect on the number of neighbors that are needed. It might be interesting to explore this implicit trade-off and find the optimal Ŵ_m that achieves the smallest MSE. For simplicity, we use Ŵ_m = diag(β̂²_m1, . . . , β̂²_mp) in this work.\nRemark 2.3 We offer insights similar to [20] for the generalization bounds. Theorem 5.1 in [20] provided a risk bound that depends on the empirical risk (reflected in τ_m and w̄_m of our bound), the dimensionality of the data (p), and the smoothness of the regression hypothesis (L_m).\n\n3 Prescriptive Policy Development\n\nWe now proceed to develop the prescriptive policy, with the aim of minimizing the future outcome. A natural idea is to take the action that yields the minimum predicted outcome. To allow for flexibility in exploring alternatives that have a comparable performance, and also to correct for potential prediction errors that might mislead the ranking of actions, we propose a randomized policy that prescribes each action with a probability inversely proportional to its exponentiated predicted outcome. It can
It can\nbe viewed as an of\ufb02ine Hedge algorithm [21] that increases the robustness of our method through\nexploration. Speci\ufb01cally, given an individual with a feature vector x, and her predicted future outcome\nunder each action m, denoted by \u02c6ym(x), we consider a randomized policy that chooses action m\nj=1 e\u2212\u03be \u02c6yj (x), with \u03be being some pre-speci\ufb01ed positive constant. We\n\nwith probability e\u2212\u03be \u02c6ym(x)/(cid:80)M\nprescribes action m with probability e\u2212\u03be \u02c6ym(x)/(cid:80)M\n\nwould like to explore properties of this policy in terms of its expected true outcome.\nTheorem 3.1 Given any \ufb01xed predictor x \u2208 Rp, denote its predicted and true future outcome under\naction m by \u02c6ym(x) and ym(x), respectively. Assume that we adopt a randomized strategy that\nj=1 e\u2212\u03be \u02c6yj (x), for some \u03be \u2265 0. Assume \u02c6ym(x) and\nym(x) are non-negative, \u2200m. For any k \u2208 [M ], the expected true outcome of this policy satis\ufb01es:\n\nM(cid:88)\n\nm=1\n\n(cid:18)\n(cid:80)\ne\u2212\u03be \u02c6ym(x)\nj e\u2212\u03be \u02c6yj (x) ym(x) \u2264 yk(x) +\n(cid:18) 1\nM(cid:88)\n\n+ \u03be\n\nM\n\nm=1\n\n\u02c6y2\nm(x) +\n\n\u02c6yk(x) \u2212 1\nM\n\n\u02c6ym(x)\n\n(cid:19)\n\nM(cid:88)\nM(cid:88)\n(cid:80)\n\nm=1\n\nm=1\n\n(cid:19)\n\ne\u2212\u03be \u02c6ym(x)\nj e\u2212\u03be \u02c6yj (x) y2\n\nm(x)\n\n(8)\n\n+\n\nlog M\n\n\u03be\n\n.\n\nTheorem 3.1 says that the expected true outcome of the randomized policy is no worse than the true\noutcome of any action k plus two components, one accounting for the gap between the predicted\noutcome under k and the average predicted outcome, and the other depending on the parameter \u03be.\nThinking about choosing k = arg minm ym(x), if \u02c6yk(x) is below the average predicted outcome\n(which should be true if we have an accurate prediction), it follows from (8) that the randomized\npolicy leads to a nearly optimal future outcome by an 
appropriate choice of \u03be.\nIn medical applications, when determining the future prescription for a patient, we usually have\naccess to some auxiliary information such as the current prescription she is receiving, and her current\nmeasurements. In consideration of the health care costs and treatment transients, it is not desired to\nswitch patients\u2019 treatments too frequently. We thus set a threshold level for the expected improvement\nin the outcome, below which the randomized strategy will be rendered inactive and the current\n\u2212\u03be \u02c6yj (x) \u02c6yk(x) \u2264 xco \u2212 T (x), mf(x) = m w.p.\nj=1 e\u2212\u03be \u02c6yj (x); otherwise mf(x) = mc(x), where mf(x) and mc(x) are the future and\ncurrent prescriptions for patient x, respectively; m is the prescribed action under the randomized\npolicy; xco represents the current observed outcome (e.g., current blood pressure), which is assumed\nto be one of the components of x; and T (x) is some threshold level which will be determined later.\nThis prescriptive rule basically says that the randomized strategy will be activated only if the expected\nimprovement relative to the current observed outcome is signi\ufb01cant.\n\ntherapy will be continued. Speci\ufb01cally, if(cid:80)\ne\u2212\u03be \u02c6ym(x)/(cid:80)M\n\ne\u2212\u03be \u02c6yk (x)\nj e\n\n(cid:80)\n\nk\n\n6\n\n\fTheorem 3.2 Assume that the distribution of the predicted outcome \u02c6ym(x) conditional on x, is\n2Cm(x), for any m \u2208 [M ] and any x. 
Given a small\nsub-Gaussian, and its \u03c82-norm is equal to\n0 < \u00af\u0001 < 1, in order to satisfy\n\n\u221a\n\nP\n\n(cid:32)(cid:88)\n(cid:16)\nwhere \u00b5\u02c6ym(x) = E[\u02c6ym(x)|x].\n\nit suf\ufb01ces to set a threshold\n\nT (x) = max\n\nk\n\n(cid:33)\n\n\u2264 \u00af\u0001,\n\n(cid:80)\n\n\u02c6yk(x) > xco \u2212 T (x)\n\ne\u2212\u03be \u02c6yk(x)\nj e\u2212\u03be \u02c6yj (x)\n(cid:16)\nxco \u2212 \u00b5\u02c6ym (x) \u2212(cid:112)\u22122C 2\n\n0, min\nm\n\n(cid:17)(cid:17)\n\nm(x) log(\u00af\u0001/M )\n\n,\n\nTheorem 3.2 \ufb01nds the largest threshold T (x) such that the probability of the expected improvement\nbeing less than T (x) is small. The parameters \u00b5\u02c6ym(x) and Cm(x) for m \u2208 [M ] can be estimated by\nsimulation through random sampling from a subset of the training examples.\n\nAlgorithm 1 Estimating the conditional mean and standard deviation of the predicted outcome.\n\nInput: a feature vector x; am: the number of subsamples used to compute \u02c6\u03b2m, am < Nm; dm:\nthe number of repetitions.\nfor i = 1, . . . , dm do\n\nRandomly pick am samples from group m, and use them to estimate a robust regression\n\ncoef\ufb01cient \u02c6\u03b2mi through solving 2.\n\nThe future outcome for x under action m is predicted as \u02c6ymi(x) = x(cid:48) \u02c6\u03b2mi.\n\nend for\nOutput: Estimate the conditional mean of \u02c6ym(x) as:\n\nand the conditional standard deviation as:\n\ndm(cid:88)\n\ni=1\n\n1\ndm\n\n\u00b5\u02c6ym(x) =\n\n\u02c6ymi (x),\n\n(cid:118)(cid:117)(cid:117)(cid:116) 1\n\ndm \u2212 1\n\ndm(cid:88)\n\n(cid:16)\n\n(cid:17)2\n\u02c6ymi(x) \u2212 \u00b5\u02c6ym (x)\n\n.\n\ni=1\n\nCm(x) =\n\nA Special Case. 
As ξ → ∞, the randomized policy will assign probability 1 to the action with the lowest predicted outcome, which is equivalent to the following deterministic policy:\n\nm_f(x) = arg min_m ŷ_m(x), if min_m ŷ_m(x) ≤ x_co − T(x); and m_f(x) = m_c(x) otherwise.\n\nA slight modification to the threshold level T(x) is given below:\n\nT(x) = max( 0, min_m ( x_co − μ_{ŷ_m}(x) − sqrt( −2 C²_m(x) log ε̄ ) ) ).\n\n4 Numerical Results on a Hypertension Dataset\n\nIn this section, we apply our method to develop optimal prescriptions for patients with hypertension. The data used for the study come from a large academic hospital system handling more than 1 million patient visits per year and consist of Electronic Health Records (EHR) containing the patients' medical history in the period 1999–2014. Our goal is to find the treatment that minimizes the future systolic blood pressure based on the medical histories.\n\n4.1 Dataset Description\n\nAccording to certain cohort selection criteria (see the supplementary material), we have identified 49,401 patients who have been diagnosed with hypertension. Each patient may have multiple entries in her/his medical record. To capture the period when the patient was experiencing the effect of the drug regimen, we define the line of therapy as a time period (between 200 and 500 days) during which the combination of drugs prescribed to the patient does not change.\n\nPatient Visits. During each line of therapy, we split the treatment history of each patient into several patient visits, to reflect changes in the features and outcomes. The patient visits are considered to occur every 70 days. The measurements and lab tests are averaged over the 10 days prior to the visit. 
We define the current prescription of each visit as the combination of drugs that were given during the 10 days immediately preceding the visit, and the standard of care as the drug regimen that is prescribed by the doctors at the time of the visit. The future outcome of the visit is computed as the average systolic blood pressure, in mmHg, 70 to 180 days after it. Patient visits that contain missing values for the outcome are dropped. We have obtained 26,128 valid visits, each with 63 features.\n\nFeatures. The features for building the predictive model include: (i) demographics: sex, age, and race; (ii) measurements: systolic blood pressure, diastolic blood pressure, Body Mass Index (BMI), and pulse; (iii) lab tests: blood chemistry tests and hematology tests; and (iv) diagnosis history.\n\nPrescriptions. We consider six types of prescriptions for hypertension, each corresponding to a different medication that could be prescribed: ACE inhibitors, Angiotensin Receptor Blockers (ARB), calcium channel blockers, thiazide and thiazide-like diuretics, α-blockers, and β-blockers. The patient visits are grouped based on their standard of care.\n\n4.2 Model Development and Results\n\nWe compare our algorithm with several alternatives that replace our DRLR-informed K-NN with a different predictive model, such as LASSO, CART, and OLS-informed K-NN [7]. Both deterministic and randomized prescriptive policies are considered using predictions from these models.\n\nParameter tuning. Within each prescription group, we randomly split the patient visits into three sets: a training set (80%), a validation set (10%), and a test set (10%). To reflect the dependency of the number of neighbors on the number of training samples, we perform a linear regression between these two quantities, which we use to determine the number of neighbors needed in different settings. 
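The neighbor-count rule just described can be sketched as follows; the (training-set size, best K) pairs below are hypothetical stand-ins for the values found on the validation sets:

```python
import numpy as np

# Hypothetical (training-set size, best K found on validation) pairs.
n_samples = np.array([500, 1000, 2000, 4000], dtype=float)
best_k = np.array([12, 20, 38, 71], dtype=float)

# Linear regression K ~ a*N + b, then used to pick K for a new training size.
a, b = np.polyfit(n_samples, best_k, deg=1)

def neighbors_for(n):
    """Predicted number of neighbors for a training set of size n (at least 1)."""
    return max(1, int(round(a * n + b)))

k_3000 = neighbors_for(3000)
```

This gives a cheap rule for choosing K in settings (e.g., group sizes not seen during validation) without re-running the full cross-validation.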
To tune the exponent $\xi$ for the randomized strategy, we need to evaluate the effects of counterfactual treatments. We assess the predictive power of a series of robust predictive models (see the supplementary) in terms of their $R^2$ and out-of-sample estimation errors, and select the DRLR+K-NN model (the imputation model), which excels in all metrics, to impute the outcome of an unobservable treatment m, using the validation set.
When comparing the predictive performance of the models (Table 1, Supplementary), we fit a common regression model to all patients without dividing them into groups, with the prescription being used as a predictor. This leads to a significant reduction in the unexplained variance of y, and thus the advantages of DRLR+K-NN are not significant. However, when we do groupwise regression, where the prescription is not used as a predictor, the unexplained noise increases, robustness becomes more critical, and thus the advantages of our method become more prominent (see the following Table 1).

Model training. We solve the predictive models on the whole training set with the best tuned parameters, the output of which is used to develop the optimal prescriptions for the test set patients. The parameter $\bar{\epsilon}$ in the threshold $T(x)$ is set to 0.1. We compute the average improvement (reduction) in outcomes for patients in the test set, which is defined to be the difference between the (expected) future outcome under the recommended therapy and the currently observed outcome. If the recommendation does not match the standard of care, its future outcome is estimated through the imputation model discussed earlier, where $K_m$ should be selected to fit the size of the test set.

Refinement of the policy. Given the sensitivity of K-NN to the number of neighbors, we propose a refinement of our model, in which a patient-specific number of neighbors $K'_m$ is used and the neighbors that are relatively far away from the query patient are discarded. This can be viewed as taking a weighted average of the neighbors' responses to make the K-NN prediction. Specifically, denote by $d^m_i$ the distance between the query patient and her $i$-th closest neighbor in group $m$; we know $d^m_1 \le d^m_2 \le \cdots \le d^m_{K_m}$. Define

$$j^*_m = \arg\max_j \Bigl( d^m_j - \frac{1}{j-1}\sum_{i=1}^{j-1} d^m_i \Bigr).$$

Given some threshold $\tilde{T}$, the number of neighbors $K'_m$ is determined as follows:

$$K'_m = \begin{cases} j^*_m - 1, & \text{if } \dfrac{d^m_{j^*_m} - \frac{1}{j^*_m-1}\sum_{i=1}^{j^*_m-1} d^m_i}{\frac{1}{j^*_m-1}\sum_{i=1}^{j^*_m-1} d^m_i} > \tilde{T}, \\ K_m, & \text{otherwise.} \end{cases}$$

Results and discussion. The reductions in outcomes (future minus current) for various models are shown in Table 1. The columns indicate the prescriptive policies (deterministic or randomized); the rows represent the predictive models whose outcomes $\hat{y}_m(x)$ serve as inputs to the prescriptive algorithm. We compare two strategies that use different rules for selecting the number of neighbors, with a validated threshold $\tilde{T} = 1$ for the patient-specific strategy. We test the performance of all algorithms over five repetitions, each with a different training set.
We also list the reductions in outcomes resulting from the standard of care, and from the current prescription, which simply continues the current drug regimen.
Several observations are in order: (i) all models outperform the current prescription and the standard of care; (ii) the DRLR+K-NN model leads to the largest reduction in outcomes with a relatively stable performance; (iii) using a patient-specific $K'_m$ in general leads to a more significant reduction in outcomes; and (iv) the randomized policy achieves performance similar to (and slightly better than) the deterministic one. Overall, the best DRLR+K-NN model leads to a 69% larger reduction in future systolic blood pressure than the 2nd-best model. We expect the randomized strategy to win when the effects of several treatments do not differ much, in which case the deterministic algorithm might produce misleading results. The randomized policy could potentially improve the out-of-sample performance, as it gives the flexibility of exploring options that are suboptimal on the training set, but might be optimal on the test set.
It may be argued that the observed improvement is due to the evaluation model we choose; specifically, using DRLR+K-NN to assess the performance of all candidates might cause a bias that favors our method. To mitigate this bias, we also used a mixture of OLS+K-NN and DRLR+K-NN (with equal weights) as the imputation model, given that they achieve the best predictive performance.
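As an aside on the role of the exponent $\xi$ in the randomized policy, the toy sketch below uses an assumed softmax-type rule over negative predicted outcomes. It only reproduces the limiting behavior described above ($\xi \rightarrow \infty$ concentrates all probability on the action with the lowest predicted outcome) and is not necessarily the paper's exact parameterization:

```python
import math

def randomized_policy(y_hat, xi):
    """Toy softmax-type randomized policy: the probability of action m
    is proportional to exp(-xi * y_hat[m]), so lower predicted outcomes
    receive higher probability.  (Illustrative form only.)"""
    lo = min(y_hat)                       # shift to stabilize exponentials
    w = [math.exp(-xi * (y - lo)) for y in y_hat]
    total = sum(w)
    return [wi / total for wi in w]

y_hat = [130.0, 130.5, 140.0]             # predicted systolic BP per action
p_small = randomized_policy(y_hat, xi=0.5)   # near-ties share probability
p_large = randomized_policy(y_hat, xi=50.0)  # concentrates on the argmin
```

For near-tied treatments (130.0 vs. 130.5 mmHg here), a moderate $\xi$ keeps both actions in play, which is exactly the regime where the randomized strategy is expected to help.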
Under this scheme, our model still outperforms all others.

Table 1: The reduction in future systolic blood pressure (mmHg); mean (standard deviation).

                     | Training with a patient-specific K'_m | Training with a uniform K_m
                     | Deterministic | Randomized            | Deterministic | Randomized
LASSO                | -4.34 (0.28)  | -4.33 (0.28)          | -4.22 (0.19)  | -4.22 (0.20)
CART                 | -4.46 (0.46)  | -4.49 (0.50)          | -4.48 (0.55)  | -4.51 (0.49)
OLS+K-NN             | -4.30 (0.35)  | -4.30 (0.32)          | -4.29 (0.31)  | -4.27 (0.32)
DRLR+K-NN            | -7.42 (0.46)  | -7.58 (0.51)          | -6.58 (0.70)  | -6.78 (0.73)
Current prescription | -2.56 (0.14)                          | -2.50 (0.16)
Standard of care     | -2.37 (0.11)                          | -2.37 (0.11)

5 Conclusions

We developed a prediction-based prescriptive method that determines the probability of taking each action based on the predictions from a DRLR informed K-NN model. Theoretical guarantees on the out-of-sample performance of the predictive model and the optimality of the prescriptive algorithm were established. We also derived a closed-form expression for the threshold level that is used to activate the randomized policy. The proposed approach was applied to actual hypertension patient data obtained from a major academic hospital system, providing numerical evidence for the superiority of our algorithm in terms of the improvement in outcomes.

Acknowledgments

The research was partially supported by the NSF under grants IIS-1914792, DMS-1664644, and CNS-1645681, by the ONR under grant N00014-19-1-2571, by the NIH under grant 1R01GM135930, by the Boston University Data Science Initiative, and by the BU Center for Information and Systems Engineering.

References

[1] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638-1646, 2014.

[2] Naomi S Altman.
An introduction to kernel and nearest-neighbor nonparametric regression.\n\nThe American Statistician, 46(3):175\u2013185, 1992.\n\n[3] Hamsa Bastani and Mohsen Bayati. Online decision-making with high-dimensional covariates.\n\n2015.\n\n[4] Dimitris Bertsimas and Nathan Kallus. From predictive to prescriptive analytics. arXiv preprint\n\narXiv:1402.5481, 2014.\n\n[5] Dimitris Bertsimas and Christopher McCord. Optimization over continuous and multi-\n\ndimensional decisions with observational data. arXiv preprint arXiv:1807.04183, 2018.\n\n[6] Dimitris Bertsimas and Bart Van Parys. Bootstrap robust prescriptive analytics. arXiv preprint\n\narXiv:1711.09974, 2017.\n\n[7] Dimitris Bertsimas, Nathan Kallus, Alexander M Weinstein, and Ying Daisy Zhuo. Personalized\ndiabetes management using electronic medical records. Diabetes care, page dc160826, 2016.\n\n[8] Dimitris Bertsimas, Jack Dunn, and Nishanth Mundru. Optimal prescriptive trees. INFORMS\n\nJournal on Optimization, 2018.\n\n[9] Max Biggs and Rim Hariss. Optimizing objective functions determined from random forests.\n\n2018.\n\n[10] Fernanda Bravo and Yaron Shaposhnik. Discovering optimal policies: A machine learning\n\napproach to model analysis. 2017.\n\n[11] Leo Breiman. Random forests. Machine learning, 45(1):5\u201332, 2001.\n\n[12] Leo Breiman. Classi\ufb01cation and regression trees. Routledge, 2017.\n\n[13] Ruidi Chen and Ioannis Ch Paschalidis. A robust learning approach for regression models based\non distributionally robust optimization. Journal of Machine Learning Research, 19(13), 2018.\n\n[14] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff\nfunctions. In Proceedings of the Fourteenth International Conference on Arti\ufb01cial Intelligence\nand Statistics, pages 208\u2013214, 2011.\n\n[15] William S Cleveland and Susan J Devlin. Locally weighted regression: an approach to regression\nanalysis by local \ufb01tting. 
Journal of the American Statistical Association, 83(403):596-610, 1988.

[16] Dick den Hertog and Krzysztof Postek. Bridging the gap between predictive and prescriptive analytics - new optimization methodology needed. Technical report, Tilburg University, Netherlands, 2016. Available at: http://www.optimization-online.org/DB_HTML/2016/12/5779.html.

[17] Jack Dunn. Optimal Trees for Prediction and Prescription. PhD thesis, Massachusetts Institute of Technology, 2018. Available at: http://jack.dunn.nz/papers/Thesis.pdf.

[18] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, pages 1-52, 2017.

[19] Rui Gao and Anton J Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.

[20] Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient regression in metric spaces via approximate Lipschitz extension. IEEE Transactions on Information Theory, 63(8):4838-4849, 2017.

[21] Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157-325, 2016.

[22] Suvrajeet Sen and Yunxiao Deng. Learning enabled optimization: Towards a fusion of statistical learning and stochastic programming. INFORMS Journal on Optimization (submitted), 2018.

[23] Aleksandrs Slivkins. Contextual bandits with similarity information. The Journal of Machine Learning Research, 15(1):2533-2568, 2014.

[24] Ambuj Tewari and Susan A Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495-517. Springer, 2017.

[25] Huasen Wu, R Srikant, Xin Liu, and Chong Jiang. Algorithms with logarithmic or sublinear regret for constrained contextual bandits.
In Advances in Neural Information Processing Systems, pages 433-441, 2015.

[26] Isaac Xia. The Price of Personalization: An Application of Contextual Bandits to Mobile Health. PhD thesis, 2018.

[27] Feiyun Zhu, Jun Guo, Ruoyu Li, and Junzhou Huang. Robust actor-critic contextual bandit for mobile health (mhealth) interventions. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 492-501. ACM, 2018.