{"title": "Matching on Balanced Nonlinear Representations for Treatment Effects Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 929, "page_last": 939, "abstract": "Estimating treatment effects from observational data is challenging due to the missing counterfactuals. Matching is an effective strategy to tackle this problem. The widely used matching estimators such as nearest neighbor matching (NNM) pair the treated units with the most similar control units in terms of covariates, and then estimate treatment effects accordingly. However, the existing matching estimators have poor performance when the distributions of control and treatment groups are unbalanced. Moreover, theoretical analysis suggests that the bias of causal effect estimation would increase with the dimension of covariates. In this paper, we aim to address these problems by learning low-dimensional balanced and nonlinear representations (BNR) for observational data. In particular, we convert counterfactual prediction as a classification problem, develop a kernel learning model with domain adaptation constraint, and design a novel matching estimator. The dimension of covariates will be significantly reduced after projecting data to a low-dimensional subspace. Experiments on several synthetic and real-world datasets demonstrate the effectiveness of our approach.", "full_text": "Matching on Balanced Nonlinear Representations for\n\nTreatment Effects Estimation\n\nSheng Li\n\nAdobe Research\n\nSan Jose, CA\n\nsheli@adobe.com\n\nYun Fu\n\nNortheastern University\n\nBoston, MA\n\nyunfu@ece.neu.edu\n\nAbstract\n\nEstimating treatment effects from observational data is challenging due to the\nmissing counterfactuals. 
Matching is an effective strategy to tackle this problem. The widely used matching estimators such as nearest neighbor matching (NNM) pair the treated units with the most similar control units in terms of covariates, and then estimate treatment effects accordingly. However, the existing matching estimators perform poorly when the distributions of the control and treatment groups are unbalanced. Moreover, theoretical analysis suggests that the bias of causal effect estimation increases with the dimension of covariates. In this paper, we aim to address these problems by learning low-dimensional balanced and nonlinear representations (BNR) for observational data. In particular, we convert counterfactual prediction into a classification problem, develop a kernel learning model with a domain adaptation constraint, and design a novel matching estimator. The dimension of covariates is significantly reduced after projecting data to a low-dimensional subspace. Experiments on several synthetic and real-world datasets demonstrate the effectiveness of our approach.

1 Introduction

Causal questions exist in many areas, such as health care [24, 12], economics [14], political science [17], education [36], digital marketing [6, 43, 5, 15, 44], etc. In the field of health care, it is critical to understand whether a new medicine can cure a certain illness and perform better than the old ones. In political science, it is of great importance to evaluate whether the government should fund a job training program, by assessing whether the program is the true factor that leads to success in job hunting. All of these causal questions can be addressed by causal inference techniques. Formally, causal inference estimates the treatment effect on some units after interventions [33, 20]. In the above example of health care, the units could be patients, and the intervention would be taking new medicines.
Due to the wide applications of causal questions, effective causal inference techniques are highly desired to address these problems.

Generally, causal inference problems can be tackled by either experimental studies or observational studies. Experimental studies are popular in traditional causal inference, but they are time-consuming and sometimes impractical. As an alternative strategy, observational studies, which extract causal knowledge only from the observed data, have attracted increasing attention in the past decades. Two major paradigms for observational study have been developed in computer science and statistics: the causal graphical model [29] and the potential outcome framework [27, 33]. The former builds directed acyclic graphs (DAGs) over covariates, treatment and outcome, and uses probabilistic inference to determine causal relationships, while the latter estimates counterfactuals for each treated unit and gives a precise definition of causal effect. The equivalence of the two paradigms has been discussed in [11]. In this paper, we mainly focus on the potential outcome framework.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

A missing data problem needs to be dealt with in the potential outcome framework. As each unit is either treated or not treated, it is impossible to observe its outcomes in both scenarios. In other words, one has to predict the missing counterfactuals. A widely used solution to estimating counterfactuals is matching. According to the (binary) treatment assignments, a set of units can be divided into a treatment group and a control group. For each treated unit, matching methods select its counterpart in the control group based on certain criteria, and treat the selected unit as a counterfactual. Then the treatment effect can be estimated by comparing the outcomes of treated units and the corresponding counterfactuals.
Some popular matching estimators include nearest neighbor matching (NNM) [32], propensity score matching (PSM) [31], coarsened exact matching (CEM) [17], genetic matching [9], etc.

Existing matching methods have three major drawbacks. First, they perform matching either in the original covariate space (e.g., NNM, CEM) or in the one-dimensional propensity score space (e.g., PSM); the potential of intermediate representations has not been extensively studied before. Second, existing methods work well for data with a moderate number of covariates, but may fail for data with a large number of covariates, as theoretical analysis suggests that the bias of treatment effect estimation increases with the dimension of covariates [1]. Third, most matching methods do not take into account whether the distributions of the two groups are balanced. The matching process makes little sense if the distributions of the two groups have little overlap.

To address the above problems, we propose to learn balanced and nonlinear representations (BNR) from observational data, and design a novel matching estimator named BNR-NNM. First, the counterfactual prediction problem is converted into a multi-class classification problem, by categorizing the outcomes into ordinal labels. Then, we propose a novel criterion named ordinal scatter discrepancy (OSD) for supervised kernel learning on data with ordinal labels, and extract low-dimensional nonlinear representations from covariates. Further, to achieve balanced distributions in the low-dimensional space, a maximum mean discrepancy (MMD) criterion [4] is incorporated into the model. Finally, matching is performed on the extracted balanced representations, in order to provide a robust estimation of causal effect. In summary, the main contributions of our work are four-fold: (1) we propose a novel matching estimator, BNR-NNM, which learns low-dimensional balanced and nonlinear representations via kernel learning; (2) we convert the counterfactual prediction problem into a multi-class classification problem, and design an OSD criterion for nonlinear kernel learning with ordinal labels; (3) we incorporate a domain adaptation constraint into feature learning by using the maximum mean discrepancy criterion, which leads to balanced representations; and (4) we evaluate the proposed estimator on both synthetic and real-world datasets, and demonstrate its superiority over the state-of-the-art methods.

2 Background

Potential Outcome Framework.
The potential outcome framework was proposed by Neyman and Rubin [27, 33]. Considering binary treatments for a set of units, there are two possible outcomes for each unit. Formally, for unit k, the outcome is defined as Y_k(1) if it received treatment, and Y_k(0) if it did not. Then, the individual-level treatment effect is defined as γ_k = Y_k(1) − Y_k(0). Clearly, each unit only belongs to one of the two groups, and therefore we can only observe one of the two possible outcomes. This is the well-known missing data problem in causal inference. In particular, if unit k received treatment, Y_k(1) is the observed outcome, and Y_k(0) is missing data, i.e., the counterfactual.

The potential outcome framework usually makes the following assumptions [19].

Assumption 1. Stable Unit Treatment Value Assumption (SUTVA): The potential outcomes for any unit do not vary with the treatments assigned to other units, and for each unit there are no different forms or versions of each treatment level that lead to different potential outcomes.

Assumption 2. Strongly Ignorable Treatment Assignment (SITA): Conditional on covariates x_k, treatment T_k is independent of the potential outcomes, and every unit has a nonzero probability of receiving treatment:

(Y_k(1), Y_k(0)) ⊥ T_k | x_k,   (Unconfoundedness)
0 < Pr(T_k = 1 | x_k) < 1.   (Overlap)   (1)

These assumptions enable modeling the treatment of one unit with respect to its covariates, independent of outcomes and other units.

Matching Estimators. To address the aforementioned missing data problem, a simple yet effective strategy has been developed, which is matching [32, 33, 14, 40].
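As a minimal illustration of the missing data problem (a sketch with made-up simulated data, not the paper's experimental setup), one can generate both potential outcomes and then mask the unobserved one:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
# Hypothetical potential outcomes for N units.
y0 = rng.normal(size=N)          # outcome if not treated, Y_k(0)
y1 = y0 + 1.0                    # outcome if treated, Y_k(1); true effect is 1
t = rng.integers(0, 2, size=N)   # binary treatment assignment T_k

# Only one potential outcome is ever observed per unit.
y_obs = np.where(t == 1, y1, y0)

# The individual-level effect gamma_k = Y_k(1) - Y_k(0) is unobservable in
# practice; here we know it only because both outcomes were simulated.
gamma = y1 - y0
```

In real observational data only `y_obs` and `t` are available, which is why the counterfactual must be estimated, e.g., by matching.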
The idea of matching is to estimate the counterfactual for a treated unit by seeking its most similar counterpart in the control group. Existing matching methods can be roughly divided into three categories: nearest neighbor matching (NNM), weighting, and subclassification. We mainly focus on NNM in this paper.

Let X_C ∈ R^{d×N_C} and X_T ∈ R^{d×N_T} denote the covariates of a control group and a treatment group, respectively, where d is the number of covariates, and N_C and N_T are the group sizes. T is a binary vector indicating whether the units received treatment (i.e., T_k = 1) or not (i.e., T_k = 0). Y is an outcome vector. For each treated unit k, NNM finds its nearest neighbor in the control group in terms of the covariates. The outcome of the selected control unit is considered as an estimate of the counterfactual. Then, the average treatment effect on the treated (ATT) is defined as:

ATT = (1/N_T) Σ_{k: T_k=1} (Y_k(1) − Ŷ_k(0)),   (2)

where Ŷ_k(0) is the counterfactual estimated from unit k's nearest neighbor in the control group. NNM can be implemented in various ways, such as using different distance metrics, or choosing a different number of neighbors. Euclidean distance and Mahalanobis distance are two widely-used distance metrics for NNM.
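A 1-nearest-neighbor estimator of the ATT in the spirit of Eq.(2) can be sketched as follows (an illustrative implementation; the function and variable names are ours, not from the paper):

```python
import numpy as np

def nnm_att(x_c, y_c, x_t, y_t):
    """Estimate ATT by 1-nearest-neighbor matching on covariates.

    x_c: (N_C, d) control covariates,  y_c: (N_C,) control outcomes,
    x_t: (N_T, d) treated covariates,  y_t: (N_T,) treated outcomes.
    """
    # Pairwise Euclidean distances between treated and control units.
    dists = np.linalg.norm(x_t[:, None, :] - x_c[None, :, :], axis=2)
    # For each treated unit, the closest control unit's outcome serves
    # as the estimated counterfactual Y_hat_k(0).
    y0_hat = y_c[dists.argmin(axis=1)]
    # Average of Y_k(1) - Y_hat_k(0) over treated units, as in Eq.(2).
    return float(np.mean(y_t - y0_hat))
```

A Mahalanobis-distance variant would replace the Euclidean distance with (x − x')^T Σ⁻¹ (x − x'), where Σ is the sample covariance of the covariates.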
They work well when there are a few covariates with normal distributions [34]. Another important matching estimator is propensity score matching (PSM) [31]. PSM estimates the propensity score (i.e., the probability of receiving treatment) for each unit via logistic regression, and pairs the units from the two groups with similar scores [35, 8, 30]. Most recently, a covariate balancing propensity score (CBPS) method was developed to balance the distributions of the two groups by weighting the covariates, and has shown promising performance [18].

The key differences between the proposed BNR-NNM estimator and traditional matching estimators are two-fold. First, BNR-NNM performs matching in an intermediate low-dimensional subspace that can guarantee a low estimation bias, while traditional estimators adopt either the original covariate space or a one-dimensional space. Second, BNR-NNM explicitly pursues balanced distributions across the treatment and control groups, a property that traditional estimators usually fail to achieve.

Machine Learning for Causal Inference. In recent years, researchers have been exploring the relationships between causal inference and machine learning [39, 10, 38]. A number of predictive models have been designed to estimate causal effects, such as causal trees [3] and causal forests [42]. Balancing the distributions of the two groups is considered a key issue in observational study, which is closely related to covariate shift and, more generally, domain adaptation [2].
Meanwhile, causal inference has also been incorporated to improve the performance of domain adaptation [46, 45]. Most recently, the idea of representation learning has been introduced to learn new features from covariates through random projections [25], informative subspace learning [7], and deep neural networks [21, 37].

3 Learning Balanced and Nonlinear Representations (BNR)

In this section, we first define the notations that will be used throughout this paper. Then we introduce how to convert the counterfactual prediction problem into a multi-class classification problem, and justify the rationale of this strategy. We also present the details of how to learn nonlinear and balanced representations, and derive the closed-form solutions to the model.

Notations. Let X = [X_C, X_T] ∈ R^{d×N} denote the covariates of all units, where X_C ∈ R^{d×N_C} is the control group with N_C units, and X_T ∈ R^{d×N_T} is the treatment group with N_T units. N is the total number of units, and d is the number of covariates for each unit. φ : x ∈ R^d → φ(x) ∈ F is a nonlinear mapping function from the sample space R^d to an implicit feature space F. T ∈ R^{N×1} is a binary vector indicating whether the units received treatment. Y ∈ R^{N×1} is an outcome vector. The elements in Y could be either discrete or continuous values.

3.1 From Counterfactual Prediction to Multi-Class Classification

When estimating the treatment effects as shown in Eq.(2), we only have the observed outcome Y_k(1), but need to estimate the counterfactual Ŷ_k(0). Ideally, we would train a model Ŷ_k(0) = F_cf(x_k) that can predict the counterfactual for any unit, given the covariate vector x_k. One strategy is to build a predictive model (e.g., regression) that maps each unit x_i to its output Y_i, which has been extensively studied before.
Alternatively, we can convert the counterfactual prediction problem into a multi-class classification problem.

Given a set of units X and the corresponding outcome vector Y, we aim to learn a predictive model F_cf(x_k) that maps from the covariate space to the outcome space. In particular, we propose to seek an intermediate representation space in which units close to each other have very similar outcomes. The outcome vector Y usually contains continuous values. We categorize the outcomes in Y into multiple levels on the basis of the magnitude of the outcome value, and consider them as (pseudo) class labels. Clustering or kernel density estimation can be used for discretizing Y. Finally, Y is converted to a (pseudo) class label vector Y_c with c categories. For example, Y = [0.3, 0.5, 1.1, 1.2, 2.4] could be categorized as Y_3 = [1, 1, 2, 2, 3]. As a result, we can use Y_c and X to train a classifier.

Note that Y_c actually contains ordinal labels, as the discretized labels carry additional information. In particular, the labels [1, 2, 3] are not totally independent. We actually assume that Class 1 should be closer to Class 2 than to Class 3, since the outcome values in Class 1 are closer to those in Class 2. We will make use of such ordinal label information when designing the classification model.

3.2 Learning Nonlinear Representations via Ordinal Scatter Discrepancy

To obtain effective representations from X, we propose to train a nonlinear classifier in a reproducing kernel Hilbert space (RKHS). The reasons for employing RKHS-based nonlinear models are as follows. First, compared to linear models, nonlinear models are usually more capable of dealing with complicated data distributions. It is well known that the treatment and control groups might have diverse distributions, and nonlinear models are able to tightly couple them in a shared low-dimensional subspace.
Second, RKHS-based nonlinear models usually have closed-form solutions because of the kernel trick, which is beneficial for handling large-scale data.

Let φ(x_i) denote the mapped counterpart of x_i in kernel space, and let Φ(X) = [φ(x_1), φ(x_2), ..., φ(x_N)]. In light of the maximum scatter difference criterion [26], we take into account the ordinal label information, and propose a novel criterion named Ordinal Scatter Discrepancy (OSD) to achieve the desired data distribution after projecting Φ(X) to a low-dimensional subspace. In particular, OSD minimizes the within-class scatter and meanwhile maximizes the noncontiguous-class scatter. Let P denote a transformation matrix; OSD maps samples onto a subspace by maximizing the difference between the noncontiguous-class scatter and the within-class scatter. We perform OSD in kernel space to learn nonlinear representations, and have the following objective function:

argmax_P  F(P, Φ(X), Y_c) = tr(P^T (K_I − α K_W) P),   s.t. P^T P = I,   (3)

where α is a non-negative trade-off parameter, tr(·) is the matrix trace operator, and I is an identity matrix. The orthogonality constraint P^T P = I is introduced to reduce redundant information in the projection.

In Eq.(3), K_I and K_W are the noncontiguous-class scatter matrix and the within-class scatter matrix in kernel space, respectively. Their detailed definitions are:

K_I^Φ = (c(c−1)/2) Σ_{i=1}^{c} Σ_{j=i+1}^{c} e^{(j−i)} (m_i − m_j)(m_i − m_j)^T,   (4)

K_W^Φ = (1/N) Σ_{i=1}^{c} Σ_{j=1}^{n_i} (ξ(x_ij) − m_i)(ξ(x_ij) − m_i)^T,   (5)

where ξ(x_ij) = [k(x_1, x_ij), k(x_2, x_ij), ..., k(x_N, x_ij)]^T, m_i is the mean vector of the ξ(x_ij) belonging to the i-th class, m̄ is the mean vector of all ξ(x_ij), and n_i is the number of units in the i-th class.
k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ is a kernel function, which is utilized to avoid calculating the explicit form of the function φ (i.e., the kernel trick).

Eq.(4) characterizes the scatter of a set of classes with (pseudo) ordinal labels. It measures the scatter of every pair of classes. The factor e^{(j−i)} is used to penalize classes that are noncontiguous. The intuition is that, for ordinal labels, we expect contiguous classes to remain close to each other after projection, while noncontiguous classes should be pushed away. Therefore, we put larger weights on the noncontiguous classes. For example, e^{(2−1)} < e^{(3−1)}, since Class 1 should be closer to Class 2 than to Class 3, as explained in Section 3.1.

Eq.(5) measures the within-class scatter. We expect that units having the same (pseudo) class label will be very close to each other in the feature space, and therefore they will have similar feature representations after projection.

The differences between the proposed OSD criterion and other discriminative criteria (e.g., the Fisher criterion, the maximum scatter difference criterion) are two-fold: (1) OSD learns nonlinear projections and feature representations in an RKHS; (2) OSD explicitly makes use of the ordinal label information that is usually ignored by existing criteria. Moreover, the maximum scatter difference criterion is a special case of OSD.

3.3 Learning Balanced Representations via Maximum Mean Discrepancy

Balanced distributions of the control and treatment groups, in terms of covariates, would greatly facilitate causal inference methods such as NNM. To this end, we adopt the idea of maximum mean discrepancy (MMD) [4] when learning the transformation P, and finally obtain balanced nonlinear representations.
The MMD criterion has been successfully applied to problems such as domain adaptation [28].

Assume that the control group X_C and the treatment group X_T are sets of random variables with distributions P and Q; MMD provides an empirical estimate of the distance between P and Q. In particular, MMD estimates the distance between the nonlinear feature sets Φ(X_C) and Φ(X_T), which can be formulated as:

Dist(Φ(X_C), Φ(X_T)) = ‖ (1/N_C) Σ_{i=1}^{N_C} φ(X_Ci) − (1/N_T) Σ_{i=1}^{N_T} φ(X_Ti) ‖²_F,   (6)

where F denotes a kernel space.

By utilizing the kernel trick, Dist(Φ(X_C), Φ(X_T)) in the original kernel space can be equivalently converted to:

Dist(Φ(X_C), Φ(X_T)) = tr(KL),   (7)

where K = [K_CC, K_CT; K_TC, K_TT] is a kernel matrix; K_CC, K_TT, and K_TC are kernel matrices defined on the control group, the treatment group, and across groups, respectively; and L is a constant matrix: if x_i, x_j ∈ X_C, L_ij = 1/N_C²; if x_i, x_j ∈ X_T, L_ij = 1/N_T²; otherwise, L_ij = −1/(N_C N_T).

As all the units are projected into a new space via the projection P, we need to measure the MMD for the new representations Ψ(X_C) = P^T Φ(X_C) and Ψ(X_T) = P^T Φ(X_T), and rewrite Eq.(7) into the following form after some derivations:

Dist(Ψ(X_C), Ψ(X_T)) = tr(P^T K L K P).   (8)

3.4 BNR Model and Solutions

The representation learning objectives described in Section 3.2 and Section 3.3 are actually performed on the same data set with different partitions. For nonlinear representation learning, we merge the control group and treatment group, assign a (pseudo) ordinal label to each unit, and then learn discriminative nonlinear features accordingly. For balanced representation learning, we aim to mitigate the distribution discrepancy between the control group and the treatment group.
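To make the kernel-trick identity of Eqs.(6)-(7) concrete, the following sketch (our own illustration, using an explicit linear kernel so that φ(x) = x) verifies numerically that the squared distance between group means equals tr(KL):

```python
import numpy as np

rng = np.random.default_rng(1)
Xc = rng.normal(size=(5, 3))          # control units (rows), toy data
Xt = rng.normal(size=(4, 3)) + 0.5    # treated units (rows), shifted
Nc, Nt = len(Xc), len(Xt)

# Linear kernel for illustration: phi(x) = x, so K = X X^T.
X = np.vstack([Xc, Xt])
K = X @ X.T

# Constant matrix L from Eq.(7): 1/Nc^2, 1/Nt^2, and -1/(Nc*Nt) blocks.
L = np.zeros((Nc + Nt, Nc + Nt))
L[:Nc, :Nc] = 1.0 / Nc**2
L[Nc:, Nc:] = 1.0 / Nt**2
L[:Nc, Nc:] = L[Nc:, :Nc] = -1.0 / (Nc * Nt)

mmd_trace = np.trace(K @ L)                                   # tr(KL), Eq.(7)
mmd_direct = np.sum((Xc.mean(axis=0) - Xt.mean(axis=0))**2)   # Eq.(6)
# The two quantities agree up to floating-point error.
```

With a nonlinear kernel (e.g., Gaussian), only `K` changes; the block structure of `L` is identical.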
The two learning objectives are motivated from different perspectives, and therefore they are complementary to each other. By combining the objectives for nonlinear and balanced representations in Eq.(3) and Eq.(8), we can extract effective representations for the purpose of treatment effect estimation.

The objective function of BNR is formulated as follows:

argmax_P  F(P, Φ(X), Y_c) − β Dist(Ψ(X_C), Ψ(X_T)) = tr(P^T (K_I − α K_W) P) − β tr(P^T K L K P),   s.t. P^T P = I,   (9)

where β is a trade-off parameter that balances the effects of the two terms. A negative sign is added before β Dist(Ψ(X_C), Ψ(X_T)) in order to adapt it to this maximization problem.

Algorithm 1. BNR-NNM
Input: Treatment group X_T ∈ R^{d×N_T}; control group X_C ∈ R^{d×N_C}; outcome vectors Y_T and Y_C; total sample size N; kernel function k; parameters α, β, c.
1: Convert outcomes to (pseudo) ordinal labels
2: Construct K_I and K_W using Eqs.(4) and (5)
3: Construct the kernel matrix K using Eq.(7)
4: Learn the transformation P using Eq.(9)
5: Construct the kernel matrices K_C and K_T
6: Project K_C and K_T using P: X̂_C = P^T K_C, X̂_T = P^T K_T
7: Perform NNM between X̂_C and X̂_T
8: Estimate the ATT A from Eq.(2)
Output: Return A

The problem in Eq.(9) can be efficiently solved using a closed-form solution described in Proposition 1. The proof is provided in the supplementary document due to the space limit.

Proposition 1. The optimal solution of P in problem Eq.(9) is given by the eigenvectors of the matrix (K_I − α K_W − β K L K) corresponding to the m leading eigenvalues.

4 BNR for Nearest Neighbor Matching

Leveraging the balanced nonlinear representations extracted from observational data, we propose a novel nearest neighbor matching estimator named BNR-NNM. After obtaining the transformation P in kernel space, we can generate nonlinear and balanced representations for control and treated units as X̂_C = P^T K_C and X̂_T = P^T K_T, where K_C and K_T are kernel matrices defined on the control and treatment groups, respectively. Then we follow the basic
Then we follow the basic\nidea of nearest neighbor matching. On the new representations \u02c6XC and \u02c6XT , we calculate the distance\nbetween each treated unit and control unit, and choose the one with the smallest distance. The\noutcome of the selected control unit serves as the estimation of counterfactual. Finally, the average\ntreatment effect on treated (ATT) can be calculated, as de\ufb01ned in Eq.(2). The complete procedures of\nBNR-NNM are summarized in Algorithm 1.\nThe estimated ATT is dependent on the transfor-\nmation matrix P . Although P is optimal for the\nrepresentation learning model Eq.(9), it might not\nbe optimal for the whole causal inference process,\nfor three reasons. First, the model Eq.(9) contains\ntwo major hyperparameters, \u03b1 and \u03b2. Different \u201cop-\ntimal\u201d transformations P would be obtained with\ndifferent parameter settings. Second, the ground-\ntruth label information required by supervised learn-\ning are unknown. Recall that we categorize the\noutcome vector as pseudo labels, which introduces\nconsiderable uncertainty. Third, the ground-truth\ninformation of causal effect is unknown in observa-\ntional studies with real-world data. Therefore, it is\nimpossible to use the faithful supervision informa-\ntion of causal effect to guide the learning process.\nThese uncertainties from three perspectives might result in an unreliable estimation of ATT.\nThus, we present two strategies to tackle the above issue. (1) Median causal effect from multiple\nestimations. Following the randomized NNM estimator [25], we implement multiple settings of\nBNR-NNM with different parameters \u03b1, \u03b2 and c, calculate multiple ATT values, and \ufb01nally choose\nthe median value as the \ufb01nal estimation. In this way, a robust estimation of causal effect can be\nobtained. (2) Model selection by cross-validation. 
Alternatively, the cross-validation strategy can be employed to select proper values for α and β, by equally dividing the data and pseudo labels into k subsets. Although the multiple runs in the above strategies increase the computational cost, our method is still efficient, for three reasons. First, the dimension of covariates is significantly reduced, which enables a faster matching process. Second, owing to the closed-form solution for P introduced in Proposition 1, the representation learning procedure is efficient. Third, these settings are independent of each other, and therefore they can be executed in parallel.

5 Experiments and Analysis

Synthetic Dataset. Data Generation. We generate a synthetic dataset by following the protocols described in [41, 25]. In particular, the sample size N is set to 1000, and the number of covariates d is set to 100. The following basis functions are adopted in the data generation process: g1(x) = x − 0.5, g2(x) = (x − 0.5)² + 2, g3(x) = x² − 1/3, g4(x) = −2 sin(2x), g5(x) = e^{−x} − e^{−1} − 1, g6(x) = e^{−x}, g7(x) = x², g8(x) = x, g9(x) = I_{x>0}, and g10(x) = cos(x). For each unit, the covariates x1, x2, ..., xd are drawn independently from the standard normal distribution N(0, 1). We only consider binary treatment in this paper, and define the treatment vector T as T|x = 1 if Σ_{k=1}^{5} g_k(x_k) > 0 and T|x = 0 otherwise. Given the covariate vector x and the treatment vector T, the outcome variables in Y are generated from the following model: Y | x, T ∼ N(Σ_{j=1}^{5} g_{j+5}(x_j) + T, 1). Clearly, Y contains continuous values. The first five covariates are correlated with the treatments in T and the outcomes in Y, simulating a confounding effect, while the rest are noisy components. By definition, the true causal effect (i.e., the ground truth of ATT) in this dataset is 1.

Baselines and Settings. We compare our matching estimator BNR-NNM with the following baseline methods: Euclidean distance based NNM (Eud-NNM), Mahalanobis distance based NNM (Mah-NNM) [34], PSM [31], principal component analysis based NNM (PCA-NNM), locality preserving projections based NNM (LPP-NNM), and randomized NNM (RNNM) [25]. PSM is a classical causal inference approach, which estimates the propensity score for each control or treated unit using logistic regression, and then performs matching on these scores. As our approach learns new representations via transformations, we also implement two matching estimators based on the popular subspace learning methods PCA [22] and LPP [13]; nearest neighbor matching is performed in the low-dimensional feature space learned by PCA and LPP, respectively. RNNM is a state-of-the-art matching estimator, especially for high-dimensional data. It projects units to multiple random subspaces, performs matching in each of them, and finally selects the median value of the estimations. In RNNM, the number of random projections is set to 20. The proposed BNR-NNM and RNNM share a similar idea of projecting data to low-dimensional subspaces, but they have different motivations and learn different data representations.

The major parameters in BNR-NNM include α, β, and c. In the experiments, α is empirically set to 1, β is chosen from {10⁻³, 10⁻¹, 1, 10, 10³}, and the number of categories c is chosen from {2, 4, 6, 8}. As described in Section 4, the median ATT of multiple estimations is used as the final result.
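Putting Proposition 1 and the median strategy together, the core of the procedure can be sketched as follows (our simplified illustration; it assumes precomputed scatter and kernel matrices rather than the full pipeline):

```python
import numpy as np

def bnr_transform(K_I, K_W, K, L, alpha, beta, m):
    """Closed-form solution of Eq.(9), per Proposition 1: P is formed by
    the m leading eigenvectors of (K_I - alpha*K_W - beta*K L K)."""
    M = K_I - alpha * K_W - beta * K @ L @ K
    M = (M + M.T) / 2                      # symmetrize for numerical safety
    w, V = np.linalg.eigh(M)               # eigenvalues in ascending order
    return V[:, np.argsort(w)[::-1][:m]]   # columns: top-m eigenvectors

def median_att(att_values):
    """Robust final estimate: median of ATT values obtained under
    multiple (alpha, beta, c) settings, as in Section 4."""
    return float(np.median(att_values))
```

Each parameter setting yields one transformation P, one matching run, and one ATT value; `median_att` then aggregates them, and the independent settings can run in parallel.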
We use the Gaussian kernel function k(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)), in which the bandwidth parameter σ is empirically set to 5. In the experiments we observe that our approach allows flexible settings for these parameters, and selecting parameters from a wider range still leads to a robust estimation of ATT.

Results and Discussions. To ensure a robust assessment of each matching estimator, we repeat the data generation process 500 times, calculate the ATT for each estimator in every replication, and compute the mean square error (MSE) with the standard error (SD) for each estimator over all of the replications. Eud-NNM and Mah-NNM perform matching in the original covariate space, and PSM maps each unit to a single score; thus we only have a single point estimate for each of them. For PCA-NNM, LPP-NNM, RNNM and our method, we can choose the dimension of the feature space in which matching is conducted. Specifically, we increase the dimension from 2 to 100, and calculate the MSE and SD in each case. Figure 1 shows the MSE and SD (shown as error bars) of each estimator when varying the dimension. We observe from Figure 1 that the proposed estimator BNR-NNM obtains a lower MSE than all other methods in every case. The lowest MSE is achieved when the dimension is 5. In addition, we have analyzed the sensitivity of the parameter settings; the detailed results are provided in the supplementary document.

IHDP Dataset with Simulated Outcomes. IHDP data [16] is an experimental dataset collected by the Infant Health and Development Program. In particular, a randomized experiment was conducted, in which intensive high-quality care was provided to low-birth-weight and premature infants. Using the original data, an observational study can be constructed by removing a nonrandom subset of the treatment group: all children with non-white mothers.
After this preprocessing step, there are in total 24 pretreatment covariates (excluding race) and 747 units, including 608 control units and 139 treatment units. The outcomes are simulated by using the pretreatment covariates and the treatment assignment information, so that the unconfoundedness assumption holds.

Figure 1: MSE of different estimators on the synthetic dataset. Note that Eud-NNM and Mah-NNM only involve matching in the original 100-dimensional data space.

Table 1: Results on IHDP dataset.

Method     ε_ATT
Eud-NNM    0.18±0.06
Mah-NNM    0.31±0.12
PSM        0.26±0.08
PCA-NNM    0.19±0.11
LPP-NNM    0.25±0.13
RNNM       0.16±0.07
BNR-NNM    0.16±0.06

Due to the space limit, the outcome simulation procedures are provided in the supplementary document. We repeat such procedures 200 times and generate 200 sets of simulated outcomes, in order to conduct extensive evaluations. For each set of simulated outcomes, we run our method and the baselines introduced above, and report the results in Table 1. We use the error in the average treatment effect on the treated (ATT), ε_ATT, as the evaluation metric. It is defined as the absolute difference between the true ATT and the estimated ATT (denoted ÂTT), i.e., ε_ATT = |ATT − ÂTT|. Table 1 shows that the proposed BNR-NNM estimator outperforms most baselines, which further validates the effectiveness of the balanced and nonlinear representations.
LaLonde Dataset with Real Outcomes. The LaLonde dataset is a widely used benchmark for observational studies [23]. It consists of a treatment group and a control group.
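The evaluation metric is simple enough to state in code (our own helper functions; treating the ± entries in Table 1 as the mean and standard deviation of ε_ATT over the 200 simulated-outcome sets is an assumption):

```python
import numpy as np

def att_error(att_true, att_hat):
    """eps_ATT = |ATT - ATT_hat|: absolute error of the ATT estimate."""
    return abs(att_true - att_hat)

def summarize(errors):
    """Mean and sample standard deviation of eps_ATT across the
    simulated-outcome sets, matching the 'mean±std' table entries."""
    e = np.asarray(errors, dtype=float)
    return e.mean(), e.std(ddof=1)
```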
The treatment group contains 297 units from a randomized study of a job training program (the "National Supported Work Demonstration"), where an unbiased estimate of the average treatment effect is available. The original LaLonde dataset contains 425 control units collected from the Current Population Survey. Recently, Imai et al. augmented the data by including 2,490 units from the Panel Study of Income Dynamics [18], increasing the sample size of the control group to 2,915. For each sample, the covariates include age, education, race (black, white, or Hispanic), marriage status, high school degree, earnings in 1974, and earnings in 1975. The outcome variable is earnings in 1978. In this benchmark dataset, the unbiased estimate of ATT is $886 with a standard error of $448.
We compare our estimator with the baselines used in the previous experiments. In addition, we also compare with a recently proposed matching estimator, covariate balancing propensity score (CBPS) [18], and a deep neural network (DNN) method [37]. CBPS aims to achieve balanced distributions between the control and treatment groups by adjusting the weights of covariates. The DNN method utilizes a deep neural network architecture for counterfactual regression, and is the state-of-the-art method for representation learning based counterfactual inference. For BNR-NNM, we use the same settings for β and c as in the previous experiments. Table 2 shows the ground truth of ATT and the estimates of the different methods. We observe from Table 2 that CBPS and DNN obtain better results than the other baselines, as both of them consider the balance property across the treatment and control groups.
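The BIAS (%) column of Table 2 can be recomputed from the ATT column (a minimal check with our own helper; the paper does not state the formula explicitly, but the absolute deviation from the true effect of $886, expressed as a percentage, reproduces the reported values):

```python
def bias_percent(att_hat, att_true=886.0):
    """Bias as a percentage of the true effect:
    |att_hat - att_true| / att_true * 100."""
    return abs(att_hat - att_true) / att_true * 100.0
```

For example, BNR-NNM's estimate of 783.6 gives |783.6 − 886| / 886 ≈ 11.6%, which rounds to the reported 12%.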
Moreover, our BNR-NNM estimator achieves the best result, as it fully exploits balanced and nonlinear feature representations. Evaluations of the runtime behavior of each compared method are provided in the supplementary document due to the space limit.

Table 2: Results on LaLonde dataset. BIAS (%) is the bias in percentage of the true effect.

Method         ATT       SD       BIAS (%)
Ground Truth   886       488      N/A
Eud-NNM        -565.9    592.8    164%
Mah-NNM        -67.9     526.1    108%
PSM            -947.6    567.9    201%
PCA-NNM        -499.8    592.5    156%
LPP-NNM        -457.1    581.2    152%
RNNM           -557.6    584.9    163%
CBPS           423.3     1295.2   52%
DNN            742.0     N/A      16%
BNR-NNM        783.6     546.3    12%

6 Conclusions
In this paper, we propose a novel matching estimator based on balanced and nonlinear representations for treatment effect estimation. Our method leverages the predictive power of machine learning models to estimate counterfactuals, and achieves balanced distributions in an intermediate feature space. In particular, an ordinal scatter discrepancy criterion is designed to extract discriminative features from observational data with ordinal pseudo labels, while a maximum mean discrepancy criterion is incorporated to achieve balanced distributions. Extensive experimental results on three synthetic and real-world datasets show that our approach provides more accurate estimates of causal effects than state-of-the-art matching estimators and representation learning methods. In future work, we will extend the balanced representation learning model to other causal inference strategies such as weighting and regression, and design estimators for multiple levels of treatment.
Acknowledgement. This research is supported in part by the NSF IIS award 1651902, ONR Young Investigator Award N00014-14-1-0484, and U.S. Army Research Office Award W911NF-17-1-0367.

References
[1] Alberto Abadie and Guido W Imbens.
Large sample properties of matching estimators for average\n\ntreatment effects. Econometrica, 74(1):235\u2013267, 2006.\n\n[2] Deepak Agarwal, Lihong Li, and Alexander J Smola. Linear-time estimators for propensity scores. In\nProceedings of the International Conference on Arti\ufb01cial Intelligence and Statistics, pages 93\u2013100, 2011.\n\n[3] Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of\n\nthe National Academy of Sciences, 113(27):7353\u20137360, 2016.\n\n[4] Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bernhard Sch\u00f6lkopf, and Alex J\nSmola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics,\n22(14):e49\u2013e57, 2006.\n\n[5] Kay H Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy, and Steven L Scott. Inferring causal\nimpact using bayesian structural time-series models. The Annals of Applied Statistics, 9(1):247\u2013274, 2015.\n\n[6] David Chan, Rong Ge, Ori Gershony, Tim Hesterberg, and Diane Lambert. Evaluating online ad campaigns\nin a pipeline: causal models at scale. In Proceedings of the 16th ACM SIGKDD International Conference\non Knowledge Discovery and Data Mining, pages 7\u201316. ACM, 2010.\n\n[7] Yale Chang and Jennifer G Dy. Informative subspace learning for counterfactual inference. In Proceedings\n\nof the Thirty-First AAAI Conference on Arti\ufb01cial Intelligence, pages 1770\u20131776, 2017.\n\n[8] Rajeev H Dehejia and Sadek Wahba. Propensity score-matching methods for nonexperimental causal\n\nstudies. Review of Economics and Statistics, 84(1):151\u2013161, 2002.\n\n[9] Alexis Diamond and Jasjeet S Sekhon. Genetic matching for estimating causal effects: A general\nmultivariate matching method for achieving balance in observational studies. Review of Economics and\nStatistics, 95(3):932\u2013945, 2013.\n\n[10] Doris Entner, Patrik Hoyer, and Peter Spirtes. 
Data-driven covariate selection for nonparametric estimation\nof causal effects. In Proceedings of the Sixteenth International Conference on Arti\ufb01cial Intelligence and\nStatistics, pages 256\u2013264, 2013.\n\n[11] David Galles and Judea Pearl. An axiomatic characterization of causal counterfactuals. Foundations of\n\nScience, 3(1):151\u2013182, 1998.\n\n[12] Thomas A Glass, Steven N Goodman, Miguel A Hern\u00e1n, and Jonathan M Samet. Causal inference in\n\npublic health. Annual Review of Public Health, 34:61\u201375, 2013.\n\n[13] Xiaofei He and Partha Niyogi. Locality preserving projections. In Advances in Neural Information\n\nProcessing Systems, pages 153\u2013160, 2004.\n\n[14] James J Heckman, Hidehiko Ichimura, and Petra Todd. Matching as an econometric evaluation estimator.\n\nThe Review of Economic Studies, 65(2):261\u2013294, 1998.\n\n[15] Daniel N Hill, Robert Moakler, Alan E Hubbard, Vadim Tsemekhman, Foster Provost, and Kiril Tse-\nmekhman. Measuring causal impact of online actions via natural experiments: application to display\nadvertising. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery\nand Data Mining, pages 1839\u20131847. ACM, 2015.\n\n[16] Jennifer L Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and\n\nGraphical Statistics, 20(1):217\u2013240, 2012.\n\n[17] Stefano M Iacus, Gary King, and Giuseppe Porro. Causal inference without balance checking: Coarsened\n\nexact matching. Political Analysis, 20(1):1\u201324, 2011.\n\n[18] Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the Royal Statistical\n\nSociety: Series B (Statistical Methodology), 76(1):243\u2013263, 2014.\n\n[19] Guido Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An\n\nIntroduction. Cambridge University Press, 2015.\n\n[20] Hui Jin and Donald B Rubin. 
Principal stratification for causal inference with extended partial compliance. Journal of the American Statistical Association, 103(481):101–111, 2008.

[21] Fredrik D. Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In Proceedings of the 33rd International Conference on Machine Learning, pages 3020–3029, 2016.

[22] Ian Jolliffe. Principal Component Analysis. John Wiley and Sons, 2002.

[23] Robert J LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620, 1986.

[24] Brian K Lee, Justin Lessler, and Elizabeth A Stuart. Improving propensity score weighting using machine learning. Statistics in Medicine, 29(3):337–346, 2010.

[25] Sheng Li, Nikos Vlassis, Jaya Kawale, and Yun Fu. Matching via dimensionality reduction for estimation of treatment effects in digital marketing campaigns. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 3768–3774, 2016.

[26] Qingshan Liu, Xiaoou Tang, Hanqing Lu, Songde Ma, et al. Face recognition using kernel scatter-difference-based discriminant analysis. IEEE Transactions on Neural Networks, 17(4):1081–1085, 2006.

[27] Jerzy Neyman. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5(4):465–480, 1923.

[28] Sinno Jialin Pan, James T Kwok, and Qiang Yang. Transfer learning via dimensionality reduction. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, volume 8, pages 677–682, 2008.

[29] Judea Pearl. Causality. Cambridge University Press, 2009.

[30] Deborah N Peikes, Lorenzo Moreno, and Sean Michael Orzol. Propensity score matching: A note of caution for evaluators of social programs.
The American Statistician, 62(3):222\u2013231, 2008.\n\n[31] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies\n\nfor causal effects. Biometrika, 70(1):41\u201355, 1983.\n\n[32] Donald B Rubin. Matching to remove bias in observational studies. Biometrics, pages 159\u2013183, 1973.\n\n[33] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies.\n\nJournal of Educational Psychology, 66(5):688\u2013701, 1974.\n\n[34] Donald B Rubin. Using multivariate matched sampling and regression adjustment to control bias in\n\nobservational studies. Journal of the American Statistical Association, 74(366):318\u2013328, 1979.\n\n[35] Donald B Rubin and Neal Thomas. Combining propensity score matching with additional adjustments for\n\nprognostic covariates. Journal of the American Statistical Association, 95(450):573\u2013585, 2000.\n\n[36] Adam C Sales, Asa Wilks, and John F Pane. Student usage predicts treatment effect heterogeneity in the\ncognitive tutor algebra i program. In Proceedings of the International Conference on Educational Data\nMining, pages 207\u2013214, 2016.\n\n[37] Uri Shalit, Fredrik Johansson, and David Sontag. Bounding and minimizing counterfactual error. arXiv\n\npreprint arXiv:1606.03976, 2016.\n\n[38] Ricardo Silva and Robin Evans. Causal inference through a witness protection program. In Advances in\n\nNeural Information Processing Systems, pages 298\u2013306, 2014.\n\n[39] Peter Spirtes. Introduction to causal inference. Journal of Machine Learning Research, 11(May):1643\u2013\n\n1662, 2010.\n\n[40] Elizabeth A Stuart. Matching methods for causal inference: A review and a look forward. Statistical\n\nscience: a review journal of the Institute of Mathematical Statistics, 25(1):1\u201321, 2010.\n\n[41] Wei Sun, Pengyuan Wang, Dawei Yin, Jian Yang, and Yi Chang. Causal inference via sparse additive\nmodels with application to online advertising. 
In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 297–303, 2015.

[42] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. arXiv preprint arXiv:1510.04342, 2015.

[43] Pengyuan Wang, Wei Sun, Dawei Yin, Jian Yang, and Yi Chang. Robust tree-based causal inference for complex ad effectiveness analysis. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 67–76. ACM, 2015.

[44] Pengyuan Wang, Dawei Yin, Jian Yang, Yi Chang, and Marsha Meytlis. Rethink targeting: detect ‘smart cheating’ in online advertising through causal inference. In Proceedings of the 24th International Conference on World Wide Web Companion, pages 133–134, 2015.

[45] Kun Zhang, Mingming Gong, and Bernhard Schölkopf. Multi-source domain adaptation: A causal view. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 3150–3157, 2015.

[46] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In Proceedings of the International Conference on Machine Learning (3), pages 819–827, 2013.