{"title": "Representation Learning for Treatment Effect Estimation from Observational Data", "book": "Advances in Neural Information Processing Systems", "page_first": 2633, "page_last": 2643, "abstract": "Estimating individual treatment effect (ITE) is a challenging problem in causal inference, due to the missing counterfactuals and the selection bias. Existing ITE estimation methods mainly focus on balancing the distributions of control and treated groups, but ignore the local similarity information that provides meaningful constraints on ITE estimation. In this paper, we propose a local similarity preserved individual treatment effect (SITE) estimation method based on deep representation learning. SITE preserves local similarity and balances data distributions simultaneously, by focusing on several hard samples in each mini-batch. Experimental results on synthetic and three real-world datasets demonstrate the advantages of the proposed SITE method, compared with state-of-the-art ITE estimation methods.", "full_text": "Representation Learning for Treatment Effect Estimation from Observational Data\n\nLiuyi Yao\nSUNY at Buffalo\nliuyiyao@buffalo.edu\n\nSheng Li\nUniversity of Georgia\nsheng.li@uga.edu\n\nYaliang Li\nTencent Medical AI Lab\nyaliangli@tencent.com\n\nMengdi Huai\nSUNY at Buffalo\nmengdihu@buffalo.edu\n\nJing Gao\nSUNY at Buffalo\njing@buffalo.edu\n\nAidong Zhang\nSUNY at Buffalo\nazhang@buffalo.edu\n\nAbstract\n\nEstimating individual treatment effect (ITE) is a challenging problem in causal inference, due to the missing counterfactuals and the selection bias. Existing ITE estimation methods mainly focus on balancing the distributions of control and treated groups, but ignore the local similarity information that provides meaningful constraints on ITE estimation. In this paper, we propose a local similarity preserved individual treatment effect (SITE) estimation method based on deep representation learning. 
SITE preserves local similarity and balances data distributions simultaneously, by focusing on several hard samples in each mini-batch. Experimental results on synthetic and three real-world datasets demonstrate the advantages of the proposed SITE method, compared with state-of-the-art ITE estimation methods.\n\n1 Introduction\n\nEstimating the causal effect of an intervention/treatment at the individual level is an important problem that can benefit many domains including health care [12, 1], digital marketing [6, 34, 24], and machine learning [10, 37, 21, 23, 20]. For example, in the medical area, many pharmaceutical companies have developed various anti-hypertensive medicines, and all of them claim to be effective for high blood pressure. However, for a specific patient, which one is more effective? Treatment effect estimation methods are necessary to answer this question, and they lead to better decision making. Treatment effects can be estimated at either the group level or the individual level. In this paper, we focus on individual treatment effect (ITE) estimation.\n\nTwo types of studies are usually conducted for estimating treatment effects: randomized controlled trials (RCTs) and observational studies. In RCTs, the treatment assignment is controlled, and thus the distributions of the treatment and control groups are known, which is a desired property for treatment effect estimation. However, conducting RCTs is expensive and time-consuming, and it sometimes even raises ethical issues. Unlike RCTs, an observational study directly estimates treatment effects from the observed data, without any control over the treatment assignment. 
Owing to the easy availability of observational data, observational studies, such as the potential outcome framework [27] and causal graphical models [26, 35], have been widely applied in various domains [15, 38, 12].\n\nEstimating individual treatment effect from observational data faces two major challenges: missing counterfactuals and treatment selection bias. ITE is defined as the expected difference between the treated outcome and the control outcome. However, a unit can only belong to one group, and thus the outcome under the other treatment (i.e., the counterfactual) is always missing.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nEstimating counterfactual outcomes from observed data is a reasonable way to address this issue. However, selection bias makes it more difficult to infer the counterfactuals in practice. For instance, in uncontrolled settings, people have different preferences for the treatment, and thus there can be considerable distribution discrepancy across groups. This distribution discrepancy further leads to inaccurate estimation of counterfactuals.\n\nTo overcome the above challenges, some traditional ITE estimation methods use the treatment assignment as a feature, and train regression models to estimate the counterfactual outcomes [11]. Several nearest neighbor based methods are also adopted to find nearby training samples, such as k-NN [8], propensity score matching [27], and nearest neighbor matching with the HSIC criterion [5]. Besides, some tree and forest based methods [7, 33, 4, 3] view trees and forests as an adaptive neighborhood metric, and estimate the treatment effect at the leaf nodes. 
Recently, representation learning approaches have been proposed for counterfactual inference, which try to minimize the distribution difference between treated and control groups in the embedding space [30, 18]. State-of-the-art ITE estimation methods aim to balance the distributions from a global view; however, they ignore the local similarity information. As similar units shall have similar outcomes, it is of great importance to preserve the local similarity information among units during representation learning, which decreases the generalization error in counterfactual estimation. This point has also been confirmed by nearest neighbor based methods. Unfortunately, in recent representation learning based approaches, the local similarity information may not be preserved during distribution balancing. On the other hand, nearest neighbor based methods only consider the local similarity, but cannot balance the distributions globally. Our proposed method combines the advantages of both.\n\nIn this paper, we propose a novel local similarity preserved individual treatment effect estimation method (SITE) based on deep representation learning. SITE maps mini-batches of units from the covariate space to a latent space using a representation network. In the latent space, SITE preserves the local similarity information using the Position-Dependent Deep Metric (PDDM), and balances the data distributions with a Middle-point Distance Minimization (MPDM) strategy. PDDM and MPDM can be viewed as regularizers, which help learn a better representation and decrease the generalization error in estimating the potential outcomes. Implementing PDDM and MPDM only involves triplets and quartets of units, respectively, from each mini-batch, which makes SITE efficient for large-scale data. 
The proposed method is validated on both synthetic and real-world datasets, and the experimental results demonstrate the advantages brought by preserving the local similarity information.\n\n2 Methodology\n\n2.1 Preliminary\n\nIndividual treatment effect (ITE) estimation aims to examine whether a treatment T affects the outcome Y(i) of a specific unit i. Let xi ∈ R^d denote the pre-treatment covariates of unit i, where d is the number of covariates. Ti denotes the treatment on unit i. In the binary treatment case, unit i is assigned to the control group if Ti = 0, or to the treated group if Ti = 1.\n\nWe follow the potential outcome framework proposed by Neyman and Rubin [29, 31]. If the treatment Ti has not been applied to unit i (also known as the out-of-sample case [30]), Y(i)_0 is called the potential outcome of treatment Ti = 0, and Y(i)_1 the potential outcome of treatment Ti = 1. On the other hand, if unit i has already received a treatment Ti (i.e., the within-sample case [30]), Y_Ti is the factual outcome, and Y_{1−Ti} is the counterfactual outcome. In an observational study, only the factual outcomes are available, while the counterfactual outcomes can never be observed.\n\nThe individual treatment effect on unit i is defined as the difference between the potential treated and control outcomes (footnote 1):\n\nITE_i = Y(i)_1 − Y(i)_0. (1)\n\nThe challenge in estimating ITE_i lies in how to estimate the missing counterfactual outcome. Existing counterfactual estimation methods usually make the following important assumptions [17].\n\n(Footnote 1: Some works [30] define ITE in the form of CATE: ITE_i = E(Y(i)_1 | x) − E(Y(i)_0 | x).)\n\nFigure 1: Framework of similarity preserved individual treatment effect estimation (SITE).\n\nAssumption 2.1 (SUTVA). 
The potential outcomes for any unit do not vary with the treatments assigned to other units, and, for each unit, there are no different forms or versions of each treatment level that would lead to different potential outcomes [17].\n\nAssumption 2.2 (Consistency). The potential outcome of treatment t equals the observed outcome if the actual treatment received is t.\n\nAssumption 2.3 (Ignorability). Given the pre-treatment covariates X, the outcome variables Y0 and Y1 are independent of the treatment assignment, i.e., (Y0, Y1) ⊥⊥ T | X.\n\nThe ignorability assumption makes the ITE estimation identifiable. Though it is hard to verify that this assumption holds, researchers can make it more plausible by including in the pre-treatment covariates as many variables as possible that affect both the treatment assignment and the outcome. This assumption is also called “no unmeasured confounders”.\n\nAssumption 2.4 (Positivity). For any set of covariates x, the probability of receiving treatment 0 or 1 is positive, i.e., 0 < P(T = t | X = x) < 1, for all t and x.\n\nThis assumption is also known as population overlap [9]. If for some values of X the treatment assignment is deterministic (i.e., P(T = t | X = x) = 0 or 1), we would lack observations from one treatment group, so that the counterfactual outcome could not be estimated. Therefore, the positivity assumption guarantees that the ITE is estimable.\n\n2.2 Motivation\n\nBalancing the distributions of the control group and the treated group has been recognized as an effective strategy for counterfactual estimation. Recent works have applied distribution balancing constraints to either the covariate space [16] or the latent space [18, 23].\n\nMoreover, we assume that similar units have similar outcomes. 
This assumption has been well justified in many classical counterfactual estimation methods, such as nearest neighbor matching. To satisfy this assumption in the representation learning setting, the local similarity information should be well preserved after mapping units from the covariate space X to the latent space Z. One straightforward solution is to add a constraint on similarity matrices constructed in X and Z. However, constructing similarity matrices and enforcing such a “global” constraint is very time and space consuming, especially for a large number of units in practice. Motivated by the hard sample mining approach in the image classification area [14], we design an efficient local similarity preserving strategy based on triplet pairs.\n\n2.3 Proposed Method\n\nWe propose a local similarity preserved individual treatment effect estimation (SITE) method based on deep representation learning. The key idea of SITE is to map the original pre-treatment covariate space X into a latent space Z learned by deep neural networks. In particular, SITE attempts to enforce two special properties on the latent space Z: balanced distributions and preserved similarity.\n\nFigure 2: Triplet pairs selection for PDDM in a mini-batch.\n\nThe framework of SITE is shown in Figure 1, which contains five major components: the representation network, triplet pairs selection, the position-dependent deep metric (PDDM), middle point distance minimization (MPDM), and the outcome prediction network. To improve model efficiency, SITE takes input units in a mini-batch fashion, and triplet pairs are selected from every mini-batch. The representation network learns latent embeddings for the input units. With the selected triplet pairs, PDDM and MPDM are able to preserve the local similarity information and meanwhile achieve balanced distributions in the latent space. 
Finally, the mini-batch embeddings are fed forward to a dichotomous outcome prediction network to obtain the potential outcomes.\n\nThe loss function of SITE is as follows:\n\nL = L_FL + β L_PDDM + γ L_MPDM + λ ||W||_2, (2)\n\nwhere L_FL is the factual loss between the estimated and observed factual outcomes. L_PDDM and L_MPDM are the loss functions for PDDM and MPDM, respectively. The last term is an L2 regularization on the model parameters W (except the bias terms).\n\nNext, we describe each component of SITE in detail.\n\n2.3.1 Representation Network\n\nInspired by [18], a standard feed-forward network with dh hidden layers and the rectified linear unit (ReLU) activation function is built to learn latent representations from the pre-treatment covariates. For unit i, we have zi = f(xi), where f(·) denotes the representation function learned by the deep network.\n\n2.3.2 Triplet Pairs Selection\n\nGiven a mini-batch of input units, SITE selects six units according to their propensity scores. The propensity score is the probability that a unit receives the treatment [28, 22]. For unit i, the propensity score si is defined as si = P(ti = 1 | X = xi). Obviously, si ∈ [0, 1]. If si is close to 1, more treated units are distributed around unit i in the covariate space. Analogously, if si is close to 0, more control units are available near unit i. Moreover, if si is close to 0.5, a mixture of both control and treated units can be found around unit i. Thus, the propensity score roughly reflects the relative location of a unit in the covariate space, and we choose it as the indicator for selecting the six data points. 
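A propensity score model can be as simple as a logistic regression on the covariates. The following is our own minimal numpy sketch (fit by plain gradient descent, not any particular solver used in the paper):

```python
import numpy as np

def fit_propensity(X, t, lr=0.1, steps=2000):
    """Estimate s_i = P(t_i = 1 | x_i) with logistic regression via gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        s = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (s - t) / len(t)       # gradient of the logistic log-loss
    return 1.0 / (1.0 + np.exp(-Xb @ w))

# Toy data with mild selection bias driven by the first covariate.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
t = (rng.random(500) < 1.0 / (1.0 + np.exp(-X[:, 0]))).astype(float)
s = fit_propensity(X, t)
print(s.min(), s.max())  # all scores lie strictly inside (0, 1)
```

Scores near 0 or 1 would signal a positivity (overlap) problem for the unit in question.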
We use logistic regression to calculate the propensity score [27]. Selecting three pairs of units in each mini-batch involves three steps, as shown in the left part of Figure 2.\n\n• Step 1: Choose the data pair (x_î, x_ĵ) s.t.\n\n(î, ĵ) = argmin_{i∈T, j∈C} |si − 0.5| + |sj − 0.5|, (3)\n\nwhere T and C denote the treated group and the control group, respectively. x_î and x_ĵ are the closest units in the intermediate region where both control and treated units are mixed.\n\n• Step 2: Choose (x_k̂, x_l̂) s.t.\n\nk̂ = argmax_{k∈C} |sk − s_î|, l̂ = argmax_l |sl − s_k̂|. (4)\n\nx_k̂ is the farthest control unit from x_î, and lies on the margin of the control group where control units are plentiful.\n\n• Step 3: Choose (x_m̂, x_n̂) s.t.\n\nm̂ = argmax_{m∈T} |sm − s_ĵ|, n̂ = argmax_n |sn − s_m̂|. (5)\n\nAnalogously, x_m̂ is the farthest treated unit from x_ĵ, and lies on the margin of the treated group where treated units are plentiful.\n\nThe pair (î, ĵ) lies in the intermediate region of the control and treated groups. Pairs (k̂, l̂) and (m̂, n̂) are located on the margins that are far away from the intermediate region. The selected triplet pairs can be viewed as hard cases. Intuitively, if the desired property of preserved similarity can be achieved for the hard cases, it will hold for other cases as well. 
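The three selection steps above can be sketched in numpy as follows (our own illustration; note the joint argmin in Step 1 separates into two independent argmins because its two terms involve disjoint index sets, and both groups are assumed non-empty):

```python
import numpy as np

def select_triplet_pairs(s, t):
    """Select the six hard units (i, j, k, l, m, n) of Steps 1-3 from one
    mini-batch, given propensity scores s and binary treatments t."""
    s, t = np.asarray(s, float), np.asarray(t, int)
    treated, control = np.where(t == 1)[0], np.where(t == 0)[0]
    # Step 1: the treated/control pair closest to the s = 0.5 intermediate region.
    i = treated[np.argmin(np.abs(s[treated] - 0.5))]
    j = control[np.argmin(np.abs(s[control] - 0.5))]
    # Step 2: control unit farthest from i, then the unit farthest from k.
    k = control[np.argmax(np.abs(s[control] - s[i]))]
    l = np.argmax(np.abs(s - s[k]))
    # Step 3: treated unit farthest from j, then the unit farthest from m.
    m = treated[np.argmax(np.abs(s[treated] - s[j]))]
    n = np.argmax(np.abs(s - s[m]))
    return tuple(int(v) for v in (i, j, k, l, m, n))

s = np.array([0.10, 0.45, 0.52, 0.90, 0.30, 0.70])
t = np.array([0, 0, 1, 1, 0, 1])
print(select_triplet_pairs(s, t))  # (2, 1, 0, 3, 3, 0)
```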
Thus, we focus on preserving such a property for the hard cases (i.e., the triplet pairs) in the latent space, and employ PDDM to achieve this goal.\n\n2.3.3 Position-Dependent Deep Metric (PDDM)\n\nPDDM was originally proposed to address the hard sample mining problem in image classification [14]. We adapt the design to the counterfactual estimation problem. In SITE, the PDDM component measures the local similarity of two units based on their relative and absolute positions in the latent space Z. The PDDM learns a metric that makes the local similarity of (zi, zj) in the latent space close to their similarity in the original space. The similarity Ŝ(i, j) is defined as:\n\nŜ(i, j) = Ws h + bs, (6)\n\nwhere h = σ(Wc [u1/||u1||_2, v1/||v1||_2]^T + bc), u1 = σ(Wu u/||u||_2 + bu), v1 = σ(Wv v/||v||_2 + bv), u = |zi − zj|, and v = (zi + zj)/2. Wc, Ws, Wv, Wu, bc, bs, bv and bu are the model parameters. σ(·) is a nonlinear function such as ReLU.\n\nFigure 3: PDDM Structure.\n\nAs shown in Figure 3, the PDDM structure first calculates the feature mean vector v and the absolute position vector u of the input (zi, zj), and then feeds v and u to fully connected layers separately. After normalization, PDDM concatenates the learned vectors u1 and v1, and feeds the result to another fully connected layer to obtain the vector h. 
The final similarity score Ŝ(·,·) is calculated by mapping the vector h to the R^1 space.\n\nThe loss function of PDDM is as follows:\n\nL_PDDM = (1/5) Σ_{î,ĵ,k̂,l̂,m̂,n̂} [ (Ŝ(k̂, l̂) − S(k̂, l̂))^2 + (Ŝ(m̂, n̂) − S(m̂, n̂))^2 + (Ŝ(k̂, m̂) − S(k̂, m̂))^2 + (Ŝ(î, m̂) − S(î, m̂))^2 + (Ŝ(ĵ, k̂) − S(ĵ, k̂))^2 ], (7)\n\nwhere S(i, j) = 0.75 |(si + sj)/2 − 0.5| − |(si − sj)/2| + 0.5. Similar to the design of the PDDM structure, the true similarity score S(i, j) is calculated using the mean and the difference of the two propensity scores. The loss function L_PDDM measures the similarity loss on five pairs in each mini-batch: the pairs located in the margin areas of the mini-batch, i.e., (z_k̂, z_l̂) and (z_m̂, z_n̂); the pair that is most dissimilar among the selected points, i.e., (z_k̂, z_m̂); and the pairs located on the margins of the control/treated groups, i.e., (z_ĵ, z_k̂) and (z_î, z_m̂). As shown in Figure 2, minimizing L_PDDM on the above five pairs helps to preserve the similarity when mapping the original data into the representation space. By using the PDDM structure, the similarity information within and between the pairs (z_k̂, z_l̂), (z_m̂, z_n̂), and (z_k̂, z_m̂) is preserved.\n\n2.3.4 Middle Point Distance Minimization (MPDM)\n\nTo achieve balanced distributions in the latent space, we design the middle point distance minimization (MPDM) component in SITE. MPDM makes the middle point of (z_î, z_m̂) close to the middle point of (z_ĵ, z_k̂). The units z_î and z_ĵ are located in a region where control and treated units are sufficient and mixed. 
In other words, they are the closest units from the treated and control groups, respectively, that lie in the intermediate zone. Meanwhile, z_k̂ is the farthest control unit from the margin of the treated group, and z_m̂ is the farthest treated unit from the margin of the control group.\n\nFigure 4: The effect of balancing distributions and preserving local similarity by using the proposed SITE method.\n\nWe use the middle points of (z_î, z_m̂) and (z_ĵ, z_k̂) to approximate the centers of the treated and control groups, respectively. By minimizing the distance between the two middle points, the units in the margin areas are gradually pulled toward the intermediate region. As a result, the distributions of the two groups become balanced.\n\nThe loss function of MPDM is as follows:\n\nL_MPDM = Σ_{î,ĵ,k̂,m̂} ( (z_î + z_m̂)/2 − (z_ĵ + z_k̂)/2 )^2. (8)\n\nMPDM balances the distributions of the two groups in the latent space, while PDDM preserves the local similarity. A 2-D toy example shown in Figure 4 vividly demonstrates the combined effect of MPDM and PDDM. The four units x_î, x_ĵ, x_k̂ and x_m̂ are the same as those chosen in Figure 2. Figure 4 shows that MPDM moves the treated group toward the control group, and PDDM restricts the way the two groups approach each other. PDDM preserves the similarity information between x_k̂ and x_m̂, which are the farthest data points in the treated and control groups. 
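Eq. (8) reduces to a squared distance between two midpoints. A minimal numpy illustration (our own sketch; `z` holds latent embeddings indexed by the selected units):

```python
import numpy as np

def mpdm_loss(z, i, j, k, m):
    """Squared distance between the middle point of (z_i, z_m), the approximate
    treated-group center, and the middle point of (z_j, z_k), the control one."""
    mid_treated = (z[i] + z[m]) / 2.0
    mid_control = (z[j] + z[k]) / 2.0
    return float(np.sum((mid_treated - mid_control) ** 2))

z = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [-3.0, 0.0]])
# i=0, m=2 give midpoint (2, 0); j=1, k=3 give midpoint (-1, 0).
print(mpdm_loss(z, i=0, j=1, k=3, m=2))  # 9.0
```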
As MPDM pulls the two groups toward each other, PDDM ensures that the data points x_k̂ and x_m̂ remain the farthest apart, which prevents MPDM from squeezing all data points into a single point.\n\n2.3.5 Outcome Prediction Network\n\nWith the PDDM and MPDM components, SITE is able to learn latent representations zi that balance the distributions of the treated/control groups and preserve the local similarity of units in the original covariate space. Finally, the outcome prediction network is employed to estimate the outcome ŷ(i)_ti by taking zi as input. Let g(·) denote the function learned by the outcome prediction network. We have ŷ(i)_ti = g(zi, ti) = g(f(xi), ti).\n\nThe factual loss function is as follows:\n\nL_FL = Σ_{i=1}^{N} (ŷ(i)_ti − y(i)_ti)^2 = Σ_{i=1}^{N} (g(f(xi), ti) − y(i)_ti)^2, (9)\n\nwhere y(i)_ti is the observed outcome.\n\n2.3.6 Implementation and Joint Optimization\n\nThe representation network and the outcome prediction network are standard feed-forward neural networks with Dropout [32] and the ReLU activation function. The overall loss function of SITE in Eq. (2) is jointly optimized, and Adam [19] is adopted to solve the optimization problem. 
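Structurally, the joint objective of Eq. (2) is just a weighted sum of the factual loss, the two regularizers, and a weight-decay term. A schematic sketch (the hyperparameter names β, γ, λ follow Eq. (2); everything else here is our own illustration, not the paper's implementation):

```python
import numpy as np

def site_objective(l_fl, l_pddm, l_mpdm, weights, beta=1.0, gamma=1.0, lam=1e-4):
    """Eq. (2): L = L_FL + beta * L_PDDM + gamma * L_MPDM + lam * ||W||_2,
    where `weights` is a list of parameter arrays (bias terms excluded)."""
    w_norm = np.sqrt(sum(float(np.sum(w ** 2)) for w in weights))
    return l_fl + beta * l_pddm + gamma * l_mpdm + lam * w_norm

# Example: ||W||_2 = 5 for the single weight vector [3, 4].
print(site_objective(0.5, 0.2, 0.1, [np.array([3.0, 4.0])], lam=0.01))
```

In practice all three loss terms would be computed on the same mini-batch before the combined value is passed to the optimizer.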
PDDM and MPDM are computed on the selected triplet pairs in every mini-batch.\n\nTable 1: Performance comparison on the IHDP and Jobs datasets.\n\nMethod | IHDP (√ε_PEHE) within-sample | IHDP (√ε_PEHE) out-of-sample | Jobs (R_pol) within-sample | Jobs (R_pol) out-of-sample\nOLS/LR1 | 10.761 ± 4.350 | 7.345 ± 2.914 | 0.310 ± 0.017 | 0.279 ± 0.067\nOLS/LR2 | 10.280 ± 3.794 | 5.245 ± 0.986 | 0.228 ± 0.012 | 0.733 ± 0.103\nHSIC-NNM [5] | 2.439 ± 0.445 | 2.401 ± 0.367 | 0.291 ± 0.019 | 0.311 ± 0.069\nPSM [27] | 7.188 ± 2.679 | 7.290 ± 3.389 | 0.292 ± 0.019 | 0.307 ± 0.053\nk-NN [8] | 4.432 ± 2.345 | 4.303 ± 2.077 | 0.230 ± 0.016 | 0.262 ± 0.038\nCausal Forest [33] | 4.732 ± 2.974 | 4.095 ± 2.528 | 0.232 ± 0.018 | 0.224 ± 0.034\nBNN [18] | 3.827 ± 2.044 | 4.874 ± 2.850 | 0.232 ± 0.008 | 0.240 ± 0.012\nTARNet [30] | 0.729 ± 0.088 | 1.342 ± 0.597 | 0.228 ± 0.004 | 0.234 ± 0.012\nCFR-MMD [30] | 0.663 ± 0.068 | 1.202 ± 0.550 | 0.213 ± 0.006 | 0.231 ± 0.009\nCFR-WASS [30] | 0.649 ± 0.089 | 1.152 ± 0.527 | 0.225 ± 0.004 | 0.225 ± 0.010\nSITE (Ours) | 0.604 ± 0.093 | 0.656 ± 0.108 | 0.224 ± 0.004 | 0.219 ± 0.009\n\n3 Experiment\n\n3.1 Experiment on Real Datasets\n\nDatasets. Due to the missing counterfactual outcomes in reality, it is hard to evaluate individual treatment effect estimation on traditional observational datasets. In order to evaluate the proposed method, we conduct experiments on three datasets with different settings. The IHDP and Jobs datasets are adopted from [30], one of the state-of-the-art methods. The IHDP dataset aims to estimate the effect of specialist home visits on infants' future cognitive test scores, and the Jobs dataset aims to estimate the effect of job training on employment status. Details about the IHDP and Jobs datasets are provided in the supplementary material. 
The Twins dataset is derived from all twin births in the USA between 1989 and 1991 [2]. We focus on same-sex twin pairs whose birth weights are both less than 2000g. Each record contains 40 pre-treatment covariates related to the parents, the pregnancy, and the birth. The treatment T = 1 denotes being the heavier twin, and T = 0 being the lighter one. The outcome is the mortality after one year. After eliminating records with missing features, the final dataset contains 5409 records. In this setting, both treated and control outcomes can be observed. In order to create selection bias, we execute the following procedure to selectively choose one of the twins as the observation and hide the other: Ti | xi ∼ Bern(Sigmoid(w^T x + n)), where w ∼ U((−0.1, 0.1)^{40×1}) and n ∼ N(0, 0.1).\n\nBaselines. We compare the proposed method with the following four groups of baselines. (1) Regression based methods: least squares regression with the treatment as a feature (OLS/LR1), and separate linear regressors for each treatment group (OLS/LR2). (2) Nearest neighbor matching based methods: Hilbert-Schmidt Independence Criterion based nearest neighbor matching (HSIC-NNM) [5], propensity score matching with logistic regression (PSM) [27], and k-nearest neighbors (k-NN) [8]. (3) Tree and forest based method: Causal Forest [33]. (4) Representation learning based methods: balancing neural network (BNN) [18], counterfactual regression with the MMD metric (CFR-MMD) [30], counterfactual regression with the Wasserstein metric (CFR-WASS) [30], and the treatment-agnostic representation network (TARNet) [30].\n\nPerformance Measurement. On the IHDP dataset, the Precision in Estimation of Heterogeneous Effect (ε_PEHE) [13] is adopted as the performance metric, where ε_PEHE = (1/N) Σ_{i=1}^{N} ( E_{(y(i)_0, y(i)_1) ∼ P_{Y|xi}} [y(i)_0 − y(i)_1] − (ŷ(i)_0 − ŷ(i)_1) )^2. On the Jobs dataset, the policy risk R_pol [30] is used as the metric, which is defined as: R_pol = 1 − ( E[Y1 | π(x) = 1] P(π(x) = 1) + E[Y0 | π(x) = 0] P(π(x) = 0) ), where π(x) = 1 if ŷ1 − ŷ0 > 0 and π(x) = 0 otherwise. The policy risk measures the expected loss if the treatment is taken according to the ITE estimation. For PEHE and policy risk, the smaller the value, the better the performance. On the Twins dataset, the classes are imbalanced, so we adopt the area under the ROC curve (AUC) on the outcomes as the performance measure, as suggested in [25]. The larger the AUC, the better the performance.\n\nOn each dataset, we consider both the within-sample case and the out-of-sample case [30]. In the former case, the observed outcomes are available, while in the latter case, only the pre-treatment covariates are available. In the within-sample case, the performance metric is measured on the training dataset, and in the out-of-sample case on the test dataset. Since we never use the ground-truth ITE during training, the metric is meaningful in both the within-sample and out-of-sample cases.\n\nResults Analysis (footnote 2). Tables 1 and 2 show the performance of 10 realizations of our method and the baselines on the three datasets. SITE achieves the best performance on the IHDP and Twins datasets, and on the Jobs dataset, SITE achieves results similar to the best baseline. This confirms that preserving the local similarity information during representation learning helps to better estimate the counterfactual outcomes and the ITE.\n\nGenerally speaking, the representation learning based methods perform better than the linear regression based and nearest neighbor matching based methods. The regression based methods are not specially designed to deal with counterfactual inference, so their performance is affected by the selection bias. The nearest neighbor based methods incorporate the similarity information to overcome the selection bias, but they only use the observed outcomes of neighbors in the other group as the counterfactual outcomes, which might be inaccurate and unreliable.\n\nTable 2: Performance comparison on the Twins dataset.\n\nMethod | Twins (AUC) within-sample | Twins (AUC) out-of-sample\nOLS/LR1 | 0.660 ± 0.005 | 0.500 ± 0.028\nOLS/LR2 | 0.660 ± 0.004 | 0.500 ± 0.016\nHSIC-NNM [5] | 0.762 ± 0.011 | 0.501 ± 0.017\nPSM [27] | 0.500 ± 0.003 | 0.506 ± 0.011\nk-NN [8] | 0.609 ± 0.010 | 0.492 ± 0.012\nBNN [18] | 0.690 ± 0.008 | 0.676 ± 0.008\nTARNet [30] | 0.849 ± 0.002 | 0.840 ± 0.006\nCFR-MMD [30] | 0.852 ± 0.001 | 0.840 ± 0.006\nCFR-WASS [30] | 0.850 ± 0.002 | 0.842 ± 0.005\nSITE (Ours) | 0.862 ± 0.002 | 0.853 ± 0.006\n\nAmong the representation learning based methods, our proposed method outperforms all other baselines. The methods that balance distributions (BNN, CFR-MMD, CFR-WASS, and the proposed method) obtain better performance than the method without the balancing property (TARNet). BNN balances the distributions of the two treatment groups in the representation space and treats the treatment ti as a feature, while TARNet does not apply any regularization in the representation space, and its outcome prediction network is dichotomous. CFR-MMD and CFR-WASS have the same dichotomous outcome prediction networks, but they use different integral probability metrics to balance the distributions. The results of BNN, CFR-MMD, CFR-WASS, and the proposed SITE indicate that balancing the distributions of different treatment groups indeed helps to reduce the negative effect of selection bias.\n\nCompared with CFR-MMD and CFR-WASS, our proposed method SITE not only considers the balancing property (MPDM), but also preserves the local similarity information of the original feature space (PDDM). It is observed that on the IHDP dataset, SITE significantly improves the results in both the within-sample and out-of-sample cases. On the Jobs and Twins datasets, the performance of SITE is comparable with the best baseline. The results on the three datasets demonstrate the effectiveness of preserving local similarity information in the latent space. Moreover, with the specifically designed PDDM and MPDM structures, SITE can efficiently calculate the similarity information and balance the distributions of different treatment groups. The PDDM and MPDM structures only require the selected triplet pairs, which avoids handling the entire dataset. By jointly considering distribution balancing and similarity preserving, the proposed method can effectively and efficiently estimate the individual treatment effect.\n\n(Footnote 2: The code of SITE is available at https://github.com/Osier-Yi/SITE.)\n\nExperiments on PDDM and MPDM. PDDM (for local similarity preserving) and MPDM (for balancing) aim to reduce the generalization error when inferring the potential outcomes. As SITE assumes that similar units shall have similar treatment outcomes, PDDM and MPDM are able to preserve the local similarity information and meanwhile achieve balanced distributions in the latent space. In order to further confirm the effect of PDDM and MPDM, we compare SITE with SITE-without-PDDM and SITE-without-MPDM on all three datasets. Table 3 shows the results. SITE outperforms the two ablated variants in nearly all cases. Therefore, both structures, PDDM and MPDM, are necessary to improve the ITE estimation.\n\nTable 3: Experiment on PDDM & MPDM: performance comparison on three datasets.\n\nDataset | SITE | SITE-without-PDDM | SITE-without-MPDM\nIHDP (√ε_PEHE) within-sample | 0.604 ± 0.093 | 0.635 ± 0.127 | 0.859 ± 0.093\nIHDP (√ε_PEHE) out-of-sample | 0.656 ± 0.108 | 0.685 ± 0.128 | 1.416 ± 0.476\nJobs (R_pol) within-sample | 0.224 ± 0.004 | 0.233 ± 0.004 | 0.222 ± 0.003\nJobs (R_pol) out-of-sample | 0.219 ± 0.009 | 0.234 ± 0.012 | 0.234 ± 0.009\nTwins (AUC) within-sample | 0.862 ± 0.002 | 0.770 ± 0.033 | 0.796 ± 0.040\nTwins (AUC) out-of-sample | 0.853 ± 0.006 | 0.776 ± 0.033 | 0.788 ± 0.040\n\n3.2 Experiment on Synthetic Dataset\n\nData Generation. To evaluate the robustness of SITE, we design experiments on a synthetic dataset. Following the settings in [36], the synthetic data are generated as follows: we generate 5000 control samples from N(0_{10×1}, 0.5 × (Σ + Σ^T)) and 2500 treated samples from N(μ1, 0.5 × (Σ + Σ^T)), where Σ ∼ U((−1, 1)^{10×10}). By varying the value of μ1, data with different levels of selection bias are generated. The Kullback-Leibler divergence (KL divergence) is adopted to measure the selection bias: the larger the KL divergence, the smaller the overlap between the simulated control and treated groups, and the larger the selection bias. The outcome is generated as y | x ∼ w^T x + n, where w ∼ U((−1, 1)^{10×2}) and n ∼ N(0_{2×1}, 0.1 × I_{2×2}).\n\nFigure 5: Performance comparison on the synthetic dataset.\n\nResult Analysis. We compare the proposed method with the most competitive baselines, TARNet, CFR-MMD and CFR-WASS. The mean and variance of ε_PEHE over 10 realizations are reported in Figure 5. 
It is observed from the figure that SITE consistently outperforms the baseline methods under different levels of divergence.

Figure 5: Performance Comparison on Synthetic Dataset. (x-axis: KL divergence; y-axis: εPEHE; methods: CFR-MMD, CFR-WASS, TARNet, SITE.)

4 Conclusion

In this paper, we present an efficient deep representation learning method for estimating individual treatment effects. The proposed method jointly preserves the local similarity information and balances the distributions of the control and treated groups. Experimental results on the IHDP, Jobs and Twins datasets show that, in most cases, our method achieves better performance than the state-of-the-art. Extensive evaluation of our method further validates the benefits of preserving local similarity in ITE estimation.

5 Acknowledgment

This work was supported in part by the US National Science Foundation under grants NSF IIS-1747614, IIS-1218393 and IIS-1514204. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Also, we gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

References

[1] A. M. Alaa and M. van der Schaar. Bayesian inference of individualized treatment effects using multi-task Gaussian processes. In Advances in Neural Information Processing Systems, pages 3427–3435, 2017.

[2] D. Almond, K. Y. Chay, and D. S. Lee. The costs of low birth weight. The Quarterly Journal of Economics, 120(3):1031–1083, 2005.

[3] S. Athey and G. Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.

[4] S. Athey, J. Tibshirani, and S. Wager. Generalized random forests. arXiv preprint arXiv:1610.01271, 2016.

[5] Y. Chang and J. G. Dy.
Informative subspace learning for counterfactual inference. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, California, USA, pages 1770–1776, 2017.

[6] V. Chernozhukov, I. Fernández-Val, and B. Melly. Inference on counterfactual distributions. Econometrica, 81(6):2205–2268, 2013.

[7] H. A. Chipman, E. I. George, R. E. McCulloch, et al. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298, 2010.

[8] R. K. Crump, V. J. Hotz, G. W. Imbens, and O. A. Mitnik. Nonparametric tests for treatment effect heterogeneity. The Review of Economics and Statistics, 90(3):389–405, 2008.

[9] A. D'Amour, P. Ding, A. Feller, L. Lei, and J. Sekhon. Overlap in observational studies with high-dimensional covariates. arXiv preprint arXiv:1711.02582, 2017.

[10] M. Dudík, J. Langford, and L. Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, pages 1097–1104, 2011.

[11] M. J. Funk, D. Westreich, C. Wiesen, T. Stürmer, M. A. Brookhart, and M. Davidian. Doubly robust estimation of causal effects. American Journal of Epidemiology, 173(7):761–767, 2011.

[12] T. A. Glass, S. N. Goodman, M. A. Hernán, and J. M. Samet. Causal inference in public health. Annual Review of Public Health, 34:61–75, 2013.

[13] J. L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.

[14] C. Huang, C. C. Loy, and X. Tang. Local similarity-aware deep feature embedding. In Advances in Neural Information Processing Systems 29, Barcelona, Spain, pages 1262–1270, 2016.

[15] S. M.
Iacus, G. King, and G. Porro. Causal inference without balance checking: Coarsened exact matching. Political Analysis, 20(1):1–24, 2012.

[16] K. Imai and M. Ratkovic. Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):243–263, 2014.

[17] G. W. Imbens and D. B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

[18] F. D. Johansson, U. Shalit, and D. Sontag. Learning representations for counterfactual inference. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, pages 3020–3029, 2016.

[19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[20] K. Kuang, P. Cui, B. Li, M. Jiang, and S. Yang. Estimating treatment effect in the wild via differentiated confounder balancing. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, pages 265–274, 2017.

[21] M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, pages 4069–4079, 2017.

[22] B. K. Lee, J. Lessler, and E. A. Stuart. Improving propensity score weighting using machine learning. Statistics in Medicine, 29(3):337–346, 2010.

[23] S. Li and Y. Fu. Matching on balanced nonlinear representations for treatment effects estimation. In Advances in Neural Information Processing Systems, pages 930–940, 2017.

[24] S. Li, N. Vlassis, J. Kawale, and Y. Fu. Matching via dimensionality reduction for estimation of treatment effects in digital marketing campaigns. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, pages 3768–3774, 2016.

[25] C. Louizos, U. Shalit, J.
M. Mooij, D. Sontag, R. Zemel, and M. Welling. Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pages 6449–6459, 2017.

[26] J. Pearl. Causality. Cambridge University Press, 2009.

[27] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

[28] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

[29] D. B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.

[30] U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, pages 3076–3085, 2017.

[31] J. Splawa-Neyman, D. M. Dabrowska, and T. Speed. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, pages 465–472, 1990.

[32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[33] S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 2017.

[34] P. Wang, W. Sun, D. Yin, J. Yang, and Y. Chang. Robust tree-based causal inference for complex ad effectiveness analysis. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 67–76. ACM, 2015.

[35] Y. Wang, L. Solus, K. Yang, and C. Uhler.
Permutation-based causal inference algorithms with interventions. In Advances in Neural Information Processing Systems, pages 5824–5833, 2017.

[36] J. Yoon, J. Jordon, and M. van der Schaar. GANITE: Estimation of individualized treatment effects using generative adversarial nets. In International Conference on Learning Representations, 2018.

[37] K. Zhang, M. Gong, and B. Schölkopf. Multi-source domain adaptation: A causal view. In AAAI, pages 3150–3157, 2015.

[38] S. Zhao and N. Heffernan. Estimating individual treatment effect from educational studies with residual counterfactual networks. In Proceedings of the 10th International Conference on Educational Data Mining, 2017.