{"title": "Graph-Based Semi-Supervised Learning with Non-ignorable Non-response", "book": "Advances in Neural Information Processing Systems", "page_first": 7015, "page_last": 7025, "abstract": "Graph-based semi-supervised learning is a powerful tool for classification tasks, but most existing literature assumes that the labelled nodes are randomly sampled. When the labelling status depends on the unobserved node response, ignoring the missingness can lead to significant estimation bias and handicap the classifiers. This situation is called non-ignorable non-response. To solve the problem, we propose a Graph-based joint model with Non-ignorable Missingness (GNM), together with a joint inverse-probability-weighting estimation procedure incorporating a sampling imputation approach. Simulations and real data analysis on the Cora dataset show that our method outperforms several state-of-the-art models in both regression and classification problems.", "full_text": "Graph-Based Semi-Supervised Learning with Nonignorable Nonresponses

Fan Zhou1, Tengfei Li2, Haibo Zhou2, Jieping Ye3, Hongtu Zhu3,2

Shanghai University of Finance and Economics1, zhoufan@mail.shufe.edu.cn
University of North Carolina at Chapel Hill2, tengfei_li@med.unc.edu, zhou@bios.unc.edu
AI Labs, Didi Chuxing3, {yejieping,zhuhongtu}@didiglobal.com

Abstract

Graph-based semi-supervised learning is very important for many classification tasks, but most existing methods assume that all labelled nodes are randomly sampled. In the presence of nonignorable nonresponse, ignoring the missing nodes can lead to significant estimation bias and handicap the classifiers. To solve this issue, we propose a Graph-based joint model with Nonignorable Missingness (GNM) and develop an imputation and inverse probability weighting estimation approach. 
We further use graph neural networks to model nonlinear link functions and then use a gradient descent (GD) algorithm to estimate all the parameters of GNM. We prove the identifiability of the GNM model and validate its predictive performance in both simulations and real data analysis by comparing it with models that ignore or misspecify the missingness mechanism. Our method achieves up to a 7.5% improvement over the baseline model on the document classification task on the Cora dataset.

1 Introduction

Graph-based semi-supervised learning has been increasingly studied as more and more real graph datasets become available. The problem is to predict the responses of all unlabelled nodes in a graph when only a small subset of nodes is observed. A popular approach uses graph Laplacian regularization to learn node representations, such as label propagation [25] and manifold regularization [3]. Recently, attention has shifted to the learning of network embeddings [12, 13, 20, 7, 23, 10, 6]. Almost all existing methods assume that the labelled nodes are randomly selected. In the real world, however, the probability of missingness may depend on the unobserved data even after conditioning on the observed data. For example, when predicting the traffic volume of a road network, sensors used to collect data are usually set up at intersections with large traffic flow. A researcher is more likely to label the documents in a citation network that fall into the categories with which he or she is more familiar. In these cases, non-responses may be missing not at random (MNAR). Ignoring nonignorable nonresponses fails to account for the unrepresentativeness of the remaining samples, leading to significant estimation bias.

Modeling non-ignorable missingness is challenging because the MNAR mechanism is usually unknown and may require additional model identifiability assumptions [5, 14, 21]. 
A popular method assigns the inverse of estimated response probabilities as weights to the observed nodes [16, 4], but these procedures are designed for the missing at random (MAR) mechanism rather than MNAR. Another method is to impute missing data by using observed data [18, 19, 11]. Some more advanced methods [24, 21] have been proposed to estimate the non-ignorable missingness using external data [8], but such data are often unavailable, making these methods infeasible in many applications. Moreover, all these methods are built on simple regressions and cannot be applied directly to graphs.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper, we develop a Graph-based joint model with Nonignorable Missingness (GNM) by assigning inverse response probabilities as weights to labelled nodes when estimating the target classifier or regression. To model the non-ignorable missingness, we propose a deep-learning-based exponential tilting model that exploits the strengths of neural networks in function approximation and representation learning. 
The main contributions of this paper can be summarized as follows:

• To the best of our knowledge, we are the first to consider the graph-based semi-supervised learning problem in the presence of non-ignorable nonresponse and to solve it from the perspective of missing data.
• We introduce a novel notion of identifiability on the prediction-equivalence quotient (PEQ) space for neural network architectures, together with easily checked sufficient conditions.
• Different from traditional statistical methods, which extract features and fit the prediction model separately, we propose a novel joint estimation approach that integrates the inverse weighting framework with a modified loss function based on the imputation of non-response. It is easy to implement in practice and is robust to the normality assumption when the node response is continuous.
• We use a gradient descent (GD) algorithm to learn all the parameters, which works for traditional regression models as well as for modern deep graph neural networks.
• We examine the finite sample performance of our methods in both simulation and real data experiments, demonstrating the necessity of 'de-biasing' to obtain unbiased predictions on the testing data under the non-ignorable nonresponse setting.

2 Model Description

Let G = (V, E, A) be a weighted graph, where V = {v1, . . . , vN} denotes the vertex set of size |V| = N, E contains all the edges, and A is an N × N adjacency matrix. The N vertexes make up the whole population, with only a small subset of vertexes being labelled. We introduce some important notations as follows:
(i). x = [x1, x2, . . . , xN]T ∈ R^(N×p) is a fully observed input feature matrix of size N × p, with each xi ∈ R^p being a p × 1 feature vector at vertex vi.
(ii). Y = (y1, y2, . . . 
, yN)T is a vector of vertex responses, which is partially observed subject to missingness; yi can be either categorical or continuous.
(iii). A ∈ R^(N×N) is the adjacency matrix (binary or weighted), which encodes node similarity and network connectivity. Specifically, aij represents the edge weight between vertexes vi and vj.
(iv). ri ∈ {0, 1} is a "labelling indicator"; that is, yi is observed if and only if ri = 1. Let R = {1, . . . , n} denote the set of labelled vertexes and Rc = {n + 1, . . . , N} the subsample of non-respondents for which the vertex label is missing.
(v). G_A(x; θg) ∈ R^(N×q) denotes a matrix of unknown functions of x, whose i-th row G_A(x; θg)i is a q × 1 feature vector; it can be a deep neural network incorporating the network connectivity A, such as a multi-layer GCN [10] or GAT [22].

In this paper, we consider a non-ignorable response mechanism, where the indicator variable ri depends on yi (which is unobserved when ri = 0). It is assumed that ri follows a Bernoulli distribution:

ri | (yi, h(xi; θh)) ∼ Bernoulli(πi),   (1)

where h(xi; θh) is an unknown parametric function of xi and πi = π(yi, h(xi; θh)) = P(ri = 1 | yi, h(xi; θh)) is the response probability for yi. Given G_A(x; θg), yi and yj are assumed to be independent, and given yi and h(xi; θh), ri and rj are assumed to be independent for i ≠ j. Furthermore, an exponential tilting model is proposed for πi as follows:

π(yi, h(xi; θh)) = π(yi, h(xi; θh); αr, γ, φ) = exp{αr + γT h(xi; θh) + φ yi} / [1 + exp{αr + γT h(xi; θh) + φ yi}].   (2)

Our question of interest is to unbiasedly learn an outcome model Y | x. 
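A minimal sketch of the exponential tilting response model (2), using a scalar tilting feature h(x) for brevity (all names and parameter values here are illustrative, not the paper's):

```python
import numpy as np

def response_prob(y, h_x, alpha_r, gamma, phi):
    """Exponential tilting model (2): P(r = 1 | y, h(x)).

    phi != 0 makes the response probability depend on the possibly
    unobserved response y, i.e. the nonresponse is non-ignorable (MNAR);
    phi = 0 reduces it to an ignorable mechanism.
    """
    logit = alpha_r + gamma * h_x + phi * y
    return 1.0 / (1.0 + np.exp(-logit))

# With phi > 0, larger responses are more likely to be labelled.
p_low = response_prob(0.0, 0.5, -1.0, 1.0, 1.5)
p_high = response_prob(2.0, 0.5, -1.0, 1.0, 1.5)
```

Setting phi = 0 recovers the weaker, ignorable mechanism that the MAR-based weighting methods discussed in the introduction assume.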
Without loss of generality, when y is continuous, we consider a linear model given by

Y = α + G_A(x; θg)β + ε,   (3)

where ε = (ε1, · · · , εN)T ∼ N(0, σ²I) and ε ⊥ x is the error term with zero unconditional mean, that is, E(εi) = 0. In this case, dropping the missing data can lead to strongly biased estimates when r depends on y. The parameter estimates will not be consistent, since E{εi | ri = 1} and E{εi G_A(x; θg)i | ri = 1} are not zero. Moreover, the missing values could not be imputed even if we had consistent estimates, since

E{yi | ri = 0, G_A(x; θg)i; α, β} = E{yi(1 − ri) | G_A(x; θg)i; α, β} / [1 − P(ri = 1 | G_A(x; θg)i; α, β)]
= α + βT G_A(x; θg)i − cov(yi, πi | G_A(x; θg)i; α, β) / [1 − E(πi | G_A(x; θg)i; α, β)]
≠ α + βT G_A(x; θg)i.   (4)

When y is a K-class discrete variable, we consider a multicategorical logit model as follows:

P(yi = k | G_A(x; θg)i; αk, βk) = exp(αk + βkT G_A(x; θg)i) / Σ_{j=1}^{K} exp(αj + βjT G_A(x; θg)i)  ∀k.   (5)

Therefore, we can define a joint model of (1) and (3) (or (1) and (5)), called the Graph-based joint model with Nonignorable Missingness (GNM), to obtain an unbiased estimation of Y | x.

3 Estimation

In this section, we examine several important properties of GNM, such as identifiability, and its estimation algorithm.

3.1 Identifiability

We consider the identifiability property of GNM. Let Y = (y_obs^T, y_mis^T)T and J = (R, Rc). The joint probability density function (pdf) of the observed data is given by

f(y_obs, J | x) = f(y1, y2, . . . , yn, r1, . . . , rN | x) = Π_{i=1}^{n} f(yi, ri | x) · Π_{i=n+1}^{N} ∫ f(yi, ri | x) dyi. 
(6)

Based on the assumptions on ri | (yi, h(xi)) and yi | G_A(x; θg)i, (6) is equivalent to

Π_i [P(ri = 1 | yi, h(xi; θh)) f(yi | G_A(x; θg)i)]^{ri} [1 − ∫ P(ri = 1 | y, h(xi; θh)) f(y | G_A(x; θg)i) dy]^{1−ri}.   (7)

The GNM model is called identifiable if, for different sets of parameters (θh, θg), P(ri = 1 | yi, h(xi; θh)) f(yi | G_A(x; θg)i) are different functions of (yi, x). Identifiability implies that, with positive probability, the global maximum of (7) is unique.

However, identifiability may fail for many neural network models. For example, the identifiability of the parameters in (1) is one of the necessary conditions for model identifiability, and it can fail for a ReLU network. Specifically, we have

Logit[P(ri = 1 | yi, h(zi; βr)); γ] = αr + γ ReLU(zi βr) + φ yi = Logit[P(ri = 1 | yi, h(zi; 2βr)); γ/2].

Fortunately, this type of non-identifiability does not create any prediction discrepancy: under GNM, the prediction of y given x is exactly the same for different (γ, θh, β, θg) and (γ′, θ′h, β′, θ′g) whenever

γT h(x; θh) = γ′T h(x; θ′h)  and  G_A(x; θg)β = G_A(x; θ′g)β′.   (8)

In consideration of this prediction equivalence, a more useful definition of identifiability is given in the following. Let f(yi | G_A(x)i; θy) = f(yi | G_A(x; θg)i; α, β) and P(ri = 1 | yi, h(zi); θr) = P(ri = 1 | yi, h(zi; θh); αr, γ, φ), where θy = (α, β, θg) and θr = (αr, γ, φ, θh) contain the unknown parameters in the outcome model Y | x and the missing data model r | (y, z). 
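The ReLU rescaling non-identifiability above can be checked numerically; a small sketch (the random values are purely illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 3))        # 5 nodes, 3 features
beta_r = rng.normal(size=3)
gamma, alpha_r, phi = 1.7, -0.3, 0.8
y = rng.normal(size=5)

# Scaling the inner weights by 2 and the outer coefficient by 1/2
# leaves the logit of the response model unchanged, because ReLU is
# positively homogeneous: relu(2t) = 2 relu(t).
logit1 = alpha_r + gamma * relu(z @ beta_r) + phi * y
logit2 = alpha_r + (gamma / 2.0) * relu(z @ (2.0 * beta_r)) + phi * y
```

The two parameter sets are therefore observationally equivalent, which is exactly why the definition below works with equivalence classes rather than individual parameter vectors.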
The D(θy) ⊗ D(θr) denotes the domain of (θy, θr), where ⊗ is the tensor product of two spaces.

Definition 3.1. Under GNM, we call (θy, θr) equivalent to (θ′y, θ′r), denoted by

(θy, θr) ∼ (θ′y, θ′r),

if (8) holds and α′ = α, α′r = αr, and φ′ = φ, where θy = (α, β, θg), θr = (αr, γ, φ, θh), θ′y = (α′, β′, θ′g), and θ′r = (α′r, γ′, φ′, θ′h). The equivalence class of an element (θy, θr), denoted by [[(θy, θr)]], is defined as the set

[[(θy, θr)]] = {(θ′y, θ′r) ∈ D(θy) ⊗ D(θr) | (θ′y, θ′r) ∼ (θy, θr)},

and the set of all equivalence classes is called the Prediction-Equivalent Quotient (PEQ) space, denoted by S = D(θy) ⊗ D(θr)/∼. The GNM model is called identifiable on the PEQ space iff

f(y | G_A(x)i; θy) P(r = 1 | y, h(xi); θr) = f(y | G_A(x)i; θ′y) P(r = 1 | y, h(xi); θ′r)

holding for all x, y implies (θy, θr) ∼ (θ′y, θ′r).

Different from identifiability on the parameter space, identifiability on the PEQ space implies the uniqueness of the prediction given x rather than of the parameter estimates. It is applicable to complex architectures that focus more on prediction than on parameters. The following is an example that is identifiable on neither the parameter space nor the PEQ space.

Example 1. 
Let G_A(x; θg) = x, h(x; θh) = x, yi ∼ N(μ + xβ, 1), and P(ri = 1 | yi) = [1 + exp(−αr − xγ − φ yi)]^{−1} with unknown real-valued αr, γ, φ, μ, and β, and thus

P(ri = 1 | yi, h(xi)) f(yi | G_A(x)i) = exp[−(yi − μ − xiβ)²/2] / {√(2π) [1 + exp(−αr − φ yi − γ x)]}.   (9)

In this case, two different sets of parameters (αr, γ, φ, μ, β) and (α′r, γ′, φ′, μ′, β′) produce equal values of (9) if αr = −(μ² − μ′²)/2, β′ = β, φ = μ′ − μ, γ = β(μ − μ′), α′r = −αr, φ′ = −φ, and γ′ = −γ. The observed likelihood is only identifiable under ignorable missingness, i.e. φ = φ′ = 0.

Additional conditions are required to ensure the identifiability of GNM on the PEQ space.

Theorem 3.1. Assume the following three conditions.
(A1) For all θg, there exists (x1, x2) such that G_A(x1; θg)i ≠ G_A(x2; θg)i for each i, and β ≠ 0.
(A2) For all θg and z, there exists (u1, u2) such that G_A([z, u1]; θg)i ≠ G_A([z, u2]; θg)i for each i, and β ≠ 0.
(A3) For all θh, there exists (z1, z2) such that h(z1; θh) ≠ h(z2; θh), and γ ≠ 0.
The GNM model (1) and (5) is identifiable on the PEQ space under Condition (A1). Suppose that there exists an instrumental variable u in x = [z, u] such that f(yi | G_A(x)i) depends on u, whereas P(ri = 1 | yi, h(xi)) does not. 
Then the GNM model (1) and (3) is identifiable on the PEQ space under Conditions (A2) and (A3).

Regularity conditions (A1)–(A3) are easy to satisfy.

3.2 Estimation Approach

It is not easy to directly maximize the full likelihood function (6) in practice, since its integration term can be extremely difficult to compute. On the other hand, the normality assumption on the error term can be restrictive for the GNM consisting of (1) and (3). We therefore propose a doubly robust (DR) estimation approach that alternately obtains the Inverse Probability Weighted Estimator (IPWE) of θy and the imputation estimator of θr [15, 1].

Inverse Probability Weighted Estimator (IPWE) of θy

With π(yi, h(xi); θr) estimated by π(yi, h(xi); θ̂r), the IPWE of θy can be obtained by minimizing the weighted cross-entropy loss

L1(θy | θ̂r) = −Σ_i [ri / π(yi, h(xi); θ̂r)] Σ_{k=1}^{K} 1(yi = k) log(P(yi = k | G_A(x)i; θy))   (10)

when Y | x follows (5), or by minimizing the weighted mean squared error (MSE)

L1(θy | θ̂r) = Σ_i [ri / π(yi, h(xi); θ̂r)] {yi − α − βT G_A(x; θg)i}²   (11)

when Y is continuous. 
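The weighted losses (10) and (11) can be sketched as follows (NumPy; `pi` holds the estimated response probabilities, and the clipping constants are our own numerical-stability choice, not part of the paper):

```python
import numpy as np

def ipw_mse(y, y_hat, r, pi, eps=1e-6):
    """Weighted MSE (11): only labelled nodes (r_i = 1) contribute,
    each weighted by the inverse of its estimated response probability."""
    w = r / np.clip(pi, eps, 1.0)
    return float(np.sum(w * (y - y_hat) ** 2))

def ipw_cross_entropy(y_onehot, p_hat, r, pi, eps=1e-6):
    """Weighted cross-entropy (10) for the K-class logit model (5)."""
    w = r / np.clip(pi, eps, 1.0)
    return float(-np.sum(w * np.sum(y_onehot * np.log(p_hat + 1e-12), axis=1)))
```

With π_i ≡ 1 these reduce to the ordinary unweighted losses of the baseline model that ignores the missingness mechanism.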
The estimating equation (11) is robust with respect to the normality assumption. If π(yi, h(xi); θr) is correctly specified, the IPW estimator of θy that solves ∂L1(θy | θ̂r)/∂θy = 0 is consistent and converges to θy according to the following theorem.

Theorem 3.2. If θr is known, then any estimating function l(yi, G_A(x)i; θy) with E_θy{Σ_i l(yi, G_A(x)i; θy)} = 0 satisfies

E_θy{Σ_i [ri / π(yi, h(xi); θr)] l(yi, G_A(x)i; θy)} = 0.

Imputation estimator of θr

With the estimated f(Y | G_A(x; θ̂g)), we can obtain an estimator of θr by minimizing

L2(θr | θ̂y) = −Σ_{ri=1} log(π(yi, h(xi); θr)) − Σ_{ri=0} log(1 − E{π(yi, h(xi); θr) | x; θ̂y}),   (12)

where π(yi, h(xi); θr) = P(ri = 1 | yi, h(xi); θr) and E{π(yi, h(xi)) | x; θ̂y} = ∫ P(ri = 1 | y, h(xi); θr) f(y | G_A(x)i; θ̂y) dy. One advantage of our proposed joint estimation approach is that E(π(yi, h(xi); θr) | x) can easily be approximated by the empirical average over a set of random draws at the nodes with missing y, used as imputed responses:

E{π(yi, h(xi); θr) | x; θy} = ∫ P(ri = 1 | y, h(xi); θr) f(y | G_A(x)i; θy) dy ≈ B^{−1} Σ_{b=1}^{B} π(yib, h(xi); θr),

where {yib}_{b=1}^{B} iid∼ f(y | G_A(x)i; θ̂y). 
Thus, we can get an unbiased estimate of (12) by replacing the expectation with an empirical mean over samples generated from f(y | G_A(x)i; θ̂y) as follows:

L̃2(θr | θ̂y) = −Σ_{ri=1} log(π(yi, h(xi); θr)) − Σ_{ri=0} log(1 − B^{−1} Σ_b π(yib, h(xi); θr)),   (13)

the gradient of which can be expressed as

∇_θr L̃2(θr | θ̂y) = −Σ_{ri=1} ∇_θr πi / πi + Σ_{ri=0} [B^{−1} Σ_b ∇_θr π(yib, h(xi); θr)] / [1 − B^{−1} Σ_b π(yib, h(xi); θr)],   (14)

with yib ∼ f(y | G_A(x)i; θ̂y). The imputation estimator of θr obtained by minimizing L2(θr | θy) is consistent when f(Y | G_A(x; θg)) is correctly specified. The overall estimation procedure is schematically depicted in Figure 1.

3.3 Algorithm

In this subsection, we give more details on how our proposed imputation and IPW estimation approach jointly estimates θy and θr by alternately minimizing the conditional loss functions L1(θy | θ̂r) and L̃2(θr | θ̂y) in practice. Specifically, we update θy and then θr, in order, at each epoch: θy^(e+1) = arg min_θy L1(θy | θr^(e)) and θr^(e+1) = arg min_θr L̃2(θr | θy^(e+1)), where θr^(e) and θy^(e+1) are the estimates of θr and θy obtained at the e-th and (e + 1)-th epoch, respectively. 
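A minimal numerical sketch of the Monte Carlo loss (13), assuming for brevity a scalar feature h(x_i) and a normal outcome model with mean mu_i and standard deviation sigma (all names are illustrative):

```python
import numpy as np

def pi_fn(y, h, alpha_r, gamma, phi):
    # Exponential tilting response model (2), scalar-feature version.
    return 1.0 / (1.0 + np.exp(-(alpha_r + gamma * h + phi * y)))

def mc_imputation_loss(y, r, h, mu, sigma, theta_r, B=100, rng=None):
    """Monte Carlo approximation (13) of the observed-data loss (12).

    For unlabelled nodes (r_i = 0), the intractable expectation
    E{pi_i | x} is replaced by the average of pi over B imputed draws
    y_ib ~ N(mu_i, sigma^2) from the current outcome model.
    """
    alpha_r, gamma, phi = theta_r
    rng = rng or np.random.default_rng(0)
    loss = 0.0
    for i in range(len(r)):
        if r[i] == 1:
            loss -= np.log(pi_fn(y[i], h[i], alpha_r, gamma, phi))
        else:
            draws = rng.normal(mu[i], sigma, size=B)
            loss -= np.log(1.0 - np.mean(pi_fn(draws, h[i], alpha_r, gamma, phi)))
    return loss
```

The loss only touches y_i where r_i = 1, so the missing responses enter solely through the sampled imputations, mirroring the second sum in (13).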
We use the gradient descent (GD) algorithm to learn all the parameters in θr and θy, while incorporating the network architectures of G_A(x; θg) and h(x; θh).

Without imposing the normality assumption when yi is continuous, we replace the random draw yib^(e) in (13) by the expectation β0 + β1T G_A(x; θg^(e))i at the e-th epoch. This can be seen as an approximation obtained by linearizing π(yi, h(xi)) using a Taylor series expansion and taking the expectation of the first two terms [2]:

E{π(yi, h(xi)) | x; θy^(e)} ≈ π(E(yi | x; θy^(e)), h(xi)) = π(β0 + β1T G_A(x; θg^(e))i, h(xi)).

Figure 1: General Picture of the Joint Estimation Approach

In this case, it is equivalent to letting B = 1, and the sample size, i.e. the total number of nodes, is fixed at each training epoch. Based on the simulations and real experiments below, this simplification still outperforms the baseline models, with a significant improvement in prediction accuracy on the non-response nodes.

The details of the algorithm are described in five steps as follows:

1. Determine the initial value of the response probability π^(0) (or θr^(0)). For example, we can let πi^(0) = 1 for all the labelled vertexes (ri = 1).

2. Let e = 1, where e represents the epoch number. We update θy based on πi^(0) obtained from the previous epoch by minimizing the loss function in (10) using GD. At the i-th iteration within the e-th epoch, we update θy as follows:

θy^(e,i+1) ← θy^(e,i) − γ0 ∇_θy L1(θy | θr^(e−1)),   (15)

where γ0 is the learning rate and L1(θy | θr^(e−1)) denotes the loss function based on πi^(e−1) = πi(yi, h(xi); θr^(e−1)).

3. 
Impute yi for all the unlabelled nodes (ri = 0) using yi^(e) = β0^(e) + G_A(x; θg^(e))iT β1^(e) in the continuous case, and by sampling yi^(e) from the distribution P(yi | G_A(x)i; θy^(e)) otherwise, where θy^(e) denotes the updated θy after M^(e) iterations of Step 2.

4. We use GD to update θr. Specifically, at the j-th iteration, we have

θr^(e,j+1) ← θr^(e,j) − γ1 ∇_θr L̃2(θr | θy^(e)),   (16)

with the initial value θr^(e,0) equal to θr^(e−1), where γ1 is the learning rate. After convergence, we obtain the estimate of θr, denoted θr^(e), at the end of this training epoch. Then we update the sampling weight πi^(e) based on P(ri = 1 | yi, h(xi); θr^(e)) for all labelled vertexes.

5. Stop once convergence has been achieved; otherwise let e = e + 1 and return to Step 3.

The convergence criterion is whether the imputed unlabelled vertexes at epoch e differ only slightly from those at epoch (e − 1). In other words, the iteration procedure is stopped if

Σ_{ri=0} |yi^(e) − yi^(e−1)| / Σ_i 1(ri = 0) ≤ ε.

We let M0 and M1 be the maximal numbers of allowed internal iterations at each epoch for updating θy and θr, respectively. For more details, refer to Algorithm 1 in the supplements. Theoretically, the complexity of Steps 2 and 3 (for example, with a one-layer GCN) is O(|E|pq) at each epoch according to [10], where |E| < N² is the number of edges. 
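The five steps above can be run end-to-end on a toy problem; the sketch below uses a plain linear outcome model, a y-only tilting model, and a closed-form weighted least-squares solve as a stand-in for the gradient steps on loss (11), so everything here is an illustrative simplification of the paper's GCN-based GNM:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=N)               # outcome model (3)
r = (rng.random(N) < 1 / (1 + np.exp(-(-2.0 + 1.5 * y)))).astype(float)  # MNAR labelling

pi = np.ones(N)                                                 # Step 1: pi_i^(0) = 1
y_imp_prev = np.full(N, np.inf)
for epoch in range(50):
    # Step 2: inverse-probability-weighted least squares for (alpha, beta).
    w = r / pi
    X = np.column_stack([np.ones(N), x])
    alpha, beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    # Step 3: impute unlabelled nodes with the fitted conditional mean.
    y_imp = alpha + beta * x
    y_work = np.where(r == 1, y, y_imp)
    # Step 4: refit the response model (alpha_r, phi) by gradient ascent
    # on the Bernoulli log-likelihood of r given y_work, then update pi.
    ar, phi = 0.0, 0.0
    for _ in range(500):
        p = 1 / (1 + np.exp(-(ar + phi * y_work)))
        ar += 0.1 * np.mean(r - p)
        phi += 0.1 * np.mean((r - p) * y_work)
    pi = np.clip(1 / (1 + np.exp(-(ar + phi * y_work))), 1e-3, 1.0)
    # Step 5: stop when the imputations stabilise.
    if np.mean(np.abs(y_imp - y_imp_prev)[r == 0]) < 1e-6:
        break
    y_imp_prev = y_imp
```

The alternation mirrors the epoch-level schedule: the outcome fit conditions on the current weights, and the response model conditions on the freshly imputed responses.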
Moreover, the complexity of Step 4 is O(Np) when h is a single fully connected layer.

4 Experiments

In this section, simulations and a real data analysis are conducted to evaluate the empirical performance of our proposed methods against a baseline model that ignores the non-response (SM). Our GNM model reduces to SM when it contains only the outcome model Y | x given in (3) (or (5)), with the weights in the loss (10) set to 1 for all samples. In the real data part, GNM is also compared with a model with a misspecified ignorable missingness mechanism and with some other state-of-the-art 'de-biasing' methods. In the simulation part, we simulate the node response y based on (3) and generate the labelled set by the exponential tilting model (1). For the real data analysis, we evaluate all the compared models on semi-supervised document classification on the Cora citation network with non-ignorable non-response.

In this paper, we use a GCN to learn the latent node representations G_A(x), with the layer-wise propagation defined as

H^(l+1) = f(H^(l), A) = σ(D̂^(−1/2) Â D̂^(−1/2) H^(l) W^(l)),   (17)

where Â = A + I, in which I is an identity matrix, and D̂ is the diagonal vertex degree matrix of Â. W^(l) is the weight matrix for the l-th layer and σ(·) is a non-linear activation function. H^(0) = x is the initial input, and G_A(x) = H^(2) ∈ R^(N×p̄) is the output of the second layer-wise propagation. For a fair comparison, we let G_A(x) be a 2-layer GCN model for all compared approaches.

4.1 Simulations

We consider network data generated from |V| = 2708 vertexes together with a binary adjacency matrix A. x ∈ R^(2708×1433) denotes the fully observed input features, a large-scale sparse matrix. Both A and x are obtained from the Cora dataset. 
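A minimal dense-matrix sketch of the propagation rule (17); the activation choices stand in for the unspecified σ(·) and are not prescribed by the paper:

```python
import numpy as np

def gcn_layer(H, A, W, act=np.tanh):
    """One layer-wise propagation step (17):
    H' = act(D^{-1/2} (A + I) D^{-1/2} H W), where D is the diagonal
    degree matrix of the renormalized adjacency A + I."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return act(A_norm @ H @ W)

def gcn_2layer(x, A, W0, W1):
    # G_A(x) = H^(2): composition of two propagation steps, with a
    # ReLU hidden activation as an illustrative choice.
    return gcn_layer(gcn_layer(x, A, W0, act=lambda t: np.maximum(t, 0.0)), A, W1)
```

In practice the Cora adjacency is sparse, so the dense matrix products here would be replaced by sparse operations, which is what gives the O(|E|pq) per-epoch cost quoted above.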
The node response is simulated from the following model:

yi = β0 + β1T G_A(x)i + εi,   (18)

where εi ∼ N(0, σ²) and G_A(x) is the output of a 2-layer GCN model. We let the response probability π depend on the unobserved vertex response y only, so that (1) simplifies to

πi ≡ P(ri = 1 | yi) = exp{αr + φ yi} / [1 + exp{αr + φ yi}].   (19)

In this case, the instrumental variable u is exactly x itself, and identifiability automatically holds according to Theorem 3.1. All β's in (18) are sampled from the uniform distribution U(0, 1). The values of αr and φ were selected to make the overall missing proportion approximately 90%. The labelled subset is randomly split into training and validation sets, while the remaining non-response nodes form the testing set. We train all the compared models for a maximum of 200 epochs (E = 200) using Adam [9] with a learning rate of 0.05 and make predictions ŷi for each testing vertex. Training is stopped when the validation loss does not decrease for 15 consecutive iterations. We keep all other model settings used by [10] and fix the unit size of the first hidden layer to 16.

Table 1 summarizes the estimation results under different (p̄, σ) combinations, where the root mean squared error (RMSE) and mean absolute percentage error (MAPE) are computed between the true node response y and the prediction ŷ over 50 runs. We can clearly see that GNM outperforms SM under all four settings, with much smaller mean RMSEs and MAPEs. Moreover, GNM is more stable than SM, with smaller estimation variance.

4.2 Real Data Analysis

For the real data analysis, we modify Cora into a binary-class dataset by merging the six non-'Neural Network' classes together. 
The global prevalence of the two new classes is (0.698, 0.302), with N0 = #{y = 0} = 1890 and N1 = #{y = 1} = 818, respectively.

Two missing mechanisms are considered. A simple setup is the same as (19). In this case, we compare our method with the inverse weighting approach proposed by [17]. We let the two functions of x required to estimate π under their framework be the constant 1 and the first principal component (PC) score, which is more stable than other choices such as a single xj or Σj xj.

Table 1: Mean RMSEs and MAPEs by GNM and SM based on simulated data sets

p̄    σ    Method  Metric  Mean    SD
4    0.5  SM      RMSE    1.1925  6.43e-1
4    0.5  SM      MAPE    0.2932  2.01e-1
4    0.5  GNM     RMSE    0.6983  1.28e-2
4    0.5  GNM     MAPE    0.1995  1.00e-2
4    1    SM      RMSE    1.6185  8.58e-2
4    1    SM      MAPE    0.3104  4.73e-2
4    1    GNM     RMSE    1.2103  4.81e-2
4    1    GNM     MAPE    0.2263  2.28e-2
16   0.5  SM      RMSE    0.7923  9.94e-2
16   0.5  SM      MAPE    0.2014  2.42e-2
16   0.5  GNM     RMSE    0.6015  2.17e-2
16   0.5  GNM     MAPE    0.1672  1.90e-2
16   1    SM      RMSE    1.4212  2.14e-1
16   1    SM      MAPE    0.2129  1.05e-2
16   1    GNM     RMSE    1.1316  6.04e-2
16   1    GNM     MAPE    0.1849  4.62e-3

Figure 2: Boxplot of RMSEs in real data analysis

In the more complicated setup, the labelled nodes are generated based on

πi ≡ P(ri = 1 | yi, h(xi)) = exp{αr + γT h(xi) + φ yi} / [1 + exp{αr + γT h(xi) + φ yi}],   (20)

where h(xi) = exp(Σj xij/a0 − a1) − (Σj xij − a2)/a3, with value range [0, 1]. The explicit form of h(x) is assumed to be unknown, and we use a multi-layer perceptron to approximate it. The network has two hidden layers with 128 and 64 units, respectively, and we use the 'tanh' activation for the final output layer. As a comparison, we also include the results when the non-ignorable missingness is over-simplified to an ignorable one (GIM). 
We let nk = #{(yi = k) ∧ (ri = 1)} and use λ to denote the size ratio n1/n0 between the two groups of labelled nodes. We also carry out experiments on other datasets, including Citeseer, and explore the finite sample performance of our method with other state-of-the-art architectures such as GAT [22]. More details are provided in the supplementary materials.¹

Table 2: Mean Prediction Accuracy for the simple setup by each method

λ     Method  Mean    SD
1     SM      0.8683  1.98e-2
      Rosset  0.8514  5.19e-2
      GNM     0.8947  6.47e-3
1.5   SM      0.8458  2.21e-2
      Rosset  0.8311  7.09e-2
      GNM     0.8908  1.26e-2
2     SM      0.8052  3.26e-2
      Rosset  0.8193  6.05e-2
      GNM     0.8648  2.54e-2

Figure 3: Boxplot of Prediction Accuracy for the simple setup

Table 3: Mean Prediction Accuracy for the complicated setup by each method

λ     Method  Mean    SD
1     SM      0.8663  1.21e-2
      GIM     0.8713  1.52e-2
      GNM     0.8961  1.18e-2
2     SM      0.8141  2.34e-2
      GIM     0.8291  2.79e-2
      GNM     0.8669  1.63e-2

Figure 4: Boxplot of Prediction Accuracy for the complicated setup

Figure 5: Number of iterations for GNM and SM at each epoch under sub-setting one

Results are summarized in Tables 2 and 3. Reported values are the average classification accuracy on the testing data over 50 replications, with re-sampling allowed. In each setup, two 'de-biasing' methods, including our approach, are compared with SM. We adjust α and β to make the size of the training set around 120 for each sub-setting. Increasing λ reduces the number of y = 0 nodes included in the training set, leading to insufficient learning power and thus a lower overall classification accuracy. For the simple setup, GNM significantly outperforms the compared models, increasing the baseline prediction accuracy by 3.1% – 7.4%.

¹Our implementation of GNM can be found at: https://github.com/BIG-S2/keras-gnm
On the other hand, GNM is less sensitive to the sample selection and has smaller variance than the method of [17]. For the complicated setup, misspecifying the non-ignorable missingness as ignorable still leaves a large bias, even though it achieves some improvement over SM. The mean prediction accuracy of GNM is 3.7% to 4.8% higher than that of GIM.

In both sub-settings, our method always yields the smallest estimation variance and is least affected by the selection of labelled nodes. For both setups, a higher λ value leads to larger sampling bias, and consequently a more significant improvement in prediction accuracy. Figures 3 and 4 show boxplots of the prediction accuracy obtained by each method under the two model setups. They intuitively demonstrate the necessity of taking the missingness mechanism into account in order to achieve higher prediction accuracy on the unlabelled nodes.

We also empirically analyze the computational efficiency of our algorithm. Across the 50-run real-data experiments in Setting 1, GNM converged in 3 epochs in 21 runs, 4 epochs in 19 runs, 5 epochs in 7 runs, and 6 epochs in 3 runs. Figure 5 summarizes the number of iterations for the 2-layer GCN in SM and those for Step 2 of our algorithm at each epoch. It demonstrates that the computational cost of our GNM model at each epoch is comparable to that of the baseline GCN model.

References

[1] Heejung Bang and James M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.

[2] Jean-Francois Beaumont. An estimation method for nonignorable nonresponse. Survey Methodology, 26(2):131–136, 2000.

[3] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. 
Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.

[4] James R. Carpenter, Michael G. Kenward, and Stijn Vansteelandt. A comparison of multiple imputation and doubly robust estimation for analyses with missing data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 169(3):571–584, 2006.

[5] Kani Chen. Parametric models for response-biased sampling. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(4):775–789, 2001.

[6] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.

[7] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.

[8] Jae Kwang Kim and Cindy Long Yu. A semiparametric estimation of mean functionals with nonignorable missing data. Journal of the American Statistical Association, 106(493):157–165, 2011.

[9] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[10] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[11] Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data, volume 793. Wiley, 2019.

[12] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[13] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations.
In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.

[14] Jing Qin, Denis Leung, and Jun Shao. Estimation with survey data under nonignorable nonresponse or informative sampling. Journal of the American Statistical Association, 97(457):193–200, 2002.

[15] James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.

[16] James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90(429):106–121, 1995.

[17] Saharon Rosset, Ji Zhu, Hui Zou, and Trevor J. Hastie. A method for inferring label sampling mechanisms in semi-supervised learning. In Advances in Neural Information Processing Systems, pages 1161–1168, 2005.

[18] Donald B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.

[19] Joseph L. Schafer and Nathaniel Schenker. Inference with imputed conditional means. Journal of the American Statistical Association, 95(449):144–154, 2000.

[20] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015.

[21] Niansheng Tang, Puying Zhao, and Hongtu Zhu. Empirical likelihood for estimating equations with nonignorably missing data. Statistica Sinica, 24(2):723, 2014.

[22] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks.
arXiv preprint arXiv:1710.10903, 2017.

[23] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.

[24] Hui Zhao, Pu-Ying Zhao, and Nian-Sheng Tang. Empirical likelihood inference for mean functionals with nonignorably missing response data. Computational Statistics & Data Analysis, 66:101–116, 2013.

[25] Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912–919, 2003.