{"title": "Learning Influence Functions from Incomplete Observations", "book": "Advances in Neural Information Processing Systems", "page_first": 2073, "page_last": 2081, "abstract": "We study the problem of learning influence functions under incomplete observations of node activations. Incomplete observations are a major concern as most (online and real-world) social networks are not fully observable. We establish both proper and improper PAC learnability of influence functions under randomly missing observations. Proper PAC learnability under the Discrete-Time Linear Threshold (DLT) and Discrete-Time Independent Cascade (DIC) models is established by reducing incomplete observations to complete observations in a modified graph. Our improper PAC learnability result applies for the DLT and DIC models as well as the Continuous-Time Independent Cascade (CIC) model. It is based on a parametrization in terms of reachability features, and also gives rise to an efficient and practical heuristic. Experiments on synthetic and real-world datasets demonstrate the ability of our method to compensate even for a fairly large fraction of missing observations.", "full_text": "Learning Influence Functions from Incomplete Observations\n\nXinran He, Ke Xu, David Kempe, Yan Liu\nUniversity of Southern California, Los Angeles, CA 90089\n{xinranhe, xuk, dkempe, yanliu.cs}@usc.edu\n\nAbstract\n\nWe study the problem of learning influence functions under incomplete observations of node activations. Incomplete observations are a major concern as most (online and real-world) social networks are not fully observable. We establish both proper and improper PAC learnability of influence functions under randomly missing observations. Proper PAC learnability under the Discrete-Time Linear Threshold (DLT) and Discrete-Time Independent Cascade (DIC) models is established by reducing incomplete observations to complete observations in a modified graph. 
Our improper PAC learnability result applies for the DLT and DIC models as well as the Continuous-Time Independent Cascade (CIC) model. It is based on a parametrization in terms of reachability features, and also gives rise to an efficient and practical heuristic. Experiments on synthetic and real-world datasets demonstrate the ability of our method to compensate even for a fairly large fraction of missing observations.\n\n1 Introduction\nMany social phenomena, such as the spread of diseases, behaviors, technologies, or products, can naturally be modeled as the diffusion of a contagion across a network. Owing to the potentially high social or economic value of accelerating or inhibiting such diffusions, the goal of understanding the flow of information and predicting information cascades has been an active area of research [10, 7, 9, 14, 1, 20]. A key task here is learning influence functions, mapping sets of initial adopters to the individuals who will be influenced (also called active) by the end of the diffusion process [10].\nMany methods have been developed to solve the influence function learning problem [9, 7, 5, 8, 3, 16, 18, 24, 25]. Most approaches are based on fitting the parameters of a diffusion model based on observations, e.g., [8, 7, 18, 9, 16]. Recently, Du et al. [3] proposed a model-free approach to learn influence functions as coverage functions; Narasimhan et al. [16] establish proper PAC learnability of influence functions under several widely-used diffusion models.\nAll existing approaches rely on the assumption that the observations in the training dataset are complete, in the sense that all active nodes are observed as being active. However, this assumption fails to hold in virtually all practical applications [15, 6, 2, 21]. For example, social media data are usually collected through crawlers or acquired with public APIs provided by social media platforms, such as Twitter or Facebook. 
Due to non-technical reasons and established restrictions on the APIs, it is often impossible to obtain a complete set of observations even for a short period of time. In turn, the existence of unobserved nodes, links, or activations may lead to a significant misestimation of the diffusion model\u2019s parameters [19, 15].\nWe take a step towards addressing the problem of learning influence functions from incomplete observations. Missing data are a complicated phenomenon, but to address them meaningfully and rigorously, one must make at least some assumptions about the process resulting in the loss of data. We focus on random loss of observations: for each activated node independently, the node\u2019s activation is observed only with probability r, the retention rate, and fails to be observed with probability $1 - r$. Random observation loss naturally occurs when crawling data from social media, where rate restrictions are likely to affect all observations equally.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nWe establish both proper and improper PAC learnability of influence functions under incomplete observations for two popular diffusion models: the Discrete-Time Independent Cascade (DIC) and Discrete-Time Linear Threshold (DLT) models. In fact, randomly missing observations do not even significantly increase the required sample complexity. The result is proved by interpreting the incomplete observations as complete observations in a transformed graph.\nThe PAC learnability result implies good sample complexity bounds for the DIC and DLT models. 
However, the PAC learnability result does not lead to an ef\ufb01cient algorithm, as it involves\nmarginalizing a large number of hidden variables (one for each node not observed to be active).\nTowards designing more practical algorithms and obtaining learnability under a broader class of\ndiffusion models, we pursue improper learning approaches. Concretely, we use the parameterization\nof Du et al. [3] in terms of reachability basis functions, and optimize a modi\ufb01ed loss function\nsuggested by Natarajan et al. [17] to address incomplete observations. We prove that the algorithm\nensures improper PAC learning for the DIC, DLT and Continuous-Time Independent Cascade (CIC)\nmodels. Experimental results on synthetic cascades generated from these diffusion models and\nreal-world cascades in the MemeTracker dataset demonstrate the effectiveness of our approach. Our\nalgorithm achieves nearly a 20% reduction in estimation error compared to the best baseline methods\non the MemeTracker dataset.\nSeveral recent works also aim to address the issue of missing observations in social network analysis,\nbut with different emphases. For example, Chierichetti et al. [2] and Sadikov et al. [21] mainly focus\non recovering the size of a diffusion process, while our task is to learn the in\ufb02uence functions from\nseveral incomplete cascades. Myers et al. [15] mainly aim to model unobserved external in\ufb02uence\nin diffusion. Duong et al. [6] examine learning diffusion models with missing links from complete\nobservations, while we learn in\ufb02uence functions from incomplete cascades with missing activations.\nMost related to our work are papers by Wu et al. [23] and simultaneous work by Lokhov [13]. Both\nstudy the problem of network inference under incomplete observations. Lokhov proposes a dynamic\nmessage passing approach to marginalize all the missing activations, in order to infer diffusion model\nparameters using maximum likelihood estimation, while Wu et al. 
develop an EM algorithm. Notice that the goal of learning the model parameters differs from our goal of learning the influence functions directly. Both [13] and [23] provide empirical evaluation, but do not provide theoretical guarantees.\n2 Preliminaries\n2.1 Models of Diffusion and Incomplete Observations\nDiffusion Model. We model propagation of opinions, products, or behaviors as a diffusion process over a social network. The social network is represented as a directed graph $G = (V, E)$, where $n = |V|$ is the number of nodes, and $m = |E|$ is the number of edges. Each edge $e = (u, v)$ is associated with a parameter $w_{uv}$ representing the strength of influence user u has on v. We assume that the graph structure (the edge set E) is known, while the parameters $w_{uv}$ are to be learned. Depending on the diffusion model, there are different ways to represent the strength of influence between individuals. Nodes can be in one of two states, inactive or active. We say that a node gets activated if it adopts the opinion/product/behavior under the diffusion process. In this work, we focus on progressive diffusion models, where a node remains active once it gets activated.\nThe diffusion process begins with a set of seed nodes (initial adopters) S, who start active. It then proceeds in discrete or continuous time: according to a probabilistic process, additional nodes may become active based on the influence from their neighbors. Let $N(v)$ be the in-neighbors of node v and $A_t$ the set of nodes activated by time t. We consider the following three diffusion models:\nDiscrete-time Linear Threshold (DLT) model [10]: Each node v has a threshold $\\theta_v$ drawn independently and uniformly from the interval [0, 1]. 
The diffusion under the DLT model unfolds in discrete time steps: a node v becomes active at step t if the total incoming weight from its active neighbors exceeds its threshold: $\\sum_{u \\in N(v) \\cap A_{t-1}} w_{uv} \\ge \\theta_v$.\nDiscrete-time Independent Cascade (DIC) model [10]: The DIC model is also a discrete-time model. The weight $w_{uv} \\in [0, 1]$ captures an activation probability. When a node u becomes active in step t, it attempts to activate all currently inactive neighbors in step t + 1. For each neighbor v, it succeeds with probability $w_{uv}$. If it succeeds, v becomes active; otherwise, v remains inactive. Once u has made all these attempts, it does not get to make further activation attempts at later times.\nContinuous-time Independent Cascade (CIC) model [8]: The CIC model unfolds in continuous time. Each edge $e = (u, v)$ is associated with a delay distribution with $w_{uv}$ as its parameter. When a node u becomes newly active at time t, for every neighbor v that is still inactive, a delay time $d_{uv}$ is drawn from the delay distribution. Here, $d_{uv}$ is the duration it takes u to activate v, which could be infinite (if u does not succeed in activating v). Nodes are considered activated by the process if they are activated within a specified observation window $[0, \\tau]$.\nFix one of the diffusion models defined above and its parameters. For each seed set S, let $\\Delta_S$ be the distribution of final active sets. (In the case of the DIC and DLT models, this is the set of active nodes when no new activations occur; for the CIC model, it is the set of nodes active at time $\\tau$.) For any node v, let $F_v(S) = \\mathrm{Prob}_{A \\sim \\Delta_S}[v \\in A]$ be the (marginal) probability that v is activated according to the dynamics of the diffusion model with initial seeds S. Then, define the influence function $F : 2^V \\to [0, 1]^n$ mapping seed sets to the vector of marginal activation probabilities: $F(S) = [F_1(S), \\ldots, F_n(S)]$. 
Notice that the marginal probabilities do not capture the full information about the diffusion process contained in $\\Delta_S$ (since they do not observe co-activation patterns), but they are sufficient for many applications, such as influence maximization [10] and influence estimation [4].\n\nCascades and Incomplete Observations. We focus on the problem of learning influence functions from cascades. A cascade $C = (S, A)$ is a realization of the random diffusion process; S is the set of seeds and $A \\sim \\Delta_S$, $A \\supseteq S$ is the set of activated nodes at the end of the random process. Similar to Narasimhan et al. [16], we focus on activation-only observations, namely, we only observe which nodes were activated, but not when these activations occurred.$^1$\nTo capture the fact that some of the node activations may have been unobserved, we use the following model of independently randomly missing data: for each (activated) node $v \\in A \\setminus S$, the activation of v is actually observed independently with probability r. With probability $1 - r$, the node\u2019s activation is unobservable. For seed nodes $v \\in S$, the activation is never lost. Formally, define $\\tilde{A}$ as follows: each $v \\in S$ is deterministically in $\\tilde{A}$, and each $v \\in A \\setminus S$ is in $\\tilde{A}$ independently with probability r. Then, the incomplete cascade is denoted by $\\tilde{C} = (S, \\tilde{A})$.\n\n2.2 Objective Functions and Learning Goals\nTo measure estimation error, we primarily use a quadratic loss function, as in [16, 3]. For two n-dimensional vectors x, y, the quadratic loss is defined as $\\ell_{sq}(x, y) = \\frac{1}{n} \\cdot \\|x - y\\|_2^2$. We also use this notation when one or both arguments are sets: when an argument is a set S, we formally mean to use the indicator function $\\chi_S$ as a vector, where $\\chi_S(v) = 1$ if $v \\in S$, and $\\chi_S(v) = 0$ otherwise. 
In particular, for an activated set A, we write $\\ell_{sq}(A, F(S)) = \\frac{1}{n} \\|A - F(S)\\|_2^2$.\nWe now formally define the problem of learning influence functions from incomplete observations. Let $\\mathcal{P}$ be a distribution over seed sets (i.e., a distribution over $2^V$), and fix a diffusion model $\\mathcal{M}$ and parameters, together giving rise to a distribution $\\Delta_S$ for each seed set. The algorithm is given a set of M incomplete cascades $\\tilde{\\mathcal{C}} = \\{(S_1, \\tilde{A}_1), \\ldots, (S_M, \\tilde{A}_M)\\}$, where each $S_i$ is drawn independently from $\\mathcal{P}$, and $\\tilde{A}_i$ is obtained by the incomplete observation process described above from the (random) activated set $A_i \\sim \\Delta_{S_i}$. The goal is to learn an influence function F that accurately captures the diffusion process. Accuracy of the learned influence function is measured in terms of the squared error with respect to the true model: $\\mathrm{err}_{sq}[F] = \\mathbb{E}_{S \\sim \\mathcal{P}, A \\sim \\Delta_S}[\\ell_{sq}(A, F(S))]$. That is, the expectation is over the seed set and the randomness in the diffusion process, but not the data loss process.\n\n
PAC Learnability of Influence Functions. We characterize the learnability of influence functions under incomplete observations using the Probably Approximately Correct (PAC) learning framework [22]. Let $\\mathcal{F}_{\\mathcal{M}}$ be the class of influence functions under the diffusion model $\\mathcal{M}$, and $\\mathcal{F}_L$ the class of influence functions the learning algorithm is allowed to choose from. We say that $\\mathcal{F}_{\\mathcal{M}}$ is PAC learnable if there exists an algorithm $\\mathcal{A}$ with the following property: for all $\\varepsilon, \\delta \\in (0, 1)$, all parametrizations of the diffusion model, and all distributions $\\mathcal{P}$ over seed sets S: when given activation-only and incomplete training cascades $\\tilde{\\mathcal{C}} = \\{(S_1, \\tilde{A}_1), \\ldots, (S_M, \\tilde{A}_M)\\}$ with $M \\ge \\mathrm{poly}(n, m, 1/\\varepsilon, 1/\\delta)$, $\\mathcal{A}$ outputs an influence function $F \\in \\mathcal{F}_L$ satisfying $\\mathrm{Prob}_{\\tilde{\\mathcal{C}}}[\\mathrm{err}_{sq}[F] - \\mathrm{err}_{sq}[F^*] \\ge \\varepsilon] \\le \\delta$. Here, $F^* \\in \\mathcal{F}_{\\mathcal{M}}$ is the ground truth influence function. The probability is over the training cascades, including the seed set generation, the stochastic diffusion process, and the missing data process.\nWe say that an influence function learning algorithm $\\mathcal{A}$ is proper if $\\mathcal{F}_L \\subseteq \\mathcal{F}_{\\mathcal{M}}$; that is, the learned influence function is guaranteed to be an instance of the true diffusion model. Otherwise, we say that $\\mathcal{A}$ is an improper learning algorithm.\n\n$^1$Narasimhan et al. [16] refer to this model as partial observations; we change the terminology to avoid confusion with \u201cincomplete observations.\u201d\n\n3 Proper PAC Learning under Incomplete Observations\nIn this section, we establish proper PAC learnability of influence functions under the DIC and DLT models. For both diffusion models, $\\mathcal{F}_{\\mathcal{M}}$ can be parameterized by an edge parameter vector w, whose entries $w_e$ are the activation probabilities (DIC model) or edge weights (DLT model). Our goal is to find an influence function $F^w \\in \\mathcal{F}_{\\mathcal{M}}$ that outputs accurate marginal activation probabilities. While our goal is proper learning (meaning that the function must be from $\\mathcal{F}_{\\mathcal{M}}$), we do not require that the inferred parameters match the true edge parameters w. Our main theoretical results are summarized in Theorems 1 and 2.\nTheorem 1. Let $\\lambda \\in (0, 0.5)$. The class of influence functions under the DIC model in which all edge activation probabilities satisfy $w_e \\in [\\lambda, 1 - \\lambda]$ is PAC learnable under incomplete observations with retention rate r. The sample complexity$^2$ is $\\tilde{O}(\\frac{n^3 m}{\\varepsilon^2 r^4})$.\nTheorem 2. Let $\\lambda \\in (0, 0.5)$, and consider the class of influence functions under the DLT model such that the edge weight for every edge satisfies $w_e \\in [\\lambda, 1]$, and for every node v, $1 - \\sum_{u \\in N(v)} w_{uv} \\in [\\lambda, 1 - \\lambda]$. This class is PAC learnable under incomplete observations with retention rate r. The sample complexity is $\\tilde{O}(\\frac{n^3 m}{\\varepsilon^2 r^4})$.\n\n
In this section, we present the intuition and a proof sketch for the two theorems. Details of the proof are provided in Appendix B.\nThe key idea of the proof of both theorems is that a set of incomplete cascades $\\tilde{\\mathcal{C}}$ on G under the two models can be considered as essentially complete cascades on a transformed graph $\\hat{G} = (\\hat{V}, \\hat{E})$. The influence functions of nodes in $\\hat{G}$ can be learned using a modification of the result of Narasimhan et al. [16]. Subsequently, the influence functions for G are directly obtained from the influence functions for $\\hat{G}$, by exploiting that influence functions only focus on the marginal activation probabilities.\nThe transformed graph $\\hat{G}$ is built by adding a layer of n nodes to the graph G. For each node $v \\in V$ of the original graph, we add a new node $v' \\in V'$ and a directed edge $(v, v')$ with known and fixed edge parameter $\\hat{w}_{vv'} = r$. (The same parameter value serves as activation probability under the DIC model and as edge weight under the DLT model.) The new nodes $V'$ have no other incident edges, and we retain all edges $e = (u, v) \\in E$. Inferring their parameters is the learning task.\nFor each observed (incomplete) cascade $(S_i, \\tilde{A}_i)$ on G (with $S_i \\subseteq \\tilde{A}_i$), we produce an observed activation set $A'_i$ as follows: (1) for each $v \\in \\tilde{A}_i \\setminus S_i$, we let $v' \\in A'_i$ deterministically; (2) for each $v \\in S_i$ independently, we include $v' \\in A'_i$ with probability r. This defines the training cascades $\\hat{\\mathcal{C}} = \\{(S_i, A'_i)\\}$.\nNow consider any edge parameters w, applied to both G and the first layer of $\\hat{G}$. Let $F(S)$ denote the influence function on G, and $\\hat{F}(S) = [\\hat{F}_{1'}(S), \\ldots, \\hat{F}_{n'}(S)]$ the influence function of the nodes in the added layer $V'$ of $\\hat{G}$. 
Then, by the transformation, for all nodes $v \\in V$, we get that\n\n$$\\hat{F}_{v'}(S) = r \\cdot F_v(S). \\tag{1}$$\n\nAnd by the definition of the observation loss process, $\\mathrm{Prob}[v \\in \\tilde{A}_i] = r \\cdot F_v(S) = \\hat{F}_{v'}(S)$ for all non-seed nodes $v \\notin S_i$.\nWhile the cascades $\\hat{\\mathcal{C}}$ are not complete on all of $\\hat{G}$, in a precise sense, they provide complete information on the activation of nodes in $V'$. In Appendix B, we show that Theorem 2 of Narasimhan et al. [16] can be extended to provide identical guarantees for learning $\\hat{F}(S)$ from the modified observed cascades $\\hat{\\mathcal{C}}$. For the DIC model, this is a straightforward modification of the proof from [16]. For the DLT model, [16] had not established PAC learnability$^3$, so we provide a separate proof.\n\n$^2$The $\\tilde{O}$ notation suppresses poly-logarithmic dependence on $1/\\delta$, $1/\\lambda$, n, and m.\n\nBecause the results of [16] and our generalizations ensure proper learning, they provide edge weights w between the nodes of V. We use these exact same edge weights to define the learned influence functions in G. Equation (1) then implies that the learned influence functions on V satisfy $F_v(S) = \\frac{1}{r} \\cdot \\hat{F}_{v'}(S)$. The detailed analysis in Appendix B shows that the learning error only scales by a multiplicative factor $\\frac{1}{r^2}$.\nThe PAC learnability result shows that there is no information-theoretical obstacle to influence function learning under incomplete observations. However, it does not imply an efficient algorithm. The reason is that a hidden variable would be associated with each node not observed to be active, and computing the objective function for empirical risk minimization would require marginalizing over all of the hidden variables. The proper PAC learnability result also does not readily generalize to the CIC model and other diffusion models, even under complete observations. 
This is due to the lack of a succinct characterization of influence functions as under the DIC and DLT models. Therefore, in the next section, we explore improper learning approaches with the goal of designing practical algorithms and establishing learnability under a broader class of diffusion models.\n4 Efficient Improper Learning Algorithm\nInstead of parameterizing the influence functions using the edge parameters, we adopt the model-free influence function learning framework, InfluLearner, proposed by Du et al. [3] to represent the influence function as a sum of weighted basis functions. From now on, we focus on the influence function $F_v(S)$ of a single fixed node v.\n\nInfluence Function Parameterization. For all three diffusion models (CIC, DIC and DLT), the diffusion process can be characterized equivalently using live-edge graphs. Concretely, the results of [10, 4] state that for each instance of the CIC, DIC, and DLT models, there exists a distribution $\\Gamma$ over live-edge graphs H assigning probability $\\gamma_H$ to each live-edge graph H such that $F_v^*(S) = \\sum_{H: \\text{at least one node in } S \\text{ has a path to } v \\text{ in } H} \\gamma_H$.\nTo reduce the representation complexity, notice that from the perspective of activating v, two different live-edge graphs $H, H'$ are \u201cequivalent\u201d if v is reachable from exactly the same nodes in H and $H'$. Therefore, for any node set T, let $\\beta^*_T := \\sum_{H: \\text{exactly the nodes in } T \\text{ have paths to } v \\text{ in } H} \\gamma_H$. We then use characteristic vectors as feature vectors $r_T = \\chi_T$, where we will interpret the entry for node u as u having a path to v in a live-edge graph. More precisely, let $\\phi(x) = \\min\\{x, 1\\}$, and $\\chi_S$ the characteristic vector of the seed set S. Then, $\\phi(\\chi_S^\\top \\cdot r_T) = 1$ if and only if v is reachable from S, and we can write $F_v^*(S) = \\sum_T \\beta^*_T \\cdot \\phi(\\chi_S^\\top \\cdot r_T)$.\nThis representation still has exponentially many features (one for each T). 
In order to make the learning problem tractable, we sample a smaller set $\\mathcal{T}$ of K features from a suitably chosen distribution, implicitly setting the weights $\\beta_T$ of all other features to 0. Thus, we will parametrize the learned influence function as $F_v^{\\beta}(S) = \\sum_{T \\in \\mathcal{T}} \\beta_T \\cdot \\phi(\\chi_S^\\top \\cdot r_T)$.\nThe goal is then to learn weights $\\beta_T$ for the sampled features. (They will form a distribution, i.e., $\\|\\beta\\|_1 = 1$ and $\\beta \\ge 0$.) The crux of the analysis is to show that a sufficiently small number K of features (i.e., sampled sets) suffices for a good approximation, and that the weights can be learned efficiently from a limited number of observed incomplete cascades. Specifically, we consider the log likelihood function $\\ell(t, y) = y \\log t + (1 - y) \\log(1 - t)$, and learn the parameter vector $\\beta$ (a distribution) by maximizing the likelihood $\\sum_{i=1}^{M} \\ell(F_v^{\\beta}(S_i), \\chi_{A_i}(v))$.\n\n$^3$[16] shows that the DLT model with fixed thresholds is PAC learnable under complete cascades. We study the DLT model when the thresholds are uniformly distributed random variables.\n\nHandling Incomplete Observations. The maximum likelihood estimation cannot be directly applied to incomplete cascades, as we do not have access to $A_i$ (only the incomplete version $\\tilde{A}_i$). To address this issue, notice that the MLE problem is actually a binary classification problem with log loss and $y_i = \\chi_{A_i}(v)$ as the label. From this perspective, incompleteness is simply class-conditional noise on the labels. Let $\\tilde{y}_i = \\chi_{\\tilde{A}_i}(v)$ be our observation of whether v was activated or not under the incomplete cascade i. Then, $\\mathrm{Prob}[\\tilde{y}_i = 1 \\mid y_i = 1] = r$ and $\\mathrm{Prob}[\\tilde{y}_i = 1 \\mid y_i = 0] = 0$. In words, the incomplete observation $\\tilde{y}_i$ suffers from one-sided error compared to the complete observation $y_i$. By results of Natarajan et al. 
[17], we can construct an unbiased estimator of $\\ell(t, y)$ using only the incomplete observations $\\tilde{y}$, as in the following lemma.\nLemma 3 (Corollary of Lemma 1 of [17]). Let y be the true activation of node v and $\\tilde{y}$ the incomplete observation. Then, defining $\\tilde{\\ell}(t, y) := \\frac{1}{r} y \\log t + \\frac{2r - 1}{r} (1 - y) \\log(1 - t)$, for any t, we have $\\mathbb{E}_{\\tilde{y}}[\\tilde{\\ell}(t, \\tilde{y})] = \\ell(t, y)$.\nBased on this lemma, we arrive at the final algorithm of solving the maximum likelihood estimation problem with the adjusted likelihood function $\\tilde{\\ell}(t, y)$:\n\n$$\\text{Maximize } \\sum_{i=1}^{M} \\tilde{\\ell}(F_v^{\\beta}(S_i), \\chi_{\\tilde{A}_i}(v)) \\quad \\text{subject to } \\|\\beta\\|_1 = 1, \\ \\beta \\ge 0. \\tag{2}$$\n\nWe analyze conditions under which the solution to (2) provides improper PAC learnability under incomplete observations; these conditions will apply for all three diffusion models.\nThese conditions are similar to those of Lemma 1 in the work of Du et al. [3], and concern the approximability of the reachability distribution $\\beta^*_T$. Specifically, let q be a distribution over node sets T such that $q(T) \\le C \\beta^*_T$ for all node sets T. Here, C is a (possibly very large) number that we will want to bound below. Let $T_1, \\ldots, T_K$ be K i.i.d. samples drawn from the distribution q. The features are then $r_k = \\chi_{T_k}$. We use the truncated version of the function $F_v^{\\beta,\\lambda}(S)$ with parameter$^4$ $\\lambda$ as in [3]: $F_v^{\\beta,\\lambda}(S) = (1 - 2\\lambda) F_v^{\\beta}(S) + \\lambda$.\nLet $\\mathcal{M}_\\lambda$ be the class of all such truncated influence functions, and $F_v^{\\tilde{\\beta},\\lambda} \\in \\mathcal{M}_\\lambda$ the influence functions obtained from the optimization problem (2). The following theorem (proved in Appendix C) establishes the accuracy of the learned functions.\nTheorem 4. 
Assume that the learning algorithm uses $K = \\tilde{\\Omega}(\\frac{C^2}{\\varepsilon^2})$ features in the influence function it constructs, and observes$^5$ $M = \\tilde{\\Omega}(\\frac{\\log C}{\\varepsilon^4 r^2})$ incomplete cascades with retention rate r. Then, with probability at least $1 - \\delta$, the learned influence functions $F_v^{\\tilde{\\beta},\\lambda}$ for each node v and seed distribution $\\mathcal{P}$ satisfy $\\mathbb{E}_{S \\sim \\mathcal{P}}[(F_v^{\\tilde{\\beta},\\lambda}(S) - F_v^*(S))^2] \\le \\varepsilon$.\nThe theorem implies that with enough incomplete cascades, an algorithm can approximate the ground truth influence function to arbitrary accuracy. Therefore, all three diffusion models are improperly PAC learnable under incomplete observations. The final sample complexity does not explicitly contain the graph size; that dependence is implicitly captured by C, which will depend on the graph and how well the distribution $\\beta^*_T$ can be approximated. Notice that with r = 1, our bound on M has logarithmic dependency on C instead of the polynomial dependency in [3]. The reason for this improvement is discussed further in Appendix C.\n\nEfficient Implementation. As mentioned above, the features T cannot be sampled from the exact reachability distribution $\\beta^*_T$, because it is inaccessible (and complex). In order to obtain useful guarantees from Theorem 4, we follow the approach of Du et al. [3], and approximate the distribution $\\beta^*_T$ with the product of the marginal distributions, estimated from observed cascades.\nThe optimization problem (2) is convex and can therefore be solved in time polynomial in the number of features K. However, the guarantees in Theorem 4 require a possibly large number of features. In order to obtain an efficient algorithm for practical use and our experiments, we sacrifice the guarantee and use a fixed number of features.\n5 Experiments\nIn this section, we experimentally evaluate the algorithm from Section 4. 
Since no other methods explicitly account for incomplete observations, we compare it to several state-of-the-art methods for influence function learning with full information. Hence, the main goal of the comparison is to examine to what extent the impact of missing data can be mitigated by being aware of it. We compare our algorithm to the following approaches: (1) CIC fits the parameters of a CIC model, using the NetRate algorithm [7] with exponential delay distribution. (2) DIC fits the activation probabilities of a DIC model using the method in [18]. (3) InfluLearner is the model-free approach proposed by Du et al. in [3] and discussed in Section 4. (4) Logistic uses logistic regression to learn the influence functions $F_u(S) = f(\\chi_S^\\top \\cdot c_u + b)$ for each u independently, where $c_u$ is a learnable weight vector and $f(x) = \\frac{1}{1 + e^{-x}}$ is the logistic function. (5) Linear uses linear regression to learn the total influence $\\sigma(S) = c^\\top \\cdot \\chi_S + b$ of the set S. Notice that the CIC and DIC methods have access to the activation time of each node in addition to the final activation status, giving them an inherent advantage.\n\nFigure 1: MAE of estimated influence as a function of the retention rate on synthetic datasets for (a) CIC model, (b) DIC model, (c) DLT model. The error bars show one standard deviation.\n\n$^4$The proof of Theorem 4 in Appendix C will show how to choose $\\lambda$.\n$^5$The $\\tilde{\\Omega}$ notation suppresses all logarithmic terms except $\\log C$, as C could be exponential or worse in the number of nodes.\n\n5.1 Synthetic cascades\nData generation. We generate synthetic networks with core-peripheral structure following the Kronecker graph model [12] with parameter matrix [0.9, 0.5; 0.5, 0.3].$^6$ Each generated network has 512 nodes and 1024 edges. 
Subsequently, we generate 8192 cascades as training data using the CIC, DIC and DLT models, with random seed sets whose sizes are power law distributed. The retention rate is varied between 0.1 and 0.9. The test set contains 200 independently sampled seed sets generated in the same way as the training data. Details of the data generation process are provided in Appendix A.\n\nAlgorithm settings. We apply all algorithms to cascades generated from all three models; that is, we also consider the results under model misspecification. Whenever applicable, we set the hyperparameters of the five comparison algorithms to the ground truth values. When applying the NetRate algorithm to discrete-time cascades, we set the observation window to 10.0. When applying the method in [18] to continuous-time cascades, we round activation times up to the nearest multiple of 0.1, resulting in 10 discrete time steps. For the model-free approaches (InfluLearner and our algorithm), we use K = 200 features.\n\nResults. Figure 1 shows the Mean Absolute Error (MAE) between the estimated total influence $\\sigma(S)$ and the true influence value, averaged over all testing seed sets. For each setting (diffusion model and retention rate), the reported MAE is averaged over five independent runs.\nThe main insight is that accounting for missing observations indeed strongly mitigates their effect: notice that for retention rates as small as 0.5, our algorithm can almost completely compensate for the data loss, whereas both the model-free and parameter fitting approaches deteriorate significantly even for retention rates close to 1. For the parameter fitting approaches, even such large retention rates can lead to missing and spurious edges in the inferred networks, and thus significant estimation errors. 
Additional observations include that fitting influence using (linear or logistic) regression does not perform well at all, and that the CIC inference approach appears more robust to model misspecification than the DIC approach.\n\n$^6$We also experimented on Kronecker graphs with hierarchical community structure ([0.9, 0.1; 0.1, 0.9]) and random structure ([0.5, 0.5; 0.5, 0.5]). The results are similar and omitted due to space constraints.\n\nFigure 2: Relative error in MAE under retention rate misspecification. x-axis: retention rate r used by the algorithm. y-axis: relative difference of MAE compared to using the true retention rate 0.8. (Legend: CIC, DIC, DLT.)\n\nFigure 3: MAE of influence estimation on seven sets of real-world cascades with 20% of activations missing. (x-axis: groups of memes 1 to 7; y-axis: MAE; legend: Linear, Logistic, DIC, CIC, InfluLearner, Our Method.)\n\nSensitivity of retention rate. We presented the algorithms as knowing r. Since r itself is inferred from noisy data, it may be somewhat misestimated. Figure 2 shows the impact of misestimating r. We generate synthetic cascades from all three diffusion models with a true retention rate of 0.8, and then apply our algorithm with (incorrect) retention rate $r \\in \\{0.6, 0.65, \\ldots, 0.95, 1\\}$. The results are averaged over five independent runs. 
While the performance decreases as the misestimation gets\nworse (after all, with r = 1, the algorithm is basically the same as InfluLearner), the degradation is\ngraceful.\n\n5.2 Influence Estimation on real cascades\n\nWe further evaluate the performance of our method on the real-world MemeTracker\u2077 dataset [11].\nThe dataset consists of the propagation of short textual phrases, referred to as Memes, via the\npublication of blog posts and main-stream media news articles between March 2011 and February\n2012. Specifically, the dataset contains seven groups of cascades corresponding to the propagation\nof Memes with certain keywords, namely \u201capple and jobs\u201d, \u201ctsunami earthquake\u201d, \u201cwilliam kate\nmarriage\u201d, \u201coccupy wall-street\u201d, \u201cairstrikes\u201d, \u201cegypt\u201d and \u201celections\u201d. Each cascade group consists\nof 1000 nodes, with a number of cascades varying from 1000 to 44000. We follow exactly the same\nevaluation method as Du et al. [3] with a training/test set split of 60%/40%.\nTo test the performance of influence function learning under incomplete observations, we randomly\ndelete 20% of the occurrences, setting r = 0.8. The results for other retention rates are similar and\nomitted. Figure 3 shows the MAE of our method and the five baselines, averaged over 100 random\ndraws of test seed sets, for all groups of memes. While some baselines perform very poorly, our\nalgorithm provides an 18% reduction in MAE (averaged over the seven groups) even compared to\nthe best baseline (InfluLearner), showing the potential of data loss awareness to mitigate its effects.\n\n6 Extensions and Future Work\nIn the full version available on arXiv, we show both experimentally and theoretically how to generalize\nour results to non-uniform (but independent) loss of node activations, and how to deal with a\nmisestimation of the retention rate r. 
Any non-trivial partial information about r leads to positive\nPAC learnability results.\nA much more significant departure for future work would be dependent loss of activations, e.g., losing\nall activations of some randomly chosen nodes. As another direction, it would be worthwhile to\ngeneralize the PAC learnability results to other diffusion models, and to design an efficient algorithm\nwith PAC learning guarantees.\n\nAcknowledgments\nWe would like to thank anonymous reviewers for useful feedback. The research was sponsored\nin part by NSF research grant IIS-1254206 and by the U.S. Defense Advanced Research Projects\nAgency (DARPA) under the Social Media in Strategic Communication (SMISC) program, Agreement\nNumber W911NF-12-1-0034. The views and conclusions are those of the authors and should not be\ninterpreted as representing the official policies of the funding agency or the U.S. Government.\n\n7 We use the preprocessed version of the dataset released by Du et al. [3] and available at\nhttp://www.cc.gatech.edu/~ndu8/InfluLearner.html. Notice that the dataset is semi-real, as multi-node seed cascades\nare artificially created by merging single-node seed cascades.\n\nReferences\n[1] K. Amin, H. Heidari, and M. Kearns. Learning from contagion (without timestamps). In Proc. 31st ICML, pages 1845\u20131853, 2014.\n\n[2] F. Chierichetti, D. Liben-Nowell, and J. M. Kleinberg. Reconstructing Patterns of Information Diffusion from Incomplete Observations. In Proc. 23rd NIPS, pages 792\u2013800, 2011.\n\n[3] N. Du, Y. Liang, M.-F. Balcan, and L. Song. Influence Function Learning in Information Diffusion Networks. In Proc. 31st ICML, pages 2016\u20132024, 2014.\n\n[4] N. Du, L. Song, M. Gomez-Rodriguez, and H. Zha. Scalable Influence Estimation in Continuous-Time Diffusion Networks. In Proc. 25th NIPS, pages 3147\u20133155, 2013.\n\n[5] N. Du, L. Song, S. Yuan, and A. J. Smola. 
Learning Networks of Heterogeneous Influence. In Proc. 24th NIPS, pages 2780\u20132788, 2012.\n\n[6] Q. Duong, M. P. Wellman, and S. P. Singh. Modeling information diffusion in networks with unobserved links. In SocialCom/PASSAT, pages 362\u2013369, 2011.\n\n[7] M. Gomez-Rodriguez, D. Balduzzi, and B. Sch\u00f6lkopf. Uncovering the temporal dynamics of diffusion networks. In Proc. 28th ICML, pages 561\u2013568, 2011.\n\n[8] M. Gomez-Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. ACM Transactions on Knowledge Discovery from Data, 5(4), 2012.\n\n[9] A. Goyal, F. Bonchi, and L. V. S. Lakshmanan. Learning influence probabilities in social networks. In Proc. 3rd WSDM, pages 241\u2013250, 2010.\n\n[10] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the Spread of Influence in a Social Network. In Proc. 9th KDD, pages 137\u2013146, 2003.\n\n[11] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In Proc. 15th KDD, pages 497\u2013506, 2009.\n\n[12] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. The Journal of Machine Learning Research, 11:985\u20131042, 2010.\n\n[13] A. Lokhov. Reconstructing parameters of spreading models from partial observations. In Proc. 28th NIPS, pages 3459\u20133467, 2016.\n\n[14] S. A. Myers and J. Leskovec. On the convexity of latent social network inference. In Proc. 22nd NIPS, pages 1741\u20131749, 2010.\n\n[15] S. A. Myers, C. Zhu, and J. Leskovec. Information Diffusion and External Influence in Networks. In Proc. 18th KDD, pages 33\u201341, 2012.\n\n[16] H. Narasimhan, D. C. Parkes, and Y. Singer. Learnability of Influence in Networks. In Proc. 27th NIPS, pages 3168\u20133176, 2015.\n\n[17] N. Natarajan, I. S. Dhillon, P. K. 
Ravikumar, and A. Tewari. Learning with noisy labels. In Proc. 25th NIPS, pages 1196\u20131204, 2013.\n\n[18] N. Praneeth and S. Sujay. Learning the Graph of Epidemic Cascades. In Proc. 12th ACM Sigmetrics, pages 211\u2013222, 2012.\n\n[19] D. Quang, W. M. P, and S. S. P. Modeling Information Diffusion in Networks with Unobserved Links. In SocialCom, pages 362\u2013369, 2011.\n\n[20] N. Rosenfeld, M. Nitzan, and A. Globerson. Discriminative learning of infection models. In Proc. 9th WSDM, pages 563\u2013572, 2016.\n\n[21] E. Sadikov, M. Medina, J. Leskovec, and H. Garcia-Molina. Correcting for missing data in information cascades. In Proc. 4th WSDM, pages 55\u201364, 2011.\n\n[22] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134\u20131142, 1984.\n\n[23] X. Wu, A. Kumar, D. Sheldon, and S. Zilberstein. Parameter learning for latent network diffusion. In Proc. 29th IJCAI, pages 2923\u20132930, 2013.\n\n[24] S.-H. Yang and H. Zha. Mixture of Mutually Exciting Processes for Viral Diffusion. In Proc. 30th ICML, pages 1\u20139, 2013.\n\n[25] K. Zhou, H. Zha, and L. Song. Learning Social Infectivity in Sparse Low-rank Network Using Multi-dimensional Hawkes Processes. In Proc. 30th ICML, pages 641\u2013649, 2013.\n", "award": [], "sourceid": 1093, "authors": [{"given_name": "Xinran", "family_name": "He", "institution": "USC"}, {"given_name": "Ke", "family_name": "Xu", "institution": "USC"}, {"given_name": "David", "family_name": "Kempe", "institution": "USC"}, {"given_name": "Yan", "family_name": "Liu", "institution": "University of Southern California"}]}