{"title": "Learning Time-Varying Coverage Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 3374, "page_last": 3382, "abstract": "Coverage functions are an important class of discrete functions that capture laws of diminishing returns. In this paper, we propose a new problem of learning time-varying coverage functions which arise naturally from applications in social network analysis, machine learning, and algorithmic game theory. We develop a novel parametrization of the time-varying coverage function by illustrating the connections with counting processes. We present an efficient algorithm to learn the parameters by maximum likelihood estimation, and provide a rigorous theoretic analysis of its sample complexity. Empirical experiments from information diffusion in social network analysis demonstrate that with few assumptions about the underlying diffusion process, our method performs significantly better than existing approaches on both synthetic and real world data.", "full_text": "Learning Time-Varying Coverage Functions\n\nNan Du\u2020, Yingyu Liang\u2021, Maria-Florina Balcan(cid:5), Le Song\u2020\n\n\u2020College of Computing, Georgia Institute of Technology\n\u2021Department of Computer Science, Princeton University\n(cid:5)School of Computer Science, Carnegie Mellon University\ndunan@gatech.edu,yingyul@cs.princeton.edu\n\nninamf@cs.cmu.edu,lsong@cc.gatech.edu\n\nAbstract\n\nCoverage functions are an important class of discrete functions that capture the\nlaw of diminishing returns arising naturally from applications in social network\nanalysis, machine learning, and algorithmic game theory. In this paper, we pro-\npose a new problem of learning time-varying coverage functions, and develop a\nnovel parametrization of these functions using random features. 
Based on the connection between time-varying coverage functions and counting processes, we also propose an efficient parameter learning algorithm based on likelihood maximization, and provide a sample complexity analysis. We applied our algorithm to the influence function estimation problem in information diffusion in social networks, and show that with few assumptions about the diffusion processes, our algorithm is able to estimate influence significantly more accurately than existing approaches on both synthetic and real world data.\n\n1 Introduction\n\nCoverage functions are a special class of the more general submodular functions, which play an important role in combinatorial optimization with many interesting applications in social network analysis [1], machine learning [2], economics and algorithmic game theory [3], etc. A particularly important example of coverage functions in practice is the influence function of users in information diffusion modeling [1]: news spreads across social networks by word-of-mouth, and a set of influential sources can collectively trigger a large number of follow-ups. Another example of coverage functions is the valuation functions of customers in economics and game theory [3]: customers are thought to have certain requirements, and the items being bundled and offered fulfill certain subsets of these demands.\nTheoretically, it is usually assumed that users\u2019 influence or customers\u2019 valuations are known in advance as an oracle. In practice, however, these functions must be learned. For example, given past traces of information spreading in social networks, a social platform host would like to estimate how many follow-ups a set of users can trigger. Or, given past data of customer reactions to different bundles, a retailer would like to estimate how likely customers are to respond to new packages of goods. 
Learning such combinatorial functions has attracted many recent research efforts from both theoretical and practical sides (e.g., [4, 5, 6, 7, 8]), many of which show that coverage functions can be learned from just a polynomial number of samples.\nHowever, prior work has largely ignored an important dynamic aspect of coverage functions. For instance, information spreading is a dynamic process in social networks, and the number of follow-ups of a fixed set of sources can increase as observation time increases. A bundle of items or features offered to customers may trigger a sequence of customer actions over time. These real world problems motivate us to consider a novel time-varying coverage function, f(S, t), which is a coverage function of the set S when we fix a time t, and a continuous monotonic function of time t when we fix a set S. While learning time-varying combinatorial structures has been explored in the graphical model setting (e.g., [9, 10]), as far as we are aware, learning time-varying coverage functions has not been addressed in the literature. Furthermore, we are interested in estimating the entire function of t, rather than just treating the time t as a discrete index and learning the function value at a small number of discrete points. From this perspective, our formulation is a generalization of the most recent work [8] with even fewer assumptions about the data used to learn the model.\nGenerally, we assume that the historical data are provided in pairs of a set and a collection of timestamps at which events caused by the set occur. Hence, such a collection of temporal events associated with a particular set S_i can be modeled by a counting process N_i(t), t \u2265 0, which is a stochastic process with values that are nonnegative, integer, and nondecreasing over time [11]. 
For instance, in the information diffusion setting of online social networks, given a set of earlier adopters of some new product, N_i(t) models the time sequence of all triggered events of the followers, where each jump in the process records the timing t_ij of an action. In the economics and game theory setting, the counting process N_i(t) records the number of actions a customer has taken over time given a particular bundled offer. This raises the question of how to estimate the time-varying coverage function from the angle of counting processes. We thus propose a novel formulation which builds a connection between the two by modeling the cumulative intensity function of a counting process as a time-varying coverage function. The key idea is to parametrize the intensity function as a weighted combination of random kernel functions. We then develop an efficient learning algorithm, TCOVERAGELEARNER, to estimate the parameters of the function using a maximum likelihood approach. We show that our algorithm can provably learn the time-varying coverage function using only a polynomial number of samples. Finally, we validate TCOVERAGELEARNER on both influence estimation and maximization problems by using cascade data from information diffusion. We show that our method performs significantly better than alternatives with little prior knowledge about the dynamics of the actual underlying diffusion processes.\n\n2 Time-Varying Coverage Function\nWe will first give a formal definition of the time-varying coverage function, and then explain its additional properties in detail.\nDefinition. Let U be a (potentially uncountable) domain. We endow U with some \u03c3-algebra A and denote a probability distribution on U by P. 
A coverage function is a combinatorial function over a finite set V of items, defined as\n\n$$f(S) := Z \\cdot P\\Big(\\bigcup_{s \\in S} U_s\\Big), \\quad \\text{for all } S \\in 2^V, \\tag{1}$$\n\nwhere U_s \u2282 U is the subset of domain U covered by item s \u2208 V, and Z is the additional normalization constant. For time-varying coverage functions, we let the subset U_s grow monotonically over time, that is\n\n$$U_s(t) \\subseteq U_s(\\tau), \\quad \\text{for all } t \\leq \\tau \\text{ and } s \\in V, \\tag{2}$$\n\nwhich results in a combinatorial temporal function\n\n$$f(S, t) = Z \\cdot P\\Big(\\bigcup_{s \\in S} U_s(t)\\Big), \\quad \\text{for all } S \\in 2^V. \\tag{3}$$\n\nIn this paper, we assume that f(S, t) is smooth and continuous, and its first order derivative with respect to time, f'(S, t), is also smooth and continuous.\nRepresentation. We now show that a time-varying coverage function, f(S, t), can be represented as an expectation over random functions based on multidimensional step basis functions. Since U_s(t) is varying over time, we can associate each u \u2208 U with a |V|-dimensional vector \u03c4_u of change points. In particular, the s-th coordinate of \u03c4_u records the time that source node s covers u. Let \u03c4 be a random variable obtained by sampling u according to P and setting \u03c4 = \u03c4_u. Note that given all \u03c4_u we can compute f(S, t); now we claim that the distribution of \u03c4 is sufficient.\nWe first introduce some notation. Based on \u03c4_u we define a |V|-dimensional step function r_u(t) : R_+ \u2192 {0, 1}^|V|, where the s-th dimension of r_u(t) is 1 if u is covered by the set U_s(t) at time t, and 0 otherwise. To emphasize the dependence of the function r_u(t) on \u03c4_u, we will also write r_u(t) as r_u(t|\u03c4_u). We denote the indicator vector of a set S by \u03c7_S \u2208 {0, 1}^|V|, where the s-th dimension of \u03c7_S is 1 if s \u2208 S, and 0 otherwise. Then u \u2208 U is covered by \u22c3_{s \u2208 S} U_s(t) at time t if \u03c7_S^\u22a4 r_u(t) \u2265 1.\n\nLemma 1. There exists a distribution Q(\u03c4) over the vector of change points \u03c4, such that the time-varying coverage function can be represented as\n\n$$f(S, t) = Z \\cdot E_{\\tau \\sim Q(\\tau)}\\big[\\phi(\\chi_S^\\top r(t|\\tau))\\big], \\tag{4}$$\n\nwhere \u03c6(x) := min{x, 1}, and r(t|\u03c4) is a multidimensional step function parameterized by \u03c4.\n\nProof. Let U_S := \u22c3_{s \u2208 S} U_s(t). By definition (3), we have the following integral representation\n\n$$f(S, t) = Z \\cdot \\int_U I\\{u \\in U_S\\}\\,dP(u) = Z \\cdot \\int_U \\phi(\\chi_S^\\top r_u(t))\\,dP(u) = Z \\cdot E_{u \\sim P(u)}\\big[\\phi(\\chi_S^\\top r_u(t))\\big].$$\n\nWe can define the set of u having the same \u03c4 as U_\u03c4 := {u \u2208 U | \u03c4_u = \u03c4} and define a distribution over \u03c4 as $dQ(\\tau) := \\int_{U_\\tau} dP(u)$. Then the integral representation of f(S, t) can be rewritten as\n\n$$Z \\cdot E_{u \\sim P(u)}\\big[\\phi(\\chi_S^\\top r_u(t))\\big] = Z \\cdot E_{\\tau \\sim Q(\\tau)}\\big[\\phi(\\chi_S^\\top r(t|\\tau))\\big],$$\n\nwhich proves the lemma.\n\n3 Model for Observations\nIn general, we assume that the input data are provided in the form of pairs, (S_i, N_i(t)), where S_i is a set, and N_i(t) is a counting process in which each jump of N_i(t) records the timing of an event. We first give a brief overview of a counting process [11] and then motivate our model in detail.\nCounting Process. Formally, a counting process {N(t), t \u2265 0} is any nonnegative, integer-valued stochastic process such that N(t') \u2264 N(t) whenever t' \u2264 t and N(0) = 0. The most common use of a counting process is to count the number of occurrences of temporal events happening along time, so the index set is usually taken to be the nonnegative real numbers R_+. 
A counting process is a submartingale: E[N(t)|H_{t'}] \u2265 N(t') for all t > t', where H_{t'} denotes the history up to time t'. By the Doob-Meyer theorem [11], N(t) has the unique decomposition\n\n$$N(t) = \\Lambda(t) + M(t), \\tag{5}$$\n\nwhere \u039b(t) is a nondecreasing predictable process called the compensator (or cumulative intensity), and M(t) is a mean zero martingale. Since E[dM(t)|H_{t\u2212}] = 0, where dM(t) is the increment of M(t) over a small time interval [t, t + dt), and H_{t\u2212} is the history until just before time t,\n\n$$E[dN(t)|H_{t-}] = d\\Lambda(t) := a(t)\\,dt, \\tag{6}$$\n\nwhere a(t) is called the intensity of a counting process.\nModel formulation. We assume that the cumulative intensity of the counting process is modeled by a time-varying coverage function, i.e., the observation pair (S_i, N_i(t)) is generated by\n\n$$N_i(t) = f(S_i, t) + M_i(t) \\tag{7}$$\n\nin the time window [0, T] for some T > 0, and df(S, t) = a(S, t)dt. In other words, the time-varying coverage function controls the propensity of events occurring over time. Specifically, for a fixed set S_i, as time t increases, the cumulative number of events observed grows accordingly because f(S_i, t) is a continuous monotonic function over time; for a given time t, as the set S_i changes to another set S_j, the amount of coverage over domain U may change and hence can result in a different cumulative intensity. This abstract model can be mapped to real world applications. In the information diffusion context, for a fixed set of sources S_i, as time t increases, the number of influenced nodes in the social network tends to increase; for a given time t, if we change the sources to S_j, the number of influenced nodes may be different depending on how influential the sources are. 
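As a concrete illustration of the two behaviors just described, the following minimal Python sketch evaluates the representation of Lemma 1 on a small, made-up sample of change-point vectors. The function name coverage_value, the toy change points, and the use of a plain empirical average in place of the expectation over Q are assumptions made for illustration; they are not taken from the paper.

```python
def coverage_value(source_set, t, taus, Z=1.0):
    # Empirical version of f(S, t) = Z * E_tau[ phi(chi_S^T r(t | tau)) ],
    # with phi(x) = min(x, 1); tau[s] is the time at which item s covers
    # the sampled element (infinity means it never does).
    total = 0.0
    for tau in taus:
        covered_by = sum(1 for s in source_set if tau[s] <= t)  # chi_S^T r(t | tau)
        total += min(covered_by, 1)                             # phi caps the count at 1
    return Z * total / len(taus)


# Hypothetical change points of items 0, 1, 2 for four sampled elements.
INF = float('inf')
taus = [
    {0: 1.0, 1: 4.0, 2: INF},
    {0: 2.0, 1: 2.5, 2: 6.0},
    {0: INF, 1: 3.0, 2: 1.5},
    {0: INF, 1: INF, 2: INF},
]

# Fixed set, growing time: the value is nondecreasing in t.
values_over_time = [coverage_value({0, 1}, t, taus, Z=4.0) for t in (0.5, 1.0, 2.0, 3.0)]

# Fixed time, growing set: item 2 adds less once item 1 is already present.
gain_small = coverage_value({0, 2}, 3.0, taus) - coverage_value({0}, 3.0, taus)
gain_large = coverage_value({0, 1, 2}, 3.0, taus) - coverage_value({0, 1}, 3.0, taus)
```

On this toy sample the value of the set {0, 1} grows with t, and the marginal gain of item 2 shrinks when the base set is enlarged, which is the diminishing-returns property of coverage functions.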
In the economics and game theory context, for a fixed bundle of offers S_i, as time t increases, it is more likely that the merchant will observe the customers\u2019 actions in response to the offers; even at the same time t, different bundles of offers, S_i and S_j, may have very different ability to drive the customers\u2019 actions.\nCompared to a regression model y_i = g(S_i) + \u03b5_i with i.i.d. input data (S_i, y_i), our model outputs a special random function over time, that is, a counting process N_i(t) with the noise being a zero mean martingale M_i(t). In contrast to functional regression models, our model exploits much more interesting structures of the problem. For instance, the random function representation in the last section can be used to parametrize the model. Such special structure of the counting process allows us to estimate the parameter of our model using a maximum likelihood approach efficiently, and the martingale noise enables us to use exponential concentration inequalities in analyzing our algorithm.\n\n4 Parametrization\nBased on the following two mild assumptions, we will show how to parametrize the intensity function as a weighted combination of random kernel functions, learn the parameters by maximum likelihood estimation, and eventually derive a sample complexity.\n\n(A1) a(S, t) is smooth and bounded on [0, T]: 0 < a_min \u2264 a \u2264 a_max < \u221e, and \u00e4 := d^2a/dt^2 is absolutely continuous with \u222b \u00e4(t) dt < \u221e.\n(A2) There is a known distribution Q'(\u03c4) and a constant C with Q'(\u03c4)/C \u2264 Q(\u03c4) \u2264 C Q'(\u03c4).\n\nKernel Smoothing. To facilitate our finite dimensional parameterization, we first convolve the intensity function with K(t) = k(t/\u03c3)/\u03c3, where \u03c3 is the bandwidth parameter and k is a kernel function (such as the Gaussian RBF kernel $k(t) = e^{-t^2/2}/\\sqrt{2\\pi}$) with\n\n$$0 \\leq k(t) \\leq \\kappa_{\\max}, \\quad \\int k(t)\\,dt = 1, \\quad \\int t\\,k(t)\\,dt = 0, \\quad \\text{and} \\quad \\sigma_k^2 := \\int t^2 k(t)\\,dt < \\infty. \\tag{8}$$\n\nThe convolution results in a smoothed intensity a^K(S, t) = K(t) \u22c6 (df(S, t)/dt) = d(K(t) \u22c6 \u039b(S, t))/dt. By the property of convolution and exchanging derivative with integral, we have that\n\n$$\\begin{aligned} a^K(S, t) &= d\\big(Z \\cdot E_{\\tau \\sim Q(\\tau)}[K(t) \\star \\phi(\\chi_S^\\top r(t|\\tau))]\\big)/dt && \\text{by definition of } f(\\cdot) \\\\ &= Z \\cdot E_{\\tau \\sim Q(\\tau)}\\big[d(K(t) \\star \\phi(\\chi_S^\\top r(t|\\tau)))/dt\\big] && \\text{exchange derivative and integral} \\\\ &= Z \\cdot E_{\\tau \\sim Q(\\tau)}\\big[K(t) \\star \\delta(t - t(S, \\tau))\\big] && \\text{by property of convolution and function } \\phi(\\cdot) \\\\ &= Z \\cdot E_{\\tau \\sim Q(\\tau)}\\big[K(t - t(S, \\tau))\\big] && \\text{by definition of } \\delta(\\cdot) \\end{aligned}$$\n\nwhere t(S, \u03c4) is the time when the function \u03c6(\u03c7_S^\u22a4 r(t|\u03c4)) jumps from 0 to 1. If we choose a small enough kernel bandwidth, a^K only incurs a small bias from a. But the smoothed intensity still results in an infinite number of parameters, due to the unknown distribution Q(\u03c4). To address this problem, we design the following random approximation with a finite number of parameters.\n\nRandom Function Approximation. The key idea is to sample a collection of W random change points \u03c4 from a known distribution Q'(\u03c4) which can be different from Q(\u03c4). If Q'(\u03c4) is not very far away from Q(\u03c4), the random approximation will be close to a^K, and thus close to a. More specifically, we will denote the space of weighted combinations of W random kernel functions by\n\n$$A = \\Big\\{ a^K_w(S, t) = \\sum_{i=1}^W w_i K(t - t(S, \\tau_i)) : w \\geq 0, \\; \\frac{Z}{C} \\leq \\|w\\|_1 \\leq ZC \\Big\\}, \\quad \\{\\tau_i\\} \\overset{\\text{i.i.d.}}{\\sim} Q'(\\tau). \\tag{9}$$\n\nLemma 2. If $W = \\tilde{O}(Z^2/(\\epsilon\\sigma)^2)$, then with probability \u2265 1 \u2212 \u03b4, there exists an $\\tilde{a} \\in A$ such that\n\n$$E_S E_t\\big[(a(S, t) - \\tilde{a}(S, t))^2\\big] := E_{S \\sim P(S)} \\int_0^T \\big[(a(S, t) - \\tilde{a}(S, t))^2\\big]\\,dt/T = O(\\epsilon^2 + \\sigma^4).$$\n\nThe lemma then suggests setting the kernel bandwidth $\\sigma = O(\\sqrt{\\epsilon})$ to get $O(\\epsilon^2)$ approximation error.\n\n5 Learning Algorithm\nWe develop a learning algorithm, referred to as TCOVERAGELEARNER, to estimate the parameters of a^K_w(S, t) by maximizing the joint likelihood of all observed events based on convex optimization techniques as follows.\nMaximum Likelihood Estimation. Instead of directly estimating the time-varying coverage function, which is the cumulative intensity function of the counting process, we turn to estimate the intensity function a(S, t) = \u2202\u039b(S, t)/\u2202t. Given m i.i.d. counting processes, D^m := {(S_1, N_1(t)), . . . , (S_m, N_m(t))} up to observation time T, the log-likelihood of the dataset is [11]\n\n$$\\ell(D^m|a) = \\sum_{i=1}^m \\left\\{ \\int_0^T \\{\\log a(S_i, t)\\}\\,dN_i(t) - \\int_0^T a(S_i, t)\\,dt \\right\\}. \\tag{10}$$\n\nMaximizing the log-likelihood with respect to the intensity function a(S, t) then gives us the estimation $\\hat{a}(S, t)$. The W-term random kernel function approximation reduces a function optimization problem to a finite dimensional optimization problem, while incurring only small bias in the estimated function.\n\nAlgorithm 1 TCOVERAGELEARNER\nINPUT: {(S_i, N_i(t))}, i = 1, . . . , m;\nSample W random features \u03c4_1, . . . , \u03c4_W from Q'(\u03c4);\nCompute {t(S_i, \u03c4_w)}, {g_i}, {k(t_ij)}, i \u2208 {1, . . . , m}, w = 1, . . . , W, t_ij < T;\nInitialize w^0 \u2208 \u03a9 = {w \u2265 0, \u2016w\u2016_1 \u2264 1};\nApply projected quasi-newton algorithm [12] to solve (11);\nOUTPUT: $a^K_w(S, t) = \\sum_{i=1}^W w_i K(t - t(S, \\tau_i))$\n\nConvex Optimization. By plugging the parametrization a^K_w(S, t) (9) into the log-likelihood (10), we formulate the optimization problem as:\n\n$$\\min_w \\sum_{i=1}^m \\Big\\{ w^\\top g_i - \\sum_{t_{ij} < T} \\log\\big(w^\\top k(t_{ij})\\big) \\Big\\} \\quad \\text{subject to } w \\geq 0, \\; \\|w\\|_1 \\leq 1, \\tag{11}$$\n\nwhere we define\n\n$$g_{ik} = \\int_0^T K(t - t(S_i, \\tau_k))\\,dt \\quad \\text{and} \\quad k_l(t_{ij}) = K(t_{ij} - t(S_i, \\tau_l)), \\tag{12}$$\n\nand t_ij is the time when the j-th event occurs in the i-th counting process. By treating the normalization constant Z as a free variable which will be tuned by cross validation later, we simply require that \u2016w\u2016_1 \u2264 1. By applying the Gaussian RBF kernel, we can derive a closed form of g_ik and the gradient \u2207\u2113 as\n\n$$g_{ik} = \\frac{1}{2}\\left\\{ \\mathrm{erfc}\\left(-\\frac{t(S_i, \\tau_k)}{\\sqrt{2}h}\\right) - \\mathrm{erfc}\\left(\\frac{T - t(S_i, \\tau_k)}{\\sqrt{2}h}\\right) \\right\\}, \\quad \\nabla\\ell = \\sum_{i=1}^m \\Big\\{ g_i - \\sum_{t_{ij} < T} \\frac{k(t_{ij})}{w^\\top k(t_{ij})} \\Big\\}. \\tag{13}$$\n\nA pleasing feature of this formulation is that it is convex in the argument w, allowing us to apply various convex optimization techniques to solve the problem efficiently. Specifically, we first draw W random features \u03c4_1, . . . , \u03c4_W from Q'(\u03c4). Then, we precompute the jumping time t(S_i, \u03c4_w) for every source set {S_i}_{i=1}^m on each random feature {\u03c4_w}_{w=1}^W. Because in general |S_i| << n, this computation costs O(mW). Based on the achieved m-by-W jumping-time matrix, we preprocess the feature vectors {g_i}_{i=1}^m and k(t_ij), i \u2208 {1, . . . , m}, t_ij < T, which costs O(mW) and O(mLW), respectively, where L is the maximum number of events caused by a particular source set before time T. Finally, we apply the projected quasi-newton algorithm [12] to find the weight w that minimizes the negative log-likelihood of observing the given event data. Because the evaluation of the objective function and the gradient, which costs O(mLW), is much more expensive than the projection onto the convex constraint set, and L << n, the worst case computation complexity is thus O(mnW). Algorithm 1 summarizes the above steps.\nSample Strategy. One important component of our parametrization is to sample W random change points \u03c4 from a known distribution Q'(\u03c4). Because given a set S_i, we can only observe the jumping time of the events in each counting process without knowing the identity of the covered items (which is a key difference from [8]), the best thing we can do is to sample from these historical data. Specifically, let the number of counting processes that a single item s \u2208 V is involved in inducing be N_s, and the collection of all the jumping timestamps before time T be J_s. Then, for the s-th entry of \u03c4, with probability |J_s|/(nN_s), we uniformly draw a sample from J_s; and with probability 1 \u2212 |J_s|/(nN_s), we assign a time much greater than T to indicate that the item will never be covered. Given the very limited information, although this Q'(\u03c4) might be quite different from Q(\u03c4), by drawing a sufficiently large number of samples and adjusting the weights, we expect it can still lead to good results, as illustrated in our experiments later.\n\n6 Sample Complexity\n\nSuppose we use W random features and m training examples to compute an \u03b5_\u2113-MLE solution $\\hat{a}$, i.e., a solution whose log-likelihood is within \u03b5_\u2113 of the maximum over A. The goal is to analyze how well the function $\\hat{f}$ induced by $\\hat{a}$ approximates the true function f. 
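As a reference point for the estimator analyzed in this section, the objective (11) and its gradient (13) from Section 5 can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the dictionary layout of each change-point vector, and the small eps guard inside the logarithm are hypothetical, and the projected quasi-Newton solver [12] that would consume these values is not shown.

```python
import math


def jump_time(source_set, tau):
    # t(S, tau): the first time any source in S covers the sampled element.
    return min(tau[s] for s in source_set)


def gaussian_kernel(t, h):
    # K(t) = k(t / h) / h with k the standard Gaussian density.
    return math.exp(-0.5 * (t / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def g_entry(c, T, h):
    # Closed form of the integral of K(t - c) over [0, T], as in Eq. (13).
    s = math.sqrt(2.0) * h
    return 0.5 * (math.erfc(-c / s) - math.erfc((T - c) / s))


def neg_log_likelihood(w, data, taus, T, h, eps=1e-12):
    # data: list of (source_set, event_times); taus: list of change-point dicts.
    # Returns the objective of Eq. (11) and its gradient with respect to w.
    value = 0.0
    grad = [0.0] * len(w)
    for source_set, events in data:
        centers = [jump_time(source_set, tau) for tau in taus]
        for k, c in enumerate(centers):
            g = g_entry(c, T, h)
            value += w[k] * g            # w^T g_i term
            grad[k] += g
        for t in events:
            if t >= T:                   # only events inside the window count
                continue
            kvec = [gaussian_kernel(t - c, h) for c in centers]
            intensity = sum(wk * kk for wk, kk in zip(w, kvec)) + eps
            value -= math.log(intensity)  # -log(w^T k(t_ij)) term
            for k, kk in enumerate(kvec):
                grad[k] -= kk / intensity
    return value, grad
```

Any first-order method that keeps w nonnegative with 1-norm at most 1 could be run on top of this function; the choice of solver does not change the objective being minimized.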
This section describes the intuition; the complete proof is provided in the appendix. Formally, an \u03b5_\u2113-MLE solution $\\hat{a}$ satisfies\n\n$$\\ell(D^m|\\hat{a}) \\geq \\max_{a' \\in A} \\ell(D^m|a') - \\epsilon_\\ell.$$\n\nA natural choice for connecting the error between f and $\\hat{f}$ with the log-likelihood cost used in MLE is the Hellinger distance [13]. So it suffices to prove an upper bound on the Hellinger distance $h(a, \\hat{a})$ between $\\hat{a}$ and the true intensity a, for which we need to show a high probability bound on the (total) empirical Hellinger distance $\\hat{H}^2(a, a')$ between the two. Here, h and $\\hat{H}$ are defined as\n\n$$h^2(a, a') := \\frac{1}{2} E_S E_t \\Big[\\sqrt{a(S, t)} - \\sqrt{a'(S, t)}\\Big]^2, \\qquad \\hat{H}^2(a, a') := \\frac{1}{2} \\sum_{i=1}^m \\int_0^T \\Big[\\sqrt{a(S_i, t)} - \\sqrt{a'(S_i, t)}\\Big]^2 dt.$$\n\nThe key for the analysis is to show that the empirical Hellinger distance can be bounded by a martingale plus some other additive error terms, which we then bound respectively. This martingale is defined based on our hypotheses and the martingales M_i associated with the counting process N_i:\n\n$$M(t|g) := \\int_0^t g(t)\\,d\\Big(\\sum_i M_i(t)\\Big) = \\sum_{i=1}^m \\int_0^t g(t)\\,dM_i(t),$$\n\nwhere $g \\in G = \\big\\{ g_{a'} = \\frac{1}{2} \\log \\frac{a + a'}{2a} : a' \\in A \\big\\}$. More precisely, we have the following lemma.\n\nLemma 3. Suppose $\\hat{a}$ is an \u03b5_\u2113-MLE. Then\n\n$$\\hat{H}^2(\\hat{a}, a) \\leq 16 M(T; g_{\\hat{a}}) + 4\\Big[\\ell(D^m|a) - \\max_{a' \\in A} \\ell(D^m|a')\\Big] + 4\\epsilon_\\ell.$$\n\nThe right hand side has three terms: the martingale (estimation error), the likelihood gap between the truth and the best one in our hypothesis class (approximation error), and the optimization error. We then focus on bounding the martingale and the likelihood gap.\nTo bound the martingale, we first introduce a notion called (d, d')-covering dimension measuring the complexity of the hypothesis class, generalizing that in [14]. Based on this notion, we prove a uniform convergence inequality, combining the ideas in classic works on MLE [14] and counting processes [15]. Compared to the classic uniform inequality, our result is more general, and the complexity notion has a clearer geometric interpretation and is thus easier to verify. For the likelihood gap, recall that by Lemma 2, there exists a good approximation $\\tilde{a} \\in A$. The likelihood gap is then bounded by that between a and $\\tilde{a}$, which is small since a and $\\tilde{a}$ are close.\nCombining the two leads to a bound on the Hellinger distance based on bounded dimension of the hypothesis class. We then show that the dimension of our specific hypothesis class is at most the number of random features W, and convert $\\hat{H}^2(\\hat{a}, a)$ to the desired \u2113_2 error bound on f and $\\hat{f}$.\n\nTheorem 4. Suppose $W = \\tilde{O}\\Big(Z^2 \\Big[\\big(\\frac{ZT}{\\epsilon}\\big)^{5/2} + \\big(\\frac{ZT}{\\epsilon a_{\\min}}\\big)^{5/4}\\Big]\\Big)$ and $m = \\tilde{O}\\big(\\frac{ZT}{\\epsilon}[W + \\epsilon_\\ell]\\big)$. Then with probability \u2265 1 \u2212 \u03b4 over the random sample of {\u03c4_i}_{i=1}^W, we have that for any 0 \u2264 t \u2264 T,\n\n$$E_S\\big[\\hat{f}(S, t) - f(S, t)\\big]^2 \\leq \\epsilon.$$\n\nThe theorem shows that the number of random functions needed to achieve \u03b5 error is roughly $O(\\epsilon^{-5/2})$, and the sample size is $O(\\epsilon^{-7/2})$. They also depend on a_min, which means with more random functions and data, we can deal with intensities with more extreme values. Finally, they increase with the time T, i.e., it is more difficult to learn the function values at later time points.\n\n7 Experiments\nWe evaluate TCOVERAGELEARNER on both synthetic and real world information diffusion data. We show that our method can be more robust to model misspecification than other state-of-the-art alternatives by learning a temporal coverage function all at once.\n\nFigure 1: MAE of the estimated influence on test data along time with the true diffusion model being (a) continuous-time independent cascade with pairwise Weibull transmission functions, (b) continuous-time independent cascade with pairwise Exponential transmission functions, (c) discrete-time independent cascade model, and (d) linear-threshold cascade model.\n\n7.1 Competitors\nBecause our input data only include pairs of a source set and the temporal information of its triggered events {(S_i, N_i(t))}_{i=1}^m with unknown identity, we first choose the general kernel ridge regression model as the major baseline, which directly estimates the influence value of a source set \u03c7_S by f(\u03c7_S) = k(\u03c7_S)(K + \u03bbI)^{\u22121} y, where k(\u03c7_S) = K(\u03c7_{S_i}, \u03c7_S), and K is the kernel matrix. We discretize the time into several steps and fit a separate model to each of them. 
Between two consecutive time steps, the predictions are simply interpolated. In addition, to further demonstrate the robustness of TCOVERAGELEARNER, we compare it to the two-stage methods which must know the identity of the nodes involved in an information diffusion process to first learn a specific diffusion model based on which they can then estimate the influence. We give them such an advantage and study three well-known diffusion models: (I) Continuous-time Independent Cascade model (CIC) [16, 17]; (II) Discrete-time Independent Cascade model (DIC) [1]; and (III) Linear-Threshold cascade model (LT) [1].\n\n7.2 Influence Estimation on Synthetic Data\nWe generate Kronecker synthetic networks ([0.9 0.5;0.5 0.3]) which mimic real world information diffusion patterns [18]. For CIC, we use both Weibull distribution (Wbl) and Exponential distribution (Exp) for the pairwise transmission function associated with each edge, and randomly set their parameters to capture the heterogeneous temporal dynamics. Then, we use NETRATE [16] to learn the model by assuming an exponential pairwise transmission function. For DIC, we choose the pairwise infection probability uniformly from 0 to 1 and fit the model by [19]. For LT, we assign the edge weight w_uv between u and v as 1/d_v, where d_v is the degree of node v following [1]. Finally, 1,024 source sets are sampled with power-law distributed cardinality (with exponent 2.5), each of which induces eight independent cascades (or counting processes), and the test data contains another 128 independently sampled source sets with the ground truth influence estimated from 10,000 simulated cascades up to time T = 10. Figure 1 shows the MAE (Mean Absolute Error) between the estimated influence value and the true value up to the observation window T = 10. The average influence is 16.02, 36.93, 9.7 and 8.3. 
We use 8,192 random features and two-fold cross validation on the training data to tune the normalization Z, which has the best value 1130, 1160, 1020, and 1090, respectively. We choose the RBF kernel bandwidth $h = 1/\\sqrt{2\\pi}$ so that the magnitude of the smoothed approximate function still equals 1 (or it can be tuned by cross-validation as well), which matches the original indicator function. For the kernel ridge regression, the RBF kernel bandwidth and the regularization \u03bb are both chosen by the same two-fold cross validation. For CIC and DIC, we learn the respective model only once, up to time T.\nFigure 1 verifies that even though the underlying diffusion models can be dramatically different, the prediction performance of TCOVERAGELEARNER is robust to the model changes and consistently outperforms the nontrivial baseline significantly. In addition, even though CIC and DIC are provided with extra information, in Figure 1(a) they do not perform well, because the ground truth is a continuous-time diffusion model with Weibull functions. CIC assumes the right model but the wrong family of transmission functions. In Figure 1(b), we expect CIC to have the best performance because it assumes the correct diffusion model and transmission functions. Yet, TCOVERAGELEARNER still has comparable performance with even less information. In Figure 1(c), although DIC has assumed the correct model, it is hard to determine the correct step size to discretize the time line, and since we only learn the model once up to time T (instead of at each time point), it is harder to fit the whole process. In Figure 1(d), both CIC and DIC have the wrong model, so we see a similar trend as in Figure 1(a). Moreover, for kernel ridge regression, we have to first partition the timeline with an arbitrary step size, fit a model at each time step, and interpolate the values between neighboring time steps. 
Not only do the errors from each stage accumulate in the error of the final prediction, but we also cannot rely on this method to predict the influence of a source set beyond the observation window T.\n\n[Figure 1 plots: four panels of MAE versus time (1 to 10) comparing TCoverageLearner, Kernel Ridge Regression, CIC, and DIC.]\n\n[Figure 2 panels: (a) Average MAE; (b) Features\u2019 Effect; (c) Runtime; (d) Influence maximization.]\n\nFigure 2: (a) Average MAE from time 1 to 10 on seven groups of real cascade data; (b) Improved estimation with increasing number of random features; (c) Runtime in log-log scale; (d) Maximized influence of selected sources on the held-out testing data along time.\n\nOverall, compared to the kernel ridge regression, TCOVERAGELEARNER only needs to be trained once given all the event data up to time T in a compact and principled way, and can then be used to infer the influence of any given source set at any particular time much more efficiently and accurately. In contrast to the two-stage methods, TCOVERAGELEARNER is able to address the more general setting with far fewer assumptions and less information, yet still produces consistently competitive performance.\n\n7.3 Influence Estimation on Real Data\n\nMemeTracker is a real-world dataset [20] for studying information diffusion. The temporal flow of information was traced using quotes, which are short textual phrases spreading through websites. We have selected seven typical groups of cascades with representative keywords such as \u2018apple and jobs\u2019, \u2018tsunami earthquake\u2019, etc., among the top 1,000 most active sites. Each set of cascades is split into 60% training and 40% test data.
Because we can often observe cascades only from a single seed node, we rarely have cascades produced from multiple sources simultaneously. However, because our model can capture the correlation among multiple sources, we challenge TCOVERAGELEARNER with sets of randomly chosen multiple source nodes on the independent hold-out data. Although the generation of sets of multiple source nodes is simulated, the respective influence is calculated from the real test data as follows: given a source set S, for each node u \u2208 S, let C(u) denote the set of cascades generated from u on the testing data. We uniformly sample cascades from C(u). The average length of all sampled cascades is treated as the true influence of S. We draw 128 source sets and report the average MAE along time in Figure 2(a). Again, we observe that TCOVERAGELEARNER has consistent and robust estimation performance across all testing groups. Figure 2(b) verifies that the prediction improves as more random features are used, because the representational power of TCOVERAGELEARNER increases to better approximate the unknown true coverage function. Figure 2(c) indicates that the runtime of TCOVERAGELEARNER scales linearly with the number of random features. Finally, Figure 2(d) shows the application of the learned coverage function to the influence maximization problem along time, which seeks to find a set of source nodes that maximizes the expected number of infected nodes by time T. The classic greedy algorithm [21] is applied to solve the problem, and the influence is calculated and averaged over the seven held-out test sets. It shows that TCOVERAGELEARNER is very competitive with the two-stage methods while making far fewer assumptions.
Because the greedy algorithm mainly depends on the relative rank of the selected sources, the selected sets of sources can be similar even when the estimated influence values differ, so the performance gap is not large.\n\n8 Conclusions\n\nWe propose a new problem of learning temporal coverage functions with a novel parametrization connected with counting processes, and develop an efficient algorithm that is guaranteed to learn such a combinatorial function from only a polynomial number of training samples. Our empirical study also verifies that our method outperforms existing methods consistently and significantly.\n\nAcknowledgments This work was supported in part by NSF grants CCF-0953192, CCF-1451177, CCF-1101283, and CCF-1422910, ONR grant N00014-09-1-0751, AFOSR grant FA9550-09-1-0538, Raytheon Faculty Fellowship, NSF IIS1116886, NSF/NIH BIGDATA 1R01GM108341, NSF CAREER IIS1350983 and Facebook Graduate Fellowship 2014-2015.\n\n[Figure 2 plots: (a) average MAE versus groups of memes (1 to 7); (b) average MAE versus number of random features (128 to 8192); (c) time (s) versus number of random features; (d) influence versus time (1 to 10). Legends: TCoverageLearner, Kernel Ridge Regression, CIC, DIC.]\n\nReferences\n\n[1] D. Kempe, J. Kleinberg, and \u00c9. Tardos. Maximizing the spread of influence through a social network. In SIGKDD, pages 137\u2013146. ACM, 2003.\n\n[2] C. Guestrin, A. Krause, and A. Singh. Near-optimal sensor placements in gaussian processes. In International Conference on Machine Learning ICML\u201905, 2005.\n\n[3] B. Lehmann, D. Lehmann, and N. Nisan. Combinatorial auctions with decreasing marginal utilities. In Proceedings of the 3rd ACM conference on Electronic Commerce, pages 18\u201328. ACM, 2001.\n\n[4] M.F. Balcan and N. Harvey. Learning submodular functions. In Proceedings of the 43rd annual ACM symposium on Theory of computing, pages 793\u2013802. ACM, 2011.\n\n[5] A.
Badanidiyuru, S. Dobzinski, H. Fu, R. D. Kleinberg, N. Nisan, and T. Roughgarden. Sketching valuation functions. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, 2012.\n\n[6] V. Feldman and P. Kothari. Learning coverage functions. arXiv preprint arXiv:1304.2079, 2013.\n\n[7] V. Feldman and J. Vondrak. Optimal bounds on approximation of submodular and xos functions by juntas. In FOCS, 2013.\n\n[8] N. Du, Y. Liang, N. Balcan, and L. Song. Influence function learning in information diffusion networks. In International Conference on Machine Learning (ICML), 2014.\n\n[9] L. Song, M. Kolar, and E. P. Xing. Time-varying dynamic bayesian networks. In Neural Information Processing Systems, pages 1732\u20131740, 2009.\n\n[10] M. Kolar, L. Song, A. Ahmed, and E. P. Xing. Estimating time-varying networks. Ann. Appl. Statist., 4(1):94\u2013123, 2010.\n\n[11] O. Aalen, O. Borgan, and H. Gjessing. Survival and event history analysis: a process point of view. Springer, 2008.\n\n[12] M. Schmidt, E. van den Berg, M. P. Friedlander, and K. Murphy. Optimizing costly functions with simple constraints: A limited-memory projected quasi-newton algorithm. In D. van Dyk and M. Welling, editors, Proceedings of The Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, volume 5, pages 456\u2013463, Clearwater Beach, Florida, April 2009.\n\n[13] Sara van de Geer. Hellinger-consistency of certain nonparametric maximum likelihood estimators. The Annals of Statistics, pages 14\u201344, 1993.\n\n[14] L. Birg\u00e9 and P. Massart. Minimum Contrast Estimators on Sieves: Exponential Bounds and Rates of Convergence. Bernoulli, 4(3), 1998.\n\n[15] Sara van de Geer. Exponential inequalities for martingales, with application to maximum likelihood estimation for counting processes. The Annals of Statistics, pages 1779\u20131801, 1995.\n\n[16] M. Gomez-Rodriguez, D. Balduzzi, and B. Sch\u00f6lkopf.
Uncovering the temporal dynamics of diffusion networks. In Proceedings of the International Conference on Machine Learning, 2011.\n\n[17] N. Du, L. Song, M. Gomez-Rodriguez, and H.Y. Zha. Scalable influence estimation in continuous time diffusion networks. In Advances in Neural Information Processing Systems 26, 2013.\n\n[18] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research, 11(Feb):985\u20131042, 2010.\n\n[19] P. Netrapalli and S. Sanghavi. Learning the graph of epidemic cascades. In SIGMETRICS/PERFORMANCE, pages 211\u2013222. ACM, 2012.\n\n[20] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 497\u2013506. ACM, 2009.\n\n[21] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of the approximations for maximizing submodular set functions. Mathematical Programming, 14:265\u2013294, 1978.\n\n[22] L. Wasserman. All of Nonparametric Statistics. Springer, 2006.\n\n[23] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Neural Information Processing Systems, 2009.\n\n[24] G.R. Shorack and J.A. Wellner. Empirical Processes with Applications to Statistics. Wiley, New York, 1986.\n\n[25] Wing Hung Wong and Xiaotong Shen. Probability inequalities for likelihood ratios and convergence rates of sieve mles. The Annals of Statistics, pages 339\u2013362, 1995.\n\n[26] Kenneth S Alexander. Rates of growth and sample moduli for weighted empirical processes indexed by sets.
Probability Theory and Related Fields, 75(3):379\u2013423, 1987.", "award": [], "sourceid": 1724, "authors": [{"given_name": "Nan", "family_name": "Du", "institution": "Georgia Tech"}, {"given_name": "Yingyu", "family_name": "Liang", "institution": "Princeton University"}, {"given_name": "Maria-Florina", "family_name": "Balcan", "institution": "Georgia Tech"}, {"given_name": "Le", "family_name": "Song", "institution": "Georgia Tech"}]}