{"title": "Tomography of the London Underground: a Scalable Model for Origin-Destination Data", "book": "Advances in Neural Information Processing Systems", "page_first": 3062, "page_last": 3073, "abstract": "The paper addresses the classical network tomography problem of inferring local traffic given origin-destination observations. Focussing on large complex public transportation systems, we build a scalable model that exploits input-output information to estimate the unobserved link/station loads and the users path preferences. Based on the reconstruction of the users' travel time distribution, the model is flexible enough to capture possible different path-choice strategies and correlations between users travelling on similar paths at similar times. The corresponding likelihood function is intractable for medium or large-scale networks and we propose two distinct strategies, namely the exact maximum-likelihood inference of an approximate but tractable model and the variational inference of the original intractable model. As an application of our approach, we consider the emblematic case of the London Underground network, where a tap-in/tap-out system tracks the start/exit time and location of all journeys in a day. 
A set of synthetic simulations and real data provided by Transport for London are used to validate and test the model on the predictions of observable and unobservable quantities.", "full_text": "Tomography of the London Underground: a Scalable Model for Origin-Destination Data

Nicolò Colombo
Department of Statistical Science
University College London
nicolo.colombo@ucl.ac.uk

Ricardo Silva
The Alan Turing Institute and Department of Statistical Science
University College London
ricardo.silva@ucl.ac.uk

Soong Kang
School of Management
University College London
smkang@ucl.ac.uk

Abstract

The paper addresses the classical network tomography problem of inferring local traffic given origin-destination observations. Focusing on large complex public transportation systems, we build a scalable model that exploits input-output information to estimate the unobserved link/station loads and the users' path preferences. Based on the reconstruction of the users' travel time distribution, the model is flexible enough to capture different possible path-choice strategies and correlations between users travelling on similar paths at similar times. The corresponding likelihood function is intractable for medium or large-scale networks and we propose two distinct strategies, namely the exact maximum-likelihood inference of an approximate but tractable model and the variational inference of the original intractable model. As an application of our approach, we consider the emblematic case of the London Underground network, where a tap-in/tap-out system tracks the starting/exit time and location of all journeys in a day. 
A set of synthetic simulations and real data provided by Transport for London are used to validate and test the model on the predictions of observable and unobservable quantities.

1 Introduction

In the last decades, networks have been playing an increasingly important role in our everyday lives [1, 2, 3, 4, 5, 6]. Most of the time, networks cannot be inspected directly and their properties must be reconstructed from end-point or partial and local observations [7, 8]. The problem has been referred to as network 'tomography', a medical word denoting clinical techniques that produce detailed images of the interior of the body from external signals [9, 10]. Nowadays the concept of tomography has gained wider meanings and the idea applies, in different forms, to many kinds of communication and transportation networks [11, 12, 13]. In particular, as the availability of huge amounts of data has grown exponentially, network tomography has become an important branch of statistical modelling [14, 15, 16, 17, 8]. However, due to the complexity of the task, existing methods are usually only designed for small-size networks and become intractable for most real-world applications (see [7, 18] for a discussion on this point). The case of large public transportation networks has attracted special attention since massive datasets of input-output single-user data have been produced by tap-in and tap-out systems installed in big cities such as London, Singapore and Beijing [19, 20, 18, 21].

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Depending on the available measurements, two complementary formulations of network tomography have been considered: (i) the reconstruction of origin-destination distributions from local and partial traffic observations [11, 14, 9, 15, 16] and (ii) the estimation of the link and node loads from input-output information [22, 23, 24]. 
In practice, the knowledge of the unobserved quantities may help design structural improvements of the network or be used to predict the system's behaviour in case of disruptions [25, 26, 13, 27, 28]. Focusing on the second (also referred to as 'dual') formulation of the tomography problem, this paper addresses the challenging case where both the amount of data and the size of the network are large. When only aggregated data are observable, traffic flows over a given network can also be analysed by methods such as collective graphical models for diffusion dynamics [29, 30].

An important real-world application of dual network tomography is reconstructing the traffic of bits sent from a source node to a destination node in a network of servers, terminals and routers. The usual assumption, in those cases, is the tree structure of the network, and models infer the bits' trajectories from a series of local delays, i.e. loss functions defined at each location in the network [22, 23, 24]. The posterior of the travel time distribution at each intermediate position along the path is then used to reconstruct the unobserved local loads, i.e. the number of packets at a given node and time. We extend and apply this general idea to urban public transportation systems. The traffic to be estimated is the flow of people travelling across the system during a day, i.e. the number of people at a given location and time (station/link load). The nodes of the network are (> 100) underground stations, connected via (~ 10) partially overlapping underground 'lines', which can be looked at as interacting 'layers' of connectivity [31]. The observations are single-user records with information about the origin, destination, starting time and exit time of each journey. 
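As a concrete illustration of this representation, the sketch below builds a toy two-line loopy network (the station names "A".."E" and line labels are made up, not the actual London Underground topology), enumerates the simple (non-redundant) paths between an origin-destination pair by depth-first search, and counts the line changes along each path:

```python
def simple_paths(adj, origin, dest, max_len=10):
    """Enumerate simple (non-redundant) paths from origin to dest by DFS."""
    paths = []

    def dfs(node, path):
        if node == dest:
            paths.append(list(path))
            return
        if len(path) >= max_len:
            return
        for nxt in adj.get(node, []):
            if nxt not in path:  # keep the path simple: no repeated stations
                path.append(nxt)
                dfs(nxt, path)
                path.pop()

    dfs(origin, [origin])
    return paths


def line_changes(path, line_of_edge):
    """Count line changes along a path, given an (edge -> line label) map."""
    lines = [line_of_edge[(u, v)] for u, v in zip(path, path[1:])]
    return sum(1 for a, b in zip(lines, lines[1:]) if a != b)


# Toy network: 'line1' runs A-B-C-E, 'line2' runs A-D-C (C is a shared node).
adj = {"A": ["B", "D"], "B": ["C"], "D": ["C"], "C": ["E"]}
line_of_edge = {("A", "B"): "line1", ("B", "C"): "line1", ("C", "E"): "line1",
                ("A", "D"): "line2", ("D", "C"): "line2"}

paths = simple_paths(adj, "A", "E")
# Two feasible paths from A to E: one stays on line1, the other changes line at C.
```

Each enumerated path corresponds to a distinct line-change strategy, which is exactly the latent choice variable introduced in the model outline below.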
Two key unobserved quantities to be estimated are (i) the users' path preferences for a given origin-destination pair [32, 28] and (ii) the station/link loads [33, 34, 35]. Put together, a model for the users' path preferences and a precise estimation of the local train loads can help detect network anomalies or predict the behaviour of the system in case of previously unobserved disruptions [18, 27, 21].

With respect to the classical communication network case, modelling a complex transportation system requires three challenging extensions: (i) the network structure is a multi-layer (loopy) network, where users are allowed to 'change line' at those nodes that are shared by different layers; (ii) the user's choice between many feasible paths follows rules that can go far beyond simple length-related schemes; (iii) harder physical constraints (the train timetable, for example) may create high correlations between users travelling on the same path at similar times. Taking into account such peculiar features of transportation networks, while keeping the model scalable with respect to both the size of the network and the dataset, is the main contribution of this work.

Model outline We represent the transportation system by a sparse graph, where each node is associated with an underground station and each edge with a physical connection between two stations. The full network is the sum of simple sub-graphs (lines) connected by sets of shared nodes (where the users are allowed to change line) [31]. For a given origin-destination pair, there may exist a finite number of possible simple (non-redundant) trajectories, corresponding to distinct line-change strategies. The unobserved user's choice is treated as a latent variable taking values over the set of all feasible paths between the origin and destination. The corresponding probability distribution may depend on the length of the path, i.e. 
the number of nodes crossed by the path, or any other arbitrary feature of the path. In our multi-layer setup, for example, it is natural to include a 'depth' parameter taking into account the number of layers visited, i.e. the number of line changes.

For any feasible path $\gamma = [\gamma_1, \dots, \gamma_\ell]$, the travel time at the intermediate stations is defined by the recursive relation

$$t(\gamma_i) \sim t(\gamma_{i-1}) + \mathrm{Poisson}\big(a(\gamma_{i-1}, s_o + t(\gamma_{i-1}), \gamma)\big), \qquad i = 1, \dots, \ell \quad (1)$$

where $t(x)$ is the travel time at location $x \in \{\gamma_1, \dots, \gamma_\ell\}$, $s_o$ is the starting time, and $a = a(x, s_o + t(x), \gamma)$ are local delays that depend on the location $x$, the absolute time $s_o + t(x)$ and the path $\gamma$. The choice of the Poisson distribution is convenient¹ in this framework due to its simple single-parameter form and the fact that $t(x)$ is an integer in the dataset that motivates this work (travel time is recorded in minutes). The dependence on $\gamma$ allows including global path-related features, such as, for example, an extra delay associated with each line change along the path or the time spent by the user while walking through the origin and destination stations. The dependence on $s_o$ and $t(x)$ is what ensures the scalability of the model, because all users can be treated independently given their starting time. The likelihood associated with all journeys in a day has a factorised form

$$p(t_d^{(1)}, \dots, t_d^{(N)} \mid s_o^{(1)}, \dots, s_o^{(N)}) = \prod_{n=1}^{N} p(t_d^{(n)} \mid s_o^{(n)}) \quad (2)$$

where $t_d^{(n)}$ is the total travel time of the $n$th user, $N$ is the total number of users in a day, and each $p(t_d^{(n)} \mid s_o^{(n)})$ depends only locally on the model parameters, i.e. on the delay functions associated with the nodes crossed by the corresponding path. 

¹ Other options include negative binomial and shifted geometric distributions.
The drawback is that an exact computation of (2) is intractable and one needs approximate inference methods to identify the model parameters from the data.

We address the inference problem in two complementary ways. The first one is a model-approximation method, where we perform the exact inference of the approximate (tractable) model

$$t(\gamma_i) \sim t(\gamma_{i-1}) + \mathrm{Poisson}\big(a(\gamma_{i-1}, s_o + \bar t_{i-1}, \gamma)\big), \qquad i = 1, \dots, \ell \quad (3)$$

where $\bar t_{i-1}$ is a deterministic function of the model parameters that is defined by the difference equation

$$\bar t_i = \bar t_{i-1} + a(\gamma_{i-1}, s_o + \bar t_{i-1}, \gamma), \qquad i = 1, \dots, \ell. \quad (4)$$

The second one is a variational inference approach where we maximise a lower bound of the intractable likelihood associated with (1). In both cases, we use stochastic gradient updates to solve iteratively the corresponding non-convex optimization. Since the closed-form solution of (4) is in general not available, the gradients of the objective functions cannot be computed explicitly. At each iteration, they are obtained recursively from a set of difference equations derived from (4), following a scheme that can be seen as a simple version of the back-propagation method used to train neural networks. Finally, we initialize the iterative algorithms by means of a method-of-moments estimation of the time-independent part of the delay functions. Choosing a random distribution over the feasible paths, this is obtained from the empirical moments of the travel time distribution (of the approximate model (10)) by solving a convex optimization problem.

London underground experiments The predictive power of our model is tested via a series of synthetic and real-world experiments based on the London underground network. All details of the multi-layer structure of the network can be found in [36]. 
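To make the generative mechanism concrete, here is a minimal sketch (stdlib Python only; the per-hop delay function and station names are made up for illustration) of the recursive model (1) next to the deterministic recursion (4), which propagates mean arrival times instead of samples; the Poisson sampler uses Knuth's classic method:

```python
import math
import random


def poisson_sample(lam, rng):
    """Knuth's Poisson sampler; adequate for the small per-hop rates used here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_travel_time(path, s_o, delay, rng):
    """Model (1): each hop adds a Poisson delay whose rate depends on the
    previous station and on the absolute time at which it is reached."""
    t = 0
    for station in path[:-1]:
        t += poisson_sample(delay(station, s_o + t), rng)
    return t


def mean_arrival_times(path, s_o, delay):
    """Recursion (4): propagate deterministic means tbar_i instead of samples,
    which is what makes the approximate model (3) tractable."""
    tbar = [0.0]
    for station in path[:-1]:
        tbar.append(tbar[-1] + delay(station, s_o + tbar[-1]))
    return tbar


# Made-up delay rate: 2 minutes per hop, plus 1 inside a 'rush hour' window.
def delay(station, s):
    return 2.0 + (1.0 if 480 <= s < 540 else 0.0)


rng = random.Random(0)
path = ["A", "B", "C", "E"]                    # hypothetical stations
tbar = mean_arrival_times(path, 600, delay)    # off-peak start: [0, 2, 4, 6]
samples = [sample_travel_time(path, 600, delay, rng) for _ in range(20000)]
```

When the rate is constant along the path, the sampled total travel time is a sum of independent Poisson delays whose mean coincides with the last entry of the deterministic recursion, which is the property the approximate model exploits.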
In the training step we use input-output data that contain the origin, the destination, the starting time and the exit time of each (pseudonymised) user of the system. This kind of data is produced nowadays by tap-in/tap-out smart card systems such as the Oyster Card system in London [19]. The trained models can then be used to predict the unobserved number of people travelling through a given station at a given time in the day, as well as the users' path preferences for a given origin-destination pair. In the synthetic experiments, we compared the model estimations with the values produced by the 'ground truth' (a set of random parameters used to generate the synthetic data) and tested the performance of the two proposed inference methods. In the real-world experiment, we used original pseudonymised data provided by Transport for London. The dataset consisted of more than 500000 origin-destination records, from journeys realised in a single day on the busiest part of the London underground network (Zones 1 and 2, see [36]), and a subset of NetMIS records [37] from the same day. NetMIS data contain realtime information about the trains transiting through a given station and, for a handful of major underground stations (all of them on the Victoria line), include quantitative estimations of the realtime train weights. The latter can be interpreted as a proxy of the realtime (unobserved) number of people travelling through the corresponding nodes of the network and used to evaluate the model's predictions in a quantitative way. The model has also been tested on an out-of-sample Oyster Card dataset by comparing expected and observed travel times between a selection of station pairs. 
Unfortunately, we are not aware of any existing algorithm that could be applicable for a fair comparison in similar settings.

2 Travel time model

Let $o$, $d$ and $s_o$ be the origin, the destination and the starting time of a user travelling through the system. Let $\Gamma_{od}$ be the set of all feasible paths between $o$ and $d$. Then the probability of observing a travel time $t_d$ is a mixture of probability distributions

$$p(t_d) = \sum_{\gamma \in \Gamma_{od}} p_{\mathrm{path}}(\gamma)\, p(t_d \mid \gamma), \qquad p_{\mathrm{path}}(\gamma) = \frac{e^{-L(\gamma)}}{\sum_{\gamma' \in \Gamma_{od}} e^{-L(\gamma')}} \quad (5)$$

where the conditional $p(t_d \mid \gamma)$ can be interpreted as the travel time probability over a particular path, $p_{\mathrm{path}}(\gamma)$ is the probability of choosing that particular path and $L(\gamma)$ is some arbitrary 'effective length' of the path $\gamma$. According to (1), the conditional probabilities $p(t_d \mid \gamma)$ are complicated convolutions of Poisson distributions. An equivalent but more intuitive formulation is

$$t_d = \sum_{i=2}^{\ell(\gamma)} r_i, \qquad r_i \sim \mathrm{Poisson}\Big(a\big(\gamma_{i-1},\, s_o + \textstyle\sum_{k=2}^{i-1} r_k,\, \gamma\big)\Big), \qquad \gamma \sim P_{\mathrm{path}}(L(\gamma)) \quad (6)$$

where the travel time $t_d$ is explicitly expressed as the sum of the local delays, $r_i = t(\gamma_i) - t(\gamma_{i-1})$, along a feasible path $\gamma \in \Gamma_{od}$. Since the time at the intermediate positions, i.e. $t(\gamma_i)$ for $i \neq 1, \ell$, is not observed, the local delays $r_2, \dots, r_{\ell(\gamma)}$ are treated as hidden variables. Letting $\bar\ell = \max_{\gamma \in \Gamma_{od}} \ell(\gamma)$, the complete likelihood is

$$p(r_1, \dots, r_{\bar\ell}, \gamma) = p(r_1, \dots, r_{\bar\ell} \mid \gamma)\, p_{\mathrm{path}}(\gamma), \qquad p(r_1, \dots, r_{\bar\ell} \mid \gamma) = \prod_{i=2}^{\bar\ell} \frac{e^{-\lambda_i} \lambda_i^{r_i}}{r_i!} \quad (7)$$

where $\lambda_i = a(\gamma_{i-1}, s_o + \sum_{k=2}^{i-1} r_k, \gamma)$ if $i \le \ell(\gamma)$ and $\lambda_i = 0$ if $i > \ell(\gamma)$. Marginalizing over all hidden variables one obtains the explicit form of the conditional probability distributions in the mixture (5), i.e.

$$p(t_d \mid \gamma) = \sum_{r_2=0}^{\infty} \cdots \sum_{r_{\bar\ell}=0}^{\infty} \delta\Big(t_d - \sum_{i=2}^{\bar\ell} r_i\Big) \prod_{i=2}^{\bar\ell} \frac{e^{-\lambda_i} \lambda_i^{r_i}}{r_i!} \quad (8)$$

Since $\lambda_i = \lambda_i(r_{i-1}, \dots, r_2)$ for each $i = 2, \dots, \ell$, the evaluation of each conditional probability requires performing an $(\ell - 1)$-dimensional infinite sum, which is numerically intractable and makes an exact maximum likelihood approach infeasible.²

3 Inference

An exact maximum likelihood estimation of the model parameters in $a(x, s, \gamma)$ and $L(\gamma)$ is infeasible due to the intractability of the evidence (8). One possibility is to use a Monte Carlo approximation of the exact evidence (8) by sampling from the nested Poisson distributions. In this section we propose two alternative methods that do not require sampling from the target distribution. The first method is based on the exact inference of an approximate but tractable model. The latter depends on the same parameters as the original one (the 'reference' model (6)) but is such that the local delays become independent given the path and the starting time. 
The second approach consists of an approximate variational inference of (6), with the variational posterior distribution defined in terms of the deterministic model (4).

3.1 Exact inference of an approximate model

We consider the approximation of the reference model (6) defined by

$$r_i \sim \mathrm{Poisson}\big(a(\gamma_{i-1}, s_o + \bar t_{i-1}, \gamma)\big), \qquad \gamma \sim P_{\mathrm{path}}(L(\gamma)), \qquad t_d = \sum_{i=2}^{\ell(\gamma)} r_i \quad (9)$$

where the $\bar t_i$ are obtained recursively from (4). In this case, the $\ell(\gamma) - 1$ local delays $r_i$ are decoupled and the complete likelihood is given by

$$p(r_1, \dots, r_{\bar\ell}, \gamma) = p(r_1, \dots, r_{\bar\ell} \mid \gamma)\, p_{\mathrm{path}}(\gamma), \qquad p(r_1, \dots, r_{\bar\ell} \mid \gamma) = \prod_{i=2}^{\bar\ell} \frac{e^{-\lambda_i} \lambda_i^{r_i}}{r_i!} \quad (11)$$

where $\lambda_i = a(\gamma_{i-1}, s_o + \bar t_{i-1}(\gamma), \gamma)$ if $i \le \ell(\gamma)$ and $\lambda_i = 0$ if $i > \ell(\gamma)$. Noting that $t_d$ is the sum of independent Poisson random variables, we have

$$p(t_d) = \sum_{\gamma \in \Gamma_{od}} p_{\mathrm{path}}(\gamma) \sum_{r_2=0}^{t_d} \cdots \sum_{r_{\bar\ell}=0}^{t_d} \delta\Big(t_d - \sum_{i=2}^{\bar\ell} r_i\Big) \prod_{i=2}^{\bar\ell} \frac{e^{-\lambda_i} \lambda_i^{r_i}}{r_i!} = \sum_{\gamma \in \Gamma_{od}} p_{\mathrm{path}}(\gamma)\, \frac{e^{-\bar t_{\bar\ell}}\, \bar t_{\bar\ell}^{\,t_d}}{t_d!} \quad (12)$$

where we have used $\sum_{i=2}^{\bar\ell} \lambda_i = \bar t_{\bar\ell}$. The parameters in the model functions $a$ and $L$ can then be identified with the solution of the following non-convex maximization problem

$$\max_{a, L} \sum_{o=1}^{D} \sum_{d=1}^{D} \sum_{s_o=0}^{T-1} \sum_{s_d=s_o}^{T} N(o, d, s_o, s_d) \log p(s_d - s_o) \quad (13)$$

where $N(o, d, s_o, s_d)$ is the number of users travelling from $o$ to $d$ with entry and exit times $s_o$ and $s_d$ respectively.

3.2 Variational inference of the original model

We define the approximate posterior distribution

$$q(r, \gamma) = q(r \mid \gamma)\, q_{\mathrm{path}}(\gamma), \qquad q_{\mathrm{path}}(\gamma) = \frac{e^{-\tilde L(\gamma, t_d)}}{\sum_{\gamma' \in \Gamma_{od}} e^{-\tilde L(\gamma', t_d)}}, \qquad q(r \mid \gamma) = p_{\mathrm{multi}}(r; t_d, \eta) \quad (14)$$

where we have defined $r = [r_2, \dots, r_{\bar\ell}]$, $\eta_i = \frac{\bar t_i - \bar t_{i-1}}{\bar t_{\bar\ell}}$ (with $\bar t_i = \bar t_{i-1}$ for all $\ell(\gamma) < i \le \bar\ell$), $p_{\mathrm{multi}}(r; t_d, \eta) = \delta\big(t_d - \sum_{i=2}^{\bar\ell} r_i\big)\, t_d! \prod_{i=2}^{\bar\ell} \frac{\eta_i^{r_i}}{r_i!}$, and the function $\tilde L(\gamma, t_d)$ depends on the path $\gamma$ and the observed travel time $t_d$. Except for the corrected length $\tilde L(\gamma, t_d)$, the variational distribution (14) shares the same parameters over all data points and can be used directly to evaluate the likelihood lower bound (ELBO) $\mathcal{L} = E_q(\log p(t_d)) - E_q(\log q)$.³ One has

$$\mathcal{L}(o, d, s_o, t_d) = -\log t_d! + \sum_{\gamma \in \Gamma_{od}} q_{\mathrm{path}}(\gamma) \log \frac{p_{\mathrm{path}}(\gamma)}{q_{\mathrm{path}}(\gamma)} + \sum_{\gamma \in \Gamma_{od}} q_{\mathrm{path}}(\gamma) \sum_{i=2}^{\bar\ell} \mathcal{L}_i(\gamma), \qquad \mathcal{L}_i(\gamma) = \sum_{r_2=1}^{t_d} \cdots \sum_{r_{\bar\ell}=1}^{t_d} p_{\mathrm{multi}}(r; t_d, \eta)\Big(-\lambda_i + r_i \log \frac{\lambda_i}{\eta_i}\Big) \quad (15)$$

with $\lambda_i = a(\gamma_{i-1}, s_o + \sum_{k=2}^{i-1} r_k)$ and $\eta_i = \frac{a(\gamma_{i-1}, s_o + \bar t_{i-1})}{\bar t_{\bar\ell}}$ if $i \le \ell(\gamma)$, and $\lambda_i = 0 = \eta_i$ if $i > \ell(\gamma)$.

The exact evaluation of each $\mathcal{L}_i(\gamma)$ is still intractable due to the multidimensional sum. However, since for any $\gamma$ and $i = 2, \dots, \ell$, $\lambda_i$ depends only on the 'previous' delays, we can define

$$\lambda_i = a(\gamma_{i-1}, s_o + r_{\mathrm{past}}), \qquad \eta_{\mathrm{past}} = \frac{\bar t_{i-1}}{\bar t_{\bar\ell}}, \qquad \eta_{\mathrm{future}} = \frac{\bar t_{\bar\ell} - \bar t_i}{\bar t_{\bar\ell}} \quad (16)$$

where $r_{\mathrm{past}} = r_2 + \cdots + r_{i-1}$ and $r_{\mathrm{future}} = r_{i+1} + \cdots + r_{\bar\ell}$, and by the grouping property of the multinomial distribution we obtain

$$\mathcal{L}_i(\gamma) = \sum_{r_{\mathrm{past}}=1}^{t_d} \sum_{r_i=1}^{t_d} \sum_{r_{\mathrm{future}}=1}^{t_d} p_{\mathrm{multi}}(r^{(i)}; t_d, \eta^{(i)})\Big(-\lambda_i + r_i \log \frac{\lambda_i}{\eta_i}\Big) \quad (17)$$

where $r^{(i)} = [r_{\mathrm{past}}, r_i, r_{\mathrm{future}}]$ and $\eta^{(i)} = [\eta_{\mathrm{past}}, \eta_i, \eta_{\mathrm{future}}]$. Every $\mathcal{L}_i(\gamma)$ can now be computed in $O(t_d^3)$ operations and the model parameters identified with the solution of the following non-convex optimization problem

$$\max_{a, L, \tilde L} \sum_{o=1}^{D} \sum_{d=1}^{D} \sum_{s_o=0}^{T-1} \sum_{s_d=s_o}^{T} N(o, d, s_o, s_d)\, \mathcal{L}(o, d, s_o, s_d - s_o) \quad (18)$$

² An exact evaluation of the moments $\langle t_d^n \rangle = \sum_{t=0}^{\infty} t^n p(t) = \sum_{\gamma \in \Gamma_{od}} p_{\mathrm{path}}(\gamma) \sum_{r_2=0}^{\infty} \cdots \sum_{r_{\bar\ell}=0}^{\infty} \big(\sum_{i=2}^{\bar\ell} r_i\big)^n \prod_{i=2}^{\bar\ell} \frac{e^{-\lambda_i} \lambda_i^{r_i}}{r_i!}$ (10) is also intractable.
³ Similar 'amortised' approaches have been used elsewhere to make the approximate inference scalable [38, 39].

Figure 1: On the left, stochastic iterative solution of (18) (VI) and (13) (ML) for the synthetic dataset. At each iteration, the prediction error is obtained on a small out-of-sample dataset. 
On the right, distance from the ground-truth of the uniform distribution (x-axis) and the models' path probability (y-axis) for various origin-destination pairs. In the legend box, total distance from the ground-truth.

Stochastic gradient descent Both (13) and (18) consist of $O(D^2 T^2)$ terms and the estimation of the exact gradient at each iteration can be expensive for large networks $D \gg 1$ or fine time resolutions $T \gg 1$. A common practice in this case is to use a stochastic approximation of the gradient where only a random selection of origin-destination pairs and starting times is used. Note that each $\mathcal{L}(o, d, s_o, t_d)$ depends on $a(x, s, \gamma)$ only if the location $x$ is crossed by at least one of the feasible paths between $o$ and $d$.

Initialization The analytic form of the first moments of (12), $\langle t_d \rangle_{s_o} = \sum_{t_d=1}^{\infty} t_d\, p(t_d) = \sum_{\gamma \in \Gamma_{od}} p_{\mathrm{path}}(\gamma)\, \bar t_{\ell(\gamma)}$, can be used to obtain a partial initialization of the iterative algorithms via a simple moment-matching technique. We assume that, averaging over all possible starting times, the system behaves like a simple communication network with constant delays at each node or, equivalently, that $a(x, s, \gamma) = \alpha(x) + V(x, s, \gamma)$, with $\sum_{s=0}^{T} V(x, s, \gamma) = 0$. In this case an initialization of $\alpha(x)$ is obtained by solving

$$\min_{\alpha} \sum_{o=1}^{D} \sum_{d=1}^{D} \Big( t_{od} - \sum_{\gamma \in \Gamma_{od}} p_{\mathrm{path}}(\gamma) \sum_{k=1}^{\ell(\gamma)-1} \alpha(\gamma_k) \Big)^2 \quad (19)$$

where $t_{od} = \frac{1}{Z} \sum_{s_o=0}^{T-1} \sum_{s_d=s_o}^{T} N(o, d, s_o, s_d)(s_d - s_o)$, with $Z = \sum_{s_o=0}^{T-1} \sum_{s_d=s_o}^{T} N(o, d, s_o, s_d)$, is the 'averaged' empirical moment computed from the data. Note that (19) is convex for any fixed choice of $p_{\mathrm{path}}(\gamma)$.

Total derivatives All terms in (13) and (18) are of the form $g = g(\xi, \bar t_i)$, where $\xi$ denotes the model parameters and $\bar t_i = \bar t_i(\xi)$ is defined by the difference equation (4). Since $\bar t_i$ is not available as an explicit function of $\xi$, it is not possible to write $g = g(\xi)$ or compute its gradient $\nabla_\xi g$ directly. A way out is to compute the total derivative of the function $g$ with respect to $\xi$, i.e.

$$\frac{dg(\xi, \bar t_i)}{d\xi} = \frac{\partial g(\xi, \bar t_i)}{\partial \xi} + \frac{\partial g(\xi, \bar t_i)}{\partial \bar t_i} \frac{d\bar t_i}{d\xi} \quad (20)$$

where $\frac{d\bar t_i}{d\xi}$, for $i = 1, \dots, \ell$, can be obtained from the iterative integration of

$$\frac{d\bar t_i}{d\xi} = \frac{d\bar t_{i-1}}{d\xi} + \frac{\partial a(x, s, \gamma)}{\partial \xi} + \frac{\partial a(x, s, \gamma)}{\partial s}\bigg|_{s = s_o + \bar t_{i-1}} \frac{d\bar t_{i-1}}{d\xi}, \qquad i = 1, \dots, \ell \quad (21)$$

which is implied by (4).

4 Experiments

The method described in the previous sections is completely general and, except for the initialization step, no special form of the model functions is assumed. In order to capture a few key features of

Figure 2: On the left, travel time predicted by the VI model (in blue) and the ML model (in red) of Figure 1 and the ground-truth model (in green) plotted against the starting time for a selection of origin-destination pairs. 
In the legend box, normalised total distance ($\|v_{\mathrm{exp}} - v_{\mathrm{true}}\| / \|v_{\mathrm{true}}\|$) between the model's and the ground-truth's predictions. On the right, station loads predicted by the ground-truth (in green) and the VI model (in blue) and ML model (in red) of Figure 1. The three models and a reduced dataset of N = 10000 true origin, destination and starting time records have been used to simulate the trajectories of N synthetic users. For each model, the N simulated trajectories give the users' expected positions at all times (the position is set to 0 if the user is not yet in the system or has finished the journey), which have been used to compute the total number of people being at a given station at a given time. The reported score is the total distance between the model's and the ground-truth's normalised predictions. For station $x$, the normalised load vector is $v_x / \mathbf{1}^T v_x$, where $v_x(s)$ is the number of people being at station $x$ at time $s$.

a large transportation system and apply the model to the tomography of the London underground, we have chosen the specific parametrization of the functions $L(\gamma)$ and $a(x, s, \gamma)$ given in Section 4.1. The parametrised model has then been trained and tested on a series of synthetic and real-world datasets as described in Section 4.2.

4.1 Parametrization

For each origin $o$ and destination $d$, we have reduced the set of all feasible paths, $\Gamma_{od}$, to a small set including the shortest path and a few perturbations of the shortest path (obtained by forcing different choices at the line-change points). Let $C(\gamma) \in \{0, 1\}^\ell$ be such that $C(\gamma_i) = 1$ if the user changes line at $\gamma_i$ and zero otherwise. To parametrize the path probability (5) we chose $L(\gamma) = \beta_1 \ell(\gamma) + \beta_2 c(\gamma)$, where $\ell(\gamma) = |\gamma|$, $c(\gamma) = \sum_i C(\gamma_i)$ and $\beta_1, \beta_2 \in \mathbb{R}$ are free parameters. The posterior-corrected effective length $\tilde L(\gamma, t_d)$ in (14) was defined as

$$\tilde L(\gamma) = \tilde\beta_\ell\, \ell(\gamma) + \tilde\beta_c\, c(\gamma), \qquad \tilde\beta_i = \theta_{i1} + \theta_{i2} u + \theta_{i3} u^{-1}, \qquad u = \hat t_d^{-2} (t_d - \hat t_d)^2, \qquad i = \ell, c \quad (22)$$

where $t_d$ is the observed travel time, $\hat t_d = \sum_{o,d,s_o,s_d} N(o, d, s_o, s_d)(s_d - s_o)$, and $\theta_{ij} \in \mathbb{R}$, $i = \ell, c$, $j = 1, 2, 3$, are extra free parameters. A regularization term $\lambda(\|\beta\|^2 + \sum_{i=\ell,c} \|\theta_i\|^2)$, with $\lambda = 1/80$, has been added to help the convergence of the stochastic algorithm. We let the local time-dependent delay at location $x$ and time $s$ be $a(x, s, \gamma) = \mathrm{softplus}(\alpha(x) + V(x, s) + W(x, \gamma))$ with

$$V = \sum_{i=1}^{N_\omega} \sum_{j=1}^{N_\phi} \sigma_{ij}(x) \cos(\omega_i s + \phi_j), \qquad W = \sum_{i=1}^{\ell} \rho(x)\, \delta_{x, \gamma_i} C(\gamma_i) + \eta(x)\, (\delta_{x, \gamma_1} + \delta_{x, \gamma_\ell}) \quad (23)$$

where $\alpha(x), \rho(x), \eta(x) \in \mathbb{R}$ and $\sigma(x) \in \mathbb{R}^{N_\omega \times N_\phi}$ are free parameters and $\{\omega_1, \dots, \omega_{N_\omega}\}$ and $\{\phi_1, \dots, \phi_{N_\phi}\}$ are two sets of library frequencies and phases. In the synthetic simulation, we have restricted the London underground network [36] to Zone 1 (63 stations), chosen $N_\omega = 5 = N_\phi$ and set $W = 0$. 
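As an illustration of the path-choice part of this parametrization, the sketch below evaluates the softmax probabilities (5) with effective length $L(\gamma) = \beta_1 \ell(\gamma) + \beta_2 c(\gamma)$ for two hypothetical paths (the feature values and parameter settings are made up, not fitted values), showing how the line-change penalty $\beta_2$ can flip the preferred path:

```python
import math


def path_probabilities(features, beta_len, beta_chg):
    """Softmax over feasible paths, eq. (5), with effective length
    L(gamma) = beta_len * n_stations + beta_chg * n_line_changes."""
    scores = [math.exp(-(beta_len * n_stations + beta_chg * n_changes))
              for n_stations, n_changes in features]
    z = sum(scores)
    return [s / z for s in scores]


# Two hypothetical paths as (number of stations, number of line changes).
features = [(6, 0),   # longer, but direct
            (5, 1)]   # shorter, with one line change

p_no_penalty = path_probabilities(features, beta_len=0.3, beta_chg=0.0)
p_penalised = path_probabilities(features, beta_len=0.3, beta_chg=1.0)
# With no change penalty the shorter path is preferred; a large enough
# penalty shifts the probability mass to the direct path.
```

This is the mechanism by which the fitted $\beta_2$ lets the model express preferences that go beyond pure path length.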
For the real-data experiments we have considered Zones 1 and 2 (131 stations), $N_\omega = 10$, $N_\phi = 5$ and $W \neq 0$.

[Figure 2 plot data not recoverable; panels: expected travel time vs starting time for Kings Cross LU to Oxford Circus, Oxford Circus to Waterloo LU, Waterloo LU to Paddington LU, Paddington LU to Kings Cross LU; station loads for Kings Cross LU, Oxford Circus, Paddington LU.]

Figure 3: Travel times predicted by a random model (top), the initialization model (middle) obtained from (19) and the ML model (bottom) are scattered against the observed travel times of an out-of-sample test dataset (real data). The plots in the first three columns show the prediction error of each model on three subsets of the test sample, S_short (first column), S_medium (second column) and S_long (third column), consisting respectively of short, medium-length and long journeys. 
The plots in the last column show the prediction error of each model on the whole test dataset S_all = S_short + S_medium + S_long. The reported score is the relative prediction error for the corresponding model and subset of journeys, defined as $\|v_{\mathrm{exp}} - v_{\mathrm{true}}\| / \|v_{\mathrm{true}}\|$, with $v_{\mathrm{exp}}(n)$ and $v_{\mathrm{true}}(n)$ being the expected and observed travel times for the $n$th journey in $S_i$, $i \in \{$short, medium, long, all$\}$.

Figure 4: Station loads obtained from NetMIS data (in blue) and predicted by the model (in red). NetMIS data contain information about the time period during which a train was at the station and an approximate weight-score of the train. At time $s$, a proxy of the load at a given station is obtained by summing the scores of all trains present at that station at time $s$. To make the weight scores and the model predictions comparable we have divided both quantities by the area under the corresponding plots (proportional to the number of people travelling through the selected stations during the day). 
The reported score is the relative prediction error ‖vexp − vtrue‖/‖vtrue‖, with vexp(s) being the (normalised) expected number of people at the station at time s and vtrue(s) the (normalised) weight-score obtained from the NetMIS data.

[Figure 3: scatter plots of predicted against true travel times for the random, initialization and ML models. Figure 4: observed and predicted loads over the day at Euston LU, Finsbury Park, Green Park, Kings Cross LU, Oxford Circus, Stockwell, Victoria LU, Warren Street, Pimlico and Vauxhall LU.]

4.2 Methods and discussion

Synthetic and real-world numerical experiments have been performed to: (i) understand how reliable the proposed approximation method is compared to a more standard approach (variational inference), (ii) provide quantitative tests of our inference algorithm on the prediction of key unobservable quantities from a ground-truth model and (iii) assess the scalability and applicability of our method by modelling the traffic of a large-scale real-world system. Both synthetic and real-world experiments were based on the London underground network [36]. Synthetic data were generated from the true origins, destinations and starting times by simulating the trajectories with the ground-truth (random) model described in Section 4.1. On such a dataset, we have compared the training performance of the variational inference and maximum-likelihood approaches by measuring the prediction error on an out-of-sample dataset at each stochastic iteration (Figure 1, right). The two trained models have then been tested against the ground truth on predicting (i) the total travel time (Figure 2, left), (ii) the shape of the users' path preferences (Figure 1, right) and (iii) the local loads (Figure 2, right).
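For concreteness, the relative prediction-error score used throughout the figures, together with the area normalisation applied to the load curves, can be sketched as follows (an illustrative Python snippet; the function names and toy numbers are ours, not from the paper):

```python
import numpy as np

def relative_error(v_exp, v_true):
    """Relative prediction error ||v_exp - v_true|| / ||v_true||
    (the score reported in Figures 3 and 4)."""
    v_exp = np.asarray(v_exp, dtype=float)
    v_true = np.asarray(v_true, dtype=float)
    return np.linalg.norm(v_exp - v_true) / np.linalg.norm(v_true)

def normalise_load(load, times):
    """Divide a load curve by the (trapezoidal) area under it, so that
    NetMIS weight-scores and model predictions become comparable."""
    load = np.asarray(load, dtype=float)
    times = np.asarray(times, dtype=float)
    area = np.sum(0.5 * (load[1:] + load[:-1]) * np.diff(times))
    return load / area

# Toy usage with made-up travel times (minutes):
err = relative_error([11.0, 18.0, 33.0], [10.0, 20.0, 30.0])  # = 0.1
```

The same score applies both to vectors of journey travel times (Figure 3) and to discretised daily load curves (Figure 4); only the normalisation step differs between the two cases.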
In the real-world experiments, we have trained the model on a dataset of smart-card origin-destination data (pseudonymised Oyster Card records from 21st October 2013 provided by Transport for London4) and then tested the prediction of the total travel time on a small out-of-sample set of journeys (Figure 3). In this case we have compared the model predictions with an indirect estimation obtained from NetMIS records of the same day (Figure 4). NetMIS data contain a partial reconstruction of the actual positions and weights of the trains, and it is possible to combine them to estimate the load of a given station at any given time in the day. Since full train information was recorded on only one of the 11 underground lines of the network (the Victoria Line), we have restricted the comparison to a small set of stations.
The two inference methods (VI for (18) and ML for (13)) obtained good and statistically similar scores on recovering the ground-truth model predictions (Figure 2). ML was trained orders of magnitude faster and was almost as accurate as VI at reproducing the users' path preferences (see Figure 1). Since the performance of ML and VI proved statistically equivalent, only ML has been used in the real-data experiments. On the prediction of out-of-sample travel times, ML outperformed both a random model and the constant model used for the initialization (a(x, s, γ) = α(x), with α(x) obtained from (19) with uniform ppath). In particular, when all journeys in the test dataset are considered, ML outperforms the baseline method with a 24% improvement. The only sub-case where ML does worse (8% less accurate) is the small subset of long journeys (see Figure 3). These are journeys where (i) something unusual happens to the user or (ii) the user visits a lot of stations.
In the latter case, a constant-delay model (such as our initialization model) may perform well because we can expect some averaging of the time variability across all visited stations. Figure 4 shows that ML was able to reproduce the shape and relative magnitude of the 'true' time distributions obtained from the NetMIS data. For a more quantitative comparison, we have computed the normalised distance (reported on top of the red plots in Figure 4) between observed and predicted loads over the day.

5 Conclusions

We have proposed a new scalable method for the tomography of large-scale networks from input-output data. Based on the prediction of the users' travel time, the model allows an estimation of the unobserved path preferences and station loads. Since the original model is intractable, we have proposed and compared two different approximate inference schemes. The model has been tested on both synthetic and real data from the London underground. On synthetic data, we have trained two distinct models with the proposed approximate inference techniques and compared their performance against the ground truth. Both of them could successfully reproduce the outputs of the ground-truth model on observable and unobservable quantities. Trained on real data via stochastic gradient descent, the model outperforms a simple constant-delay model on predicting out-of-sample travel times and produces reasonable estimates of the unobserved station loads. In general, the training step could be made more efficient by a careful design of the mini-batches used in the stochastic optimization.
More precisely, since each term in (13) or (18) involves only a very restricted set of parameters (depending on the set of feasible paths between the corresponding origin and destination), the inference could be radically improved by stratified sampling techniques as described for example in [40, 41, 42].

4 The data shown in Figures 3 and 4 are not publicly available, but a reduced database containing similar records can be downloaded from [19].

Acknowledgments

We thank Transport for London for kindly providing access to data. This work has been funded by an EPSRC grant EP/N020723/1. RS also acknowledges support by The Alan Turing Institute under the EPSRC grant EP/N510129/1 and the Alan Turing Institute-Lloyd's Register Foundation programme on Data-Centric Engineering.

References

[1] Everett M Rogers and D Lawrence Kincaid. Communication networks: toward a new paradigm for research. 1981.

[2] Stanley Wasserman and Katherine Faust. Social network analysis: Methods and applications, volume 8. Cambridge University Press, 1994.

[3] Michael GH Bell and Yasunori Iida. Transportation network analysis. 1997.

[4] Mark EJ Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003.

[5] Mark Newman, Albert-Laszlo Barabasi, and Duncan J Watts. The structure and dynamics of networks. Princeton University Press, 2011.

[6] Nicholas A Christakis and James H Fowler. Social contagion theory: examining dynamic social networks and human behavior. Statistics in Medicine, 32(4):556–577, 2013.

[7] Mark Coates, Alfred Hero, Robert Nowak, and Bin Yu. Large scale inference and tomography for network monitoring and diagnosis. IEEE Signal Processing Magazine, 2001.

[8] Edoardo M Airoldi and Alexander W Blocker. Estimating latent processes on a network from indirect measurements. Journal of the American Statistical Association, 108(501):149–164, 2013.

[9] Yehuda Vardi.
Network tomography: Estimating source-destination traffic intensities from link data. Journal of the American Statistical Association, 91(433):365–377, 1996.

[10] Rui Castro, Mark Coates, Gang Liang, Robert Nowak, and Bin Yu. Network tomography: Recent developments. Statistical Science, pages 499–517, 2004.

[11] Luis G Willumsen. Estimation of an OD matrix from traffic counts: a review. 1978.

[12] Nathan Eagle, Alex Sandy Pentland, and David Lazer. Inferring friendship network structure by using mobile phone data. Proceedings of the National Academy of Sciences, 106(36):15274–15278, 2009.

[13] Yu Zheng, Licia Capra, Ouri Wolfson, and Hai Yang. Urban computing: concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 5(3):38, 2014.

[14] Robert J Vanderbei and James Iannone. An EM approach to OD matrix estimation. Technical Report SOR 94-04, Princeton University, 1994.

[15] Claudia Tebaldi and Mike West. Bayesian inference on network traffic using link count data. Journal of the American Statistical Association, 93(442):557–573, 1998.

[16] Jin Cao, Drew Davis, Scott Vander Wiel, and Bin Yu. Time-varying network tomography: router link data. Journal of the American Statistical Association, 95(452):1063–1075, 2000.

[17] Yolanda Tsang, Mark Coates, and Robert Nowak. Nonparametric internet tomography. In Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, volume 2, pages II–2045. IEEE, 2002.

[18] Ricardo Silva, Soong Moon Kang, and Edoardo M Airoldi. Predicting traffic volumes and estimating the effects of shocks in massive transportation systems. Proceedings of the National Academy of Sciences, 112(18):5643–5648, 2015.

[19] Transport For London. Official website.
https://tfl.gov.uk/.

[20] Camille Roth, Soong Moon Kang, Michael Batty, and Marc Barthélemy. Structure of urban movements: polycentric activity and entangled hierarchical flows. PLoS ONE, 6(1):e15923, 2011.

[21] Chen Zhong, Michael Batty, Ed Manley, Jiaqiu Wang, Zijia Wang, Feng Chen, and Gerhard Schmitt. Variability in regularity: Mining temporal mobility patterns in London, Singapore and Beijing using smart-card data. PLoS ONE, 11(2):e0149222, 2016.

[22] Ramón Cáceres, Nick G Duffield, Joseph Horowitz, and Donald F Towsley. Multicast-based inference of network-internal loss characteristics. IEEE Transactions on Information Theory, 45(7):2462–2480, 1999.

[23] Mark J Coates and Robert David Nowak. Network loss inference using unicast end-to-end measurement. In ITC Conference on IP Traffic, Modeling and Management, pages 28–1, 2000.

[24] F Lo Presti, Nick G Duffield, Joseph Horowitz, and Don Towsley. Multicast-based inference of network-internal delay distributions. IEEE/ACM Transactions on Networking, 10(6):761–775, 2002.

[25] Llewellyn Michael Kraus Boelter and Melville Campbell Branch. Urban planning, transportation, and systems analysis. Proceedings of the National Academy of Sciences, 46(6):824–831, 1960.

[26] Jayanth R Banavar, Amos Maritan, and Andrea Rinaldo. Size and form in efficient transportation networks. Nature, 399(6732):130–132, 1999.

[27] Haodong Yin, Baoming Han, Dewei Li, Jianjun Wu, and Huijun Sun. Modeling and simulating passenger behavior for a station closure in a rail transit network. PLoS ONE, 11(12):e0167126, 2016.

[28] Junbo Zhang, Yu Zheng, and Dekang Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. arXiv preprint arXiv:1610.00081, 2016.

[29] Akshat Kumar, Daniel Sheldon, and Biplav Srivastava. Collective diffusion over networks: Models and inference.
arXiv preprint arXiv:1309.6841, 2013.

[30] Jiali Du, Akshat Kumar, and Pradeep Varakantham. On understanding diffusion dynamics of patrons at a theme park. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, pages 1501–1502. International Foundation for Autonomous Agents and Multiagent Systems, 2014.

[31] Maciej Kurant and Patrick Thiran. Layered complex networks. Physical Review Letters, 96(13):138701, 2006.

[32] Yu Zheng and Xiaofang Zhou. Computing with spatial trajectories. Springer Science and Business Media, 2011.

[33] A Nuzzolo, U Crisalli, L Rosati, and A Ibeas. Stop: a short term transit occupancy prediction tool for aptis and real time transit management systems. In Intelligent Transportation Systems (ITSC), 2013 16th International IEEE Conference on, pages 1894–1899. IEEE, 2013.

[34] Bo Friis Nielsen, Laura Frølich, Otto Anker Nielsen, and Dorte Filges. Estimating passenger numbers in trains using existing weighing capabilities. Transportmetrica A: Transport Science, 10(6):502–517, 2014.

[35] Gilles Vandewiele, Pieter Colpaert, Olivier Janssens, Joachim Van Herwegen, Ruben Verborgh, Erik Mannens, Femke Ongenae, and Filip De Turck. Predicting train occupancies based on query logs and external data sources. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 1469–1474. International World Wide Web Conferences Steering Committee, 2017.

[36] Transport For London. Tube map. https://tfl.gov.uk/cdn/static/cms/documents/standard-tube-map.pdf.

[37] Transport For London. NetMIS dataset. http://lu.uat.cds.co.uk/Ops_maintenance/Library_tools/Apps_tools/696.html.

[38] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[39] Samuel Gershman and Noah Goodman. Amortized inference in probabilistic reasoning.
In CogSci, 2014.

[40] Prem K Gopalan, Sean Gerrish, Michael Freedman, David M Blei, and David M Mimno. Scalable inference of overlapping communities. In Advances in Neural Information Processing Systems, pages 2249–2257, 2012.

[41] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1–9, 2015.

[42] Olivier Canévet, Cijo Jose, and François Fleuret. Importance sampling tree for large-scale empirical expectation. In International Conference on Machine Learning, pages 1454–1462, 2016.