{"title": "Distributed Estimation, Information Loss and Exponential Families", "book": "Advances in Neural Information Processing Systems", "page_first": 1098, "page_last": 1106, "abstract": "Distributed learning of probabilistic models from multiple data repositories with minimum communication is increasingly important. We study a simple communication-efficient learning framework that first calculates the local maximum likelihood estimates (MLE) based on the data subsets, and then combines the local MLEs to achieve the best possible approximation to the global MLE, based on the whole dataset jointly. We study the statistical properties of this framework, showing that the loss of efficiency compared to the global setting relates to how much the underlying distribution families deviate from full exponential families, drawing connection to the theory of information loss by Fisher, Rao and Efron. We show that the \"full-exponential-family-ness\" represents the lower bound of the error rate of arbitrary combinations of local MLEs, and is achieved by a KL-divergence-based combination method but not by a more common linear combination method. We also study the empirical properties of the KL and linear combination methods, showing that the KL method significantly outperforms linear combination in practical settings with issues such as model misspecification, non-convexity, and heterogeneous data partitions.", "full_text": "Distributed Estimation, Information Loss and Exponential Families

Qiang Liu    Alexander Ihler
Department of Computer Science, University of California, Irvine
qliu1@uci.edu    ihler@ics.uci.edu

Abstract

Distributed learning of probabilistic models from multiple data repositories with minimum communication is increasingly important. 
We study a simple communication-efficient learning framework that first calculates the local maximum likelihood estimates (MLE) based on the data subsets, and then combines the local MLEs to achieve the best possible approximation to the global MLE given the whole dataset. We study this framework's statistical properties, showing that the efficiency loss compared to the global setting relates to how much the underlying distribution families deviate from full exponential families, drawing connection to the theory of information loss by Fisher, Rao and Efron. We show that the "full-exponential-family-ness" represents the lower bound of the error rate of arbitrary combinations of local MLEs, and is achieved by a KL-divergence-based combination method but not by a more common linear combination method. We also study the empirical properties of both methods, showing that the KL method significantly outperforms linear combination in practical settings with issues such as model misspecification, non-convexity, and heterogeneous data partitions.

1 Introduction

Modern data-science applications increasingly require distributed learning algorithms to extract information from many data repositories stored at different locations with minimal interaction. Such distributed settings are created due to high communication costs (for example in sensor networks), or privacy and ownership issues (such as sensitive medical or financial data). 
Traditional algorithms often require access to the entire dataset simultaneously, and are not suitable for distributed settings. We consider a straightforward two-step procedure for distributed learning that follows a "divide and conquer" strategy: (i) local learning, which involves learning probabilistic models based on the local data repositories separately, and (ii) model combination, where the local models are transmitted to a central node (the "fusion center") and combined to form a global model that integrates the information in the local repositories. This framework only requires transmitting the local model parameters to the fusion center once, yielding significant advantages in terms of both communication and privacy constraints. However, the two-step procedure may not fully extract all the information in the data, and may be less (statistically) efficient than a corresponding centralized learning algorithm that operates globally on the whole dataset. This raises important challenges in understanding the fundamental statistical limits of the local learning framework, and in proposing optimal combination methods to best approximate the global learning algorithm.

In this work, we study these problems in the setting of estimating generative model parameters from a distribution family via the maximum likelihood estimator (MLE). We show that the loss of statistical efficiency caused by using the local learning framework is related to how much the underlying distribution families deviate from full exponential families: local learning can be as efficient as (in fact exactly equivalent to) global learning on full exponential families, but is less efficient on non-exponential families, depending on how nearly "full exponential family" they are. 
The "full-exponential-family-ness" is formally captured by the statistical curvature originally defined by Efron (1975), and is a measure of the minimum loss of Fisher information when summarizing the data using first order efficient estimators (e.g., Fisher, 1925, Rao, 1963). Specifically, we show that arbitrary combinations of the local MLEs on the local datasets can approximate the global MLE on the whole dataset at most up to an asymptotic error rate proportional to the square of the statistical curvature. In addition, a KL-divergence-based combination of the local MLEs achieves this minimum error rate in general, and exactly recovers the global MLE on full exponential families. In contrast, a more widely-used linear combination method does not achieve the optimal error rate, and makes mistakes even on full exponential families. We also study the two methods empirically, examining their robustness against practical issues such as model mis-specification, heterogeneous data partitions, and the existence of hidden variables (e.g., in the Gaussian mixture model). These issues often cause the likelihood to have multiple local optima, and can easily degrade the linear combination method. On the other hand, the KL method remains robust in these practical settings.

Related Work. Our work is related to Zhang et al. (2013a), which includes a theoretical analysis for linear combination. Merugu and Ghosh (2003, 2006) proposed the KL combination method in the setting of Gaussian mixtures, but without theoretical analysis. There are many recent theoretical works on distributed learning (e.g., Predd et al., 2007, Balcan et al., 2012, Zhang et al., 2013b, Shamir, 2013), but most focus on discrimination tasks like classification and regression. 
There are also many works on distributed clustering (e.g., Merugu and Ghosh, 2003, Forero et al., 2011, Liang et al., 2013) and distributed MCMC (e.g., Scott et al., 2013, Wang and Dunson, 2013, Neiswanger et al., 2013). An orthogonal setting of distributed learning is when the data is split across the variable dimensions, instead of the data instances; see e.g., Liu and Ihler (2012), Meng et al. (2013).

2 Problem Setting

Assume we have an i.i.d. sample X = {x_i : i = 1, ..., n}, partitioned into d sub-samples X^k = {x_i : i ∈ α_k} that are stored in different locations, where ∪_{k=1}^d α_k = [n]. For simplicity, we assume the data are equally partitioned, so that each group has n/d instances; the extension to the more general case is straightforward. Assume X is drawn i.i.d. from a distribution with an unknown density from a distribution family {p(x|θ) : θ ∈ Θ}. Let θ∗ be the true unknown parameter. We are interested in estimating θ∗ via the maximum likelihood estimator (MLE) based on the whole sample,

θ̂_mle = arg max_{θ∈Θ} Σ_{i∈[n]} log p(x_i | θ).

However, directly calculating the global MLE often requires distributed optimization algorithms (such as ADMM (Boyd et al., 2011)) that need iterative communication between the local repositories and the fusion center, which can significantly slow down the algorithm regardless of the amount of information communicated at each iteration. We instead approximate the global MLE by a two-stage procedure that calculates the local MLEs separately for each sub-sample, then sends the local MLEs to the fusion center and combines them. Specifically, the k-th sub-sample's local MLE is

θ̂_k = arg max_{θ∈Θ} Σ_{i∈α_k} log p(x_i | θ),

and we want to construct a combination function f(θ̂_1, ..., θ̂_d) → θ̂_f to form the best approximation to the global MLE θ̂_mle. 
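As a concrete illustration of the two-step procedure (our own sketch, not code from the paper), consider the exponential distribution p(x|θ) = θ exp(−θx), whose local MLE has the closed form θ̂_k = 1/mean(X^k); taking f to be the plain average of the θ̂_k gives one simple choice of combination function:

```python
# Sketch of the two-step distributed MLE framework, illustrated on the
# exponential distribution p(x|theta) = theta*exp(-theta*x), whose MLE is
# 1/mean(X). Variable names are ours, not the paper's.
import random
from statistics import mean

random.seed(0)
n, d = 10000, 10
X = [random.expovariate(2.0) for _ in range(n)]        # true theta = 2

# Step (i): local learning on d equally sized sub-samples.
subsamples = [X[k * (n // d):(k + 1) * (n // d)] for k in range(d)]
local_mles = [1.0 / mean(xk) for xk in subsamples]

# Step (ii): combine at the fusion center; here f is a plain average.
theta_f = mean(local_mles)
theta_global = 1.0 / mean(X)                           # centralized MLE

# The combined estimate is consistent, but it is not the global MLE here:
# averaging the 1/mean(X^k) differs from 1/mean(X) by Jensen's inequality.
assert abs(theta_f - 2.0) < 0.2
assert abs(theta_f - theta_global) > 0
```

Only one round of communication is needed: each site transmits a single scalar θ̂_k.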
Perhaps the most straightforward combination is the linear average,

Linear-Averaging:    θ̂_linear = (1/d) Σ_k θ̂_k.

However, this method is obviously limited to continuous and additive parameters; in the sequel, we illustrate that it also tends to degenerate in the presence of practical issues such as non-convexity and non-i.i.d. data partitions. A better combination method is to average the models w.r.t. some distance metric, instead of the parameters. In particular, we consider a KL-divergence based averaging,

KL-Averaging:    θ̂_KL = arg min_{θ∈Θ} Σ_k KL( p(x|θ̂_k) ‖ p(x|θ) ).    (1)

The estimate θ̂_KL can also be motivated by a parametric bootstrap procedure that first draws a sample X^k′ from each local model p(x|θ̂_k), and then estimates a global MLE based on all the combined bootstrap samples X′ = {X^k′ : k ∈ [d]}. We can readily show that this reduces to θ̂_KL as the size of the bootstrapped samples X^k′ grows to infinity. Other combination methods based on different distance metrics are also possible, but may not have a similarly natural interpretation.

3 Exactness on Full Exponential Families

In this section, we analyze the KL and linear combination methods on full exponential families. We show that the KL combination of the local MLEs exactly equals the global MLE, while the linear average does not in general, but can be made exact by using a special parameterization. This suggests that distributed learning is in some sense "easy" on full exponential families.

Definition 3.1. (1). 
A family of distributions is said to be a full exponential family if its density can be represented in a canonical form (up to one-to-one transforms of the parameters),

p(x|θ) = exp( θᵀ φ(x) − log Z(θ) ),    θ ∈ Θ ≡ { θ ∈ R^m : ∫_x exp(θᵀ φ(x)) dH(x) < ∞ },

where θ = [θ_1, ..., θ_m]ᵀ and φ(x) = [φ_1(x), ..., φ_m(x)]ᵀ are called the natural parameters and the natural sufficient statistics, respectively. The quantity Z(θ) is the normalization constant, and H(x) is the reference measure. (2). An exponential family is said to be minimal if [1, φ_1(x), ..., φ_m(x)]ᵀ is linearly independent, that is, there is no non-zero constant vector α such that αᵀ φ(x) = 0 for all x.

Theorem 3.2. If P = {p(x|θ) : θ ∈ Θ} is a full exponential family, then the KL-average θ̂_KL always exactly recovers the global MLE, that is, θ̂_KL = θ̂_mle. Further, if P is minimal, we have

θ̂_KL = μ⁻¹( ( μ(θ̂_1) + ⋯ + μ(θ̂_d) ) / d ),    (2)

where μ : θ ↦ E_θ[φ(x)] is the one-to-one map from the natural parameters to the moment parameters, and μ⁻¹ is the inverse map of μ. Note that we have μ(θ) = ∂ log Z(θ) / ∂θ.

Proof. Directly verify that the KL objective in (1) equals the global negative log-likelihood.

The nonlinear average in (2) gives an intuitive interpretation of why θ̂_KL equals θ̂_mle on full exponential families: it first calculates the local empirical moment parameters μ(θ̂_k) = (d/n) Σ_{i∈α_k} φ(x_i); averaging them gives the empirical moment parameter on the whole data, μ̂_n = (1/n) Σ_{i∈[n]} φ(x_i), which then exactly maps to the global MLE.

Eq (2) also suggests that θ̂_linear would be exact only if μ(⋅) is an identity map. 
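Eq. (2) can be checked numerically; the sketch below (our own illustration, with hypothetical variable names) uses the exponential distribution, a full exponential family whose moment map is μ(θ) = E[x] = 1/θ. Averaging the local MLEs in moment space and mapping back recovers the global MLE exactly, while the plain linear average does not:

```python
# Numerical check of Eq. (2) / Theorem 3.2 on the exponential distribution:
# mu(theta) = E[x] = 1/theta maps natural to moment parameters, and the
# KL average is mu^{-1} of the averaged local moments. (Our sketch.)
import random
from statistics import mean

random.seed(1)
n, d = 10000, 10
X = [random.expovariate(2.0) for _ in range(n)]
subsamples = [X[k * (n // d):(k + 1) * (n // d)] for k in range(d)]
local_mles = [1.0 / mean(xk) for xk in subsamples]

mu = lambda t: 1.0 / t                  # natural -> moment parameters; here
mu_inv = mu                             # the map happens to be self-inverse

theta_kl = mu_inv(mean(mu(t) for t in local_mles))
theta_global = 1.0 / mean(X)            # global MLE on the pooled sample

assert abs(theta_kl - theta_global) < 1e-9           # exact recovery
assert abs(mean(local_mles) - theta_global) > 1e-4   # linear average differs
```

The exactness holds because averaging μ(θ̂_k) is, with equal partition sizes, exactly the empirical moment on the pooled data.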
Therefore, one may make θ̂_linear exact by using the special parameterization ϑ = μ(θ). In contrast, KL-averaging makes this reparameterization automatically (μ is different on different exponential families). Note that both KL-averaging and the global MLE are invariant w.r.t. one-to-one transforms of the parameter θ, but linear averaging is not.

Example 3.3 (Variance Estimation). Consider estimating the variance σ² of a zero-mean Gaussian distribution. Let ŝ_k = (d/n) Σ_{i∈α_k} (x_i)² be the empirical variance on the k-th sub-sample and ŝ = Σ_k ŝ_k / d the overall empirical variance. Then, θ̂_linear corresponds to different power means of the ŝ_k, depending on the choice of parameterization, e.g.,

θ = σ² (variance):            θ̂_linear = (1/d) Σ_k ŝ_k,
θ = σ (standard deviation):   θ̂_linear = (1/d) Σ_k (ŝ_k)^{1/2},
θ = σ⁻² (precision):          θ̂_linear = (1/d) Σ_k (ŝ_k)⁻¹,

where only the linear average of the ŝ_k (when θ = σ²) matches the overall empirical variance ŝ and equals the global MLE. In contrast, θ̂_KL always corresponds to a linear average of the ŝ_k, equaling the global MLE, regardless of the parameterization.

4 Information Loss in Distributed Learning

The exactness of θ̂_KL in Theorem 3.2 is due to the beauty (or simplicity) of exponential families. Following Efron's intuition, full exponential families can be viewed as "straight lines" or "linear subspaces" in the space of distributions, while other distribution families correspond to "curved" sets of distributions, whose deviation from full exponential families can be measured by their statistical curvatures as defined by Efron (1975). 
That work shows that statistical curvature is closely related to Fisher and Rao's theory of second order efficiency (Fisher, 1925, Rao, 1963), and represents the minimum information loss when summarizing the data using first order efficient estimators. In this section, we connect this classical theory with the local learning framework, and show that the statistical curvature also represents the minimum asymptotic deviation of arbitrary combinations of the local MLEs from the global MLE, and that this is achieved by the KL combination method, but not in general by the simpler linear combination method.

4.1 Curved Exponential Families and Statistical Curvature

We follow the convention in Efron (1975), and illustrate the idea of statistical curvature using curved exponential families, which are smooth sub-families of full exponential families. The theory can be naturally extended to more general families (see e.g., Efron, 1975, Kass and Vos, 2011). Following Kass and Vos (2011), we assume some regularity conditions for our asymptotic analysis.

Definition 4.1. A family of distributions {p(x|θ) : θ ∈ Θ} is said to be a curved exponential family if its density can be represented as

p(x|θ) = exp( η(θ)ᵀ φ(x) − log Z(η(θ)) ),    (3)

where the dimension of θ = [θ_1, ..., θ_q] is assumed to be smaller than that of η = [η_1, ..., η_m] and φ = [φ_1, ..., φ_m], that is, q < m.

Assume Θ is an open set in R^q, and the mapping η : Θ → η(Θ) is one-to-one, infinitely differentiable, and of rank q, meaning that the q×m matrix η̇(θ) has rank q everywhere. In addition, if a sequence {η(θ_i)} converges to a point η(θ_0), then {θ_i ∈ Θ} must converge to θ_0. 
In geometric terminology, such a map η : Θ → η(Θ) is called a q-dimensional embedding in R^m.

Obviously, a curved exponential family can be treated as a smooth subset of a full exponential family p(x|η) = exp( ηᵀ φ(x) − log Z(η) ), with η constrained in η(Θ). If η(θ) is a linear function, then the curved exponential family can be rewritten into a full exponential family in lower dimensions; otherwise, η(θ) is a curved subset in the η-space, whose curvature – its deviation from planes or straight lines – represents its deviation from full exponential families.

Consider the case when θ is a scalar, and hence η(θ) is a curve; the geometric curvature γ_θ of η(θ) at point θ is defined to be the reciprocal of the radius of the circle that fits best to η(θ) locally at θ. Therefore, the curvature of a circle of radius r is a constant 1/r. In general, elementary calculus shows that

γ²_θ = ( η̇_θᵀ η̇_θ )⁻³ ( η̈_θᵀ η̈_θ ⋅ η̇_θᵀ η̇_θ − ( η̈_θᵀ η̇_θ )² ).

The statistical curvature of a curved exponential family is defined similarly, except that the space is equipped with an inner product defined via its Fisher information metric.

Definition 4.2 (Statistical Curvature). Consider a curved exponential family P = {p(x|θ) : θ ∈ Θ} whose parameter θ is a scalar (q = 1). Let Σ_θ = cov_θ[φ(x)] be the m×m Fisher information on the corresponding full exponential family p(x|η). The statistical curvature of P at θ is defined as

γ²_θ = ( η̇_θᵀ Σ_θ η̇_θ )⁻³ [ ( η̈_θᵀ Σ_θ η̈_θ )( η̇_θᵀ Σ_θ η̇_θ ) − ( η̈_θᵀ Σ_θ η̇_θ )² ].

The definition can be extended to general multi-dimensional parameters, but requires involved notation. We give the full definition and our general results in the appendix.
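The formula in Definition 4.2 can be checked numerically; the sketch below (our own code, with hypothetical helper names) evaluates it for the ellipse family of Example 4.3, where Σ_θ is the identity, and compares the result against the closed-form geometric curvature stated there:

```python
# Numerical check (ours) of Definition 4.2 on the ellipse of Example 4.3:
# with Sigma_theta equal to the identity, the statistical curvature reduces
# to the geometric curvature a*b*(a^2 sin^2 t + b^2 cos^2 t)^(-3/2).
import math

def statistical_curvature_sq(eta_dot, eta_ddot, Sigma):
    # gamma^2 = (d' S d)^{-3} * [ (dd' S dd)(d' S d) - (dd' S d)^2 ]
    quad = lambda u, v: sum(u[i] * Sigma[i][j] * v[j]
                            for i in range(2) for j in range(2))
    A = quad(eta_dot, eta_dot)
    B = quad(eta_ddot, eta_ddot)
    C = quad(eta_ddot, eta_dot)
    return A ** -3 * (B * A - C ** 2)

a, b, t = 1.0, 5.0, math.pi / 4
eta_dot  = (-a * math.sin(t),  b * math.cos(t))    # first derivative of eta
eta_ddot = (-a * math.cos(t), -b * math.sin(t))    # second derivative
identity = [[1.0, 0.0], [0.0, 1.0]]

gamma = math.sqrt(statistical_curvature_sq(eta_dot, eta_ddot, identity))
closed_form = a * b * (a**2 * math.sin(t)**2 + b**2 * math.cos(t)**2) ** -1.5
assert abs(gamma - closed_form) < 1e-12
```

Swapping in a non-identity Σ would give the genuinely statistical (rather than geometric) curvature.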
Example 4.3 (Bivariate Normal on Ellipse). Consider a bivariate normal distribution with diagonal covariance matrix and mean vector restricted to an ellipse η(θ) = [a cos(θ), b sin(θ)], that is,

p(x|θ) ∝ exp( −(1/2)(x_1² + x_2²) + a cos(θ) x_1 + b sin(θ) x_2 ),    θ ∈ (−π, π), x ∈ R².

We have that Σ_θ equals the identity matrix in this case, and the statistical curvature equals the geometric curvature of the ellipse in the Euclidean space, γ_θ = ab ( a² sin²(θ) + b² cos²(θ) )^{−3/2}.

The statistical curvature was originally defined by Efron (1975) as the minimum amount of information loss when summarizing the sample using first order efficient estimators. Extending the result of Fisher (1925) and Rao (1963), Efron (1975) showed that

lim_{n→∞} [ I^X_{θ∗} − I^{θ̂_mle}_{θ∗} ] = γ²_{θ∗} I_{θ∗},    (4)

where I_{θ∗} is the Fisher information (per data instance) of the distribution p(x|θ) at the true parameter θ∗, I^X_{θ∗} = n I_{θ∗} is the total information included in a sample X of size n, and I^{θ̂_mle}_{θ∗} is the Fisher information included in θ̂_mle based on X. Intuitively speaking, we lose about γ²_{θ∗} units of Fisher information when summarizing the data using the ML estimator. 
Fisher (1925) also interpreted γ²_{θ∗} as the effective number of data instances lost in the MLE, easily seen from rewriting I^{θ̂_mle}_{θ∗} ≈ (n − γ²_{θ∗}) I_{θ∗}, as compared to I^X_{θ∗} = n I_{θ∗}. Moreover, this is the minimum possible information loss in the class of "first order efficient" estimators T(X), those which satisfy the weaker condition lim_{n→∞} I^T_{θ∗} / I^X_{θ∗} = 1. Rao coined the term "second order efficiency" for this property of the MLE.

The intuition here has direct implications for our distributed setting, since θ̂_f depends on the data only through {θ̂_k}, each of which summarizes the data with a loss of γ²_{θ∗} units of information. The total information loss is d ⋅ γ²_{θ∗} overall, in contrast with the global MLE, which only loses γ²_{θ∗}. Therefore, the additional loss due to the distributed setting is (d − 1) ⋅ γ²_{θ∗}. We will see that our results in the sequel closely match this intuition.

4.2 Lower Bound

The extra information loss (d − 1) γ²_{θ∗} turns out to be the asymptotic lower bound of the mean square error rate n² E_{θ∗}[ I_{θ∗} |θ̂_f − θ̂_mle|² ] for any arbitrary combination function f(θ̂_1, ..., θ̂_d).

Theorem 4.4 (Lower Bound). For an arbitrary measurable function θ̂_f = f(θ̂_1, ..., θ̂_d), we have

lim inf_{n→+∞} n² E_{θ∗}[ |f(θ̂_1, ..., θ̂_d) − θ̂_mle|² ] ≥ (d − 1) γ²_{θ∗} I⁻¹_{θ∗}.

Sketch of Proof. Note that

E_{θ∗}[|θ̂_f − θ̂_mle|²] = E_{θ∗}[|θ̂_f − E_{θ∗}(θ̂_mle | θ̂_1, ..., θ̂_d)|²] + E_{θ∗}[|θ̂_mle − E_{θ∗}(θ̂_mle | θ̂_1, ..., θ̂_d)|²]
  ≥ E_{θ∗}[|θ̂_mle − E_{θ∗}(θ̂_mle | θ̂_1, ..., θ̂_d)|²]
  = E_{θ∗}[ var_{θ∗}(θ̂_mle | θ̂_1, ..., θ̂_d) ],

where the lower bound is achieved when θ̂_f = E_{θ∗}(θ̂_mle | θ̂_1, ..., θ̂_d). The conclusion follows by showing that lim_{n→+∞} E_{θ∗}[ var_{θ∗}(θ̂_mle | θ̂_1, ..., θ̂_d) ] = (d − 1) γ²_{θ∗} I⁻¹_{θ∗}; this requires involved asymptotic analysis, and is presented in the Appendix.

The proof above highlights a geometric interpretation via the projection of random variables (e.g., Van der Vaart, 2000). Let F be the set of all random variables in the form of f(θ̂_1, ..., θ̂_d). The optimal consensus function should be the projection of θ̂_mle onto F, and the minimum mean square error is the distance between θ̂_mle and F. The conditional expectation θ̂_f = E_{θ∗}(θ̂_mle | θ̂_1, ..., θ̂_d) is the exact projection and ideally the best combination function; however, this is intractable to calculate due to the dependence on the unknown true parameter θ∗. We show in the sequel that θ̂_KL gives an efficient approximation and achieves the same asymptotic lower bound.

4.3 General Consistent Combination

We now analyze the performance of a general class of θ̂_f, which includes both the KL average θ̂_KL and the linear average θ̂_linear; we show that θ̂_KL matches the lower bound in Theorem 4.4, while θ̂_linear is not optimal even on full exponential families. We start by defining conditions which any "reasonable" f(θ̂_1, ..., θ̂_d) should satisfy.

Definition 4.5. (1). We say f(⋅) is consistent if, for ∀θ ∈ Θ, θ_k → θ for ∀k ∈ [d] implies f(θ_1, ..., θ_d) → θ.
(2). f(⋅) is symmetric if f(θ̂_1, ..., θ̂_d) = f(θ̂_σ(1), ..., θ̂_σ(d)) for any permutation σ on [d].
The consistency condition guarantees that if all the θ̂_k are consistent estimators, then θ̂_f should also be consistent. The symmetry is also straightforward due to the symmetry of the data partition {X^k}. In fact, if f(⋅) is not symmetric, one can always construct a symmetric version that performs better or at least the same (see Appendix for details). We are now ready to present the main result.

Theorem 4.6. (1). Consider a consistent and symmetric θ̂_f = f(θ̂_1, ..., θ̂_d) as in Definition 4.5, whose first three orders of derivatives exist. Then, for curved exponential families in Definition 4.1,

E_{θ∗}[θ̂_f − θ̂_mle] = (d − 1)/n ⋅ β^f_{θ∗} + o(n⁻¹),
E_{θ∗}[|θ̂_f − θ̂_mle|²] = (d − 1)/n² ⋅ [ γ²_{θ∗} I⁻¹_{θ∗} + (d + 1)(β^f_{θ∗})² ] + o(n⁻²),

where β^f_{θ∗} is a term that depends on the choice of the combination function f(⋅). Note that the mean square error is consistent with the lower bound in Theorem 4.4, and is tight if β^f_{θ∗} = 0.

(2). The KL average θ̂_KL has β^f_{θ∗} = 0, and hence achieves the minimum bias and mean square error,

E_{θ∗}[θ̂_KL − θ̂_mle] = o(n⁻¹),    E_{θ∗}[|θ̂_KL − θ̂_mle|²] = (d − 1)/n² ⋅ γ²_{θ∗} I⁻¹_{θ∗} + o(n⁻²).

In particular, note that the bias of θ̂_KL is smaller in magnitude than that of a general θ̂_f with β^f_{θ∗} ≠ 0.

(4). The linear averaging θ̂_linear, however, does not achieve the lower bound in general. We have

β^linear_{θ∗} = I⁻²_{θ∗} ( η̈ᵀ_{θ∗} Σ_{θ∗} η̇_{θ∗} + (1/2) E_{θ∗}[ ∂³ log p(x|θ∗) / ∂θ³ ] ),

which is in general non-zero even for full exponential families.

(5). The MSE w.r.t. the global MLE θ̂_mle can be related to the MSE w.r.t. the true parameter θ∗, by

E_{θ∗}[|θ̂_KL − θ∗|²] = E_{θ∗}[|θ̂_mle − θ∗|²] + (d − 1)/n² ⋅ γ²_{θ∗} I⁻¹_{θ∗} + o(n⁻²),
E_{θ∗}[|θ̂_linear − θ∗|²] = E_{θ∗}[|θ̂_mle − θ∗|²] + (d − 1)/n² ⋅ [ γ²_{θ∗} I⁻¹_{θ∗} + 2(β^linear_{θ∗})² ] + o(n⁻²).

Proof. See Appendix for the proof and the general results for multi-dimensional parameters.

Theorem 4.6 suggests that θ̂_f − θ̂_mle = O_p(1/n) for any consistent f(⋅), which is smaller in magnitude than θ̂_mle − θ∗ = O_p(1/√n). Therefore, any consistent θ̂_f is first order efficient, in that its difference from the global MLE θ̂_mle is negligible compared to θ̂_mle − θ∗ asymptotically. This also suggests that the KL and linear methods perform roughly the same asymptotically in terms of recovering the true parameter θ∗. 
However, we need to treat this claim with caution because, as we demonstrate empirically, the linear method may significantly degenerate in the non-asymptotic region or when the conditions in Theorem 4.6 do not hold.

5 Experiments and Practical Issues

We present numerical experiments to demonstrate the correctness of our theoretical analysis. More importantly, we also study empirical properties of the linear and KL combination methods that are not enlightened by the asymptotic analysis. We find that the linear average tends to degrade significantly when its local models (θ̂_k) are not already close, for example due to small sample sizes, heterogeneous data partitions, or non-convex likelihoods (so that different local models find different local optima). In contrast, the KL combination is much more robust in practice.

Figure 1: Results on the toy model in Example 4.3. Panels (a)-(d) show the mean square errors and biases of the linear average θ̂_linear and the KL average θ̂_KL w.r.t. the global MLE θ̂_mle and the true parameter θ∗, respectively: (a) E(|θ̂_f − θ̂_mle|²), (b) E(θ̂_f − θ̂_mle), (c) E(|θ̂_f − θ∗|²), (d) E(θ̂_f − θ∗). The y-axes are shown on logarithmic (base 10) scales.

5.1 Bivariate Normal on Ellipse

We start with the toy model in Example 4.3 to verify our theoretical results. We draw samples from the true model (assuming θ∗ = π/4, a = 1, b = 5), and partition the samples randomly into 10 sub-groups (d = 10). 
Fig. 1 shows that the empirical biases and MSEs match closely with the theoretical predictions when the sample size is large (e.g., n ≥ 250), and that θ̂_KL is consistently better than θ̂_linear in terms of recovering both the global MLE and the true parameters. Fig. 1(b) shows that the bias of θ̂_KL decreases faster than that of θ̂_linear, as predicted in Theorem 4.6 (2). Fig. 1(c) shows that all algorithms perform similarly in terms of the asymptotic MSE w.r.t. the true parameters θ∗, but the linear average degrades significantly in the non-asymptotic region (e.g., n < 250).

Model Misspecification. Model misspecification is unavoidable in practice, and may create multiple local modes in the likelihood objective, leading to poor behavior from the linear average. We illustrate this phenomenon using the toy model in Example 4.3, assuming the true model is N([0, 1/2], I_{2×2}), outside of the assumed parametric family. This is illustrated in the figure at right, where the ellipse represents the parametric family, and the black square denotes the true model. The MLE will concentrate on the projection of the true model onto the ellipse, in one of two locations (θ = ±π/2) indicated by the two red circles. Depending on the random data sample, the global MLE will concentrate on one or the other of these two values; see Fig. 2(a). Given a sufficient number of samples (n > 250), the probability that the MLE is at θ ≈ −π/2 (the less favorable mode) goes to zero. Fig. 2(b) shows that KL averaging mimics the bi-modal distribution of the global MLE across data samples; the less likely mode vanishes slightly slower. In contrast, the linear average takes the arithmetic average of local models from both of these two local modes, giving unreasonable parameter estimates that are close to neither (Fig. 2(c)).
Figure 2: Results on the toy model in Example 4.3 with model misspecification: scatter plots of the estimated parameters vs. the total sample size n (with 10,000 random trials for each fixed n), for (a) the global MLE θ̂_mle, (b) the KL average θ̂_KL, and (c) the linear average θ̂_linear. The inset figures show the densities of the estimated parameters with fixed n = 10. Both the global MLE and the KL average concentrate on two locations (±π/2), and the less favorable (−π/2) vanishes when the sample sizes are large (e.g., n > 250). In contrast, the linear approach averages local MLEs from the two modes, giving unreasonable estimates spread across the full interval.

Figure 3: Learning Gaussian mixture models on MNIST: training and test log-likelihoods of different methods with varying training size n, shown as (a) training and (b) test log-likelihood under a random partition, and (c) training and (d) test log-likelihood under a label-wise partition. In (a)-(b), the data are partitioned into 10 sub-groups uniformly at random (ensuring sub-samples are i.i.d.); in (c)-(d) the data are partitioned according to their digit labels. The number of mixture components is fixed to be 10.

Figure 4: Learning Gaussian mixture models on the YearPredictionMSD data set. 
The data are randomly partitioned into 10 sub-groups, and we use 10 mixture components. Panels: (a) training log-likelihood; (b) test log-likelihood.

5.2 Gaussian Mixture Models on Real Datasets

We next consider learning Gaussian mixture models. Because component indexes may be arbitrarily switched, naïve linear averaging is problematic; we consider a matched linear average that first matches indices by minimizing the sum of the symmetric KL divergences between the different mixture components. The KL average is also difficult to calculate exactly, since the KL divergence between Gaussian mixtures is intractable. We approximate the KL average using Monte Carlo sampling (with 500 samples per local model), corresponding to the parametric bootstrap discussed in Section 2.

We experiment on the MNIST dataset and the YearPredictionMSD dataset from the UCI repository, where the training data are partitioned into 10 sub-groups randomly and evenly. In both cases, we use the original training/test split; we use the full test set, and vary the number of training examples n by randomly sub-sampling from the full training set (averaging over 100 trials). We take the first 100 principal components when using MNIST. Fig. 3(a)-(b) and 4(a)-(b) show the training and test likelihoods. As a baseline, we also show the average of the log-likelihoods of the local models (marked as local MLEs in the figures); this corresponds to randomly selecting a local model as the combined model. We see that the KL average tends to perform as well as the global MLE, and remains stable even with small sample sizes. The naïve linear average performs badly even with large sample sizes.
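The two combination strategies just described — KL averaging via parametric bootstrap, and index matching by symmetric KL — can be sketched as follows. This is a minimal illustration of our own (function names and the univariate-Gaussian simplification are not from the paper, whose experiments use Gaussian mixtures); the matching step uses brute-force permutation search, where a scalable implementation would use the Hungarian algorithm.

```python
import itertools
import numpy as np

def kl_average(local_models, n_samples=500, seed=0):
    """KL average via parametric bootstrap: draw samples from each local
    Gaussian (mu, sigma), pool them, and refit the MLE on the pooled sample."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate(
        [rng.normal(mu, sigma, n_samples) for mu, sigma in local_models])
    return pooled.mean(), pooled.std()  # Gaussian MLE of the pooled sample

def sym_kl(p, q):
    """Symmetric KL divergence between univariate Gaussians p and q."""
    (m1, s1), (m2, s2) = p, q
    kl_pq = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
    kl_qp = np.log(s1 / s2) + (s2**2 + (m1 - m2)**2) / (2 * s1**2) - 0.5
    return kl_pq + kl_qp

def match_components(ref, other):
    """Align the components of `other` to those of `ref` by minimizing the
    total symmetric KL (brute force over permutations, for small mixtures)."""
    best = min(itertools.permutations(other),
               key=lambda perm: sum(sym_kl(r, o) for r, o in zip(ref, perm)))
    return list(best)
```

After matching, the matched linear average simply averages the aligned parameters component-wise, avoiding the index-switching problem of the naïve linear average.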
The matched linear average performs as badly as the naïve linear average when the sample size is small, but improves toward the global MLE as the sample size increases.

For MNIST, we also consider a severely heterogeneous data partition by splitting the images into 10 groups according to their digit labels. In this setup, each partition learns a local model only over its own digit, with no information about the other digits. Fig. 3(c)-(d) shows that the KL average still performs as well as the global MLE, but both the naïve and matched linear averages are much worse even with large sample sizes, due to the dissimilarity of the local models.

6 Conclusion and Future Directions

We study communication-efficient algorithms for learning generative models with distributed data. Analyzing both a common linear averaging technique and a less common KL-averaging technique provides both theoretical and empirical insights. Our analysis opens many important future directions, including extensions to high-dimensional inference and efficient approximations for complex machine learning models, such as LDA and neural networks.

Acknowledgements. This work was sponsored in part by NSF grants IIS-1065618 and IIS-1254071, and the US Air Force under Contract No. FA8750-14-C-0011 under DARPA's PPAML program.

References

Bradley Efron.
Defining the curvature of a statistical problem (with applications to second order efficiency). The Annals of Statistics, pages 1189–1242, 1975.

Ronald Aylmer Fisher. Theory of statistical estimation. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 22, pages 700–725. Cambridge Univ Press, 1925.

C. Radhakrishna Rao. Criteria of estimation in large samples. Sankhyā: The Indian Journal of Statistics, Series A, pages 189–206, 1963.

Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321–3363, 2013a.

Srujana Merugu and Joydeep Ghosh. Privacy-preserving distributed clustering using generative models. In IEEE Int'l Conf. on Data Mining (ICDM), pages 211–218. IEEE, 2003.

Srujana Merugu. Distributed learning using generative models. PhD thesis, University of Texas at Austin, 2006.

Joel B. Predd, Sanjeev R. Kulkarni, and H. Vincent Poor. Distributed learning in wireless sensor networks. John Wiley & Sons: Chichester, UK, 2007.

Maria-Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. Distributed learning, communication complexity and privacy. arXiv preprint arXiv:1204.3514, 2012.

Yuchen Zhang, John Duchi, Michael Jordan, and Martin J. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems (NIPS), pages 2328–2336, 2013b.

Ohad Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. arXiv preprint arXiv:1311.3494, 2013.

Pedro A. Forero, Alfonso Cano, and Georgios B. Giannakis. Distributed clustering using wireless sensor networks.
IEEE Journal of Selected Topics in Signal Processing, 5(4):707–724, 2011.

Yingyu Liang, Maria-Florina Balcan, and Vandana Kanchanapally. Distributed PCA and k-means clustering. In Big Learning Workshop, NIPS, 2013.

Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bayes and big data: The consensus Monte Carlo algorithm. In EFaB@Bayes 250 conference, volume 16, 2013.

Xiangyu Wang and David B. Dunson. Parallel MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605, 2013.

Willie Neiswanger, Chong Wang, and Eric Xing. Asymptotically exact, embarrassingly parallel MCMC. arXiv preprint arXiv:1311.4780, 2013.

Qiang Liu and Alexander Ihler. Distributed parameter estimation via pseudo-likelihood. In International Conference on Machine Learning (ICML), pages 1487–1494, July 2012.

Z. Meng, D. Wei, A. Wiesel, and A. O. Hero III. Distributed learning of Gaussian graphical models via marginal likelihoods. In Int'l Conf. on Artificial Intelligence and Statistics (AISTATS), 2013.

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

Robert E. Kass and Paul W. Vos. Geometrical foundations of asymptotic inference, volume 908. John Wiley & Sons, 2011.

Aad W. van der Vaart. Asymptotic statistics, volume 3. Cambridge University Press, 2000.