{"title": "Latent Bayesian melding for integrating individual and population models", "book": "Advances in Neural Information Processing Systems", "page_first": 3618, "page_last": 3626, "abstract": "In many statistical problems, a more coarse-grained model may be suitable for population-level behaviour, whereas a more detailed model is appropriate for accurate modelling of individual behaviour. This raises the question of how to integrate both types of models. Methods such as posterior regularization follow the idea of generalized moment matching, in that they allow matching expectations between two models, but sometimes both models are most conveniently expressed as latent variable models. We propose latent Bayesian melding, which is motivated by averaging the distributions over population statistics of both the individual-level and the population-level models under a logarithmic opinion pool framework. In a case study on electricity disaggregation, which is a type of single-channel blind source separation problem, we show that latent Bayesian melding leads to significantly more accurate predictions than an approach based solely on generalized moment matching.", "full_text": "Latent Bayesian melding for integrating individual and population models

Mingjun Zhong, Nigel Goddard, Charles Sutton
{mzhong,nigel.goddard,csutton}@inf.ed.ac.uk
School of Informatics
University of Edinburgh
United Kingdom

Abstract

In many statistical problems, a more coarse-grained model may be suitable for population-level behaviour, whereas a more detailed model is appropriate for accurate modelling of individual behaviour. This raises the question of how to integrate both types of models. Methods such as posterior regularization follow the idea of generalized moment matching, in that they allow matching expectations between two models, but sometimes both models are most conveniently expressed as latent variable models.
We propose latent Bayesian melding, which is motivated by averaging the distributions over population statistics of both the individual-level and the population-level models under a logarithmic opinion pool framework. In a case study on electricity disaggregation, which is a type of single-channel blind source separation problem, we show that latent Bayesian melding leads to significantly more accurate predictions than an approach based solely on generalized moment matching.

1 Introduction

Good statistical models of populations are often very different from good models of individuals. As an illustration, the population distribution over human height might be approximately normal, but to model an individual's height, we might use a more detailed discriminative model based on many features of the individual's genotype. As another example, in social network analysis, simple models like the preferential attachment model [3] replicate aggregate network statistics such as degree distributions, whereas to predict whether two individuals have a link, a social networking web site might well use a classifier with many features of each person's previous history. Of course every model of an individual implies a model of the population, but models whose goal is to model individuals tend to be necessarily more detailed.

These two styles of modelling represent different types of information, so it is natural to want to combine them. A recent line of research in machine learning has explored the idea of incorporating constraints into Bayesian models that are difficult to encode in standard prior distributions.
These methods, which include posterior regularization [9], learning with measurements [16], and the generalized expectation criterion [18], tend to follow a moment matching idea, in which expectations of the distribution of one model are encouraged to match values based on prior information.

Interestingly, these ideas have precursors in the statistical literature on simulation models. In particular, Bayesian melding [21] considers applications in which there is a computer simulation M that maps from model parameters θ to a quantity φ = M(θ). For example, M might summarize the output of a deterministic simulation of population dynamics or some other physical phenomenon. Bayesian melding considers the case in which we can build meaningful prior distributions over both θ and φ. These two prior distributions need to be merged because of the deterministic relationship; this is done using a logarithmic opinion pool [5]. We show that there is a close connection between Bayesian melding and the later work on posterior regularization, which does not seem to have been recognized in the machine learning literature. We also show that Bayesian melding has the additional advantage that it can be conveniently applied when both individual-level and population-level models contain latent variables, as would commonly be the case, e.g., if they were mixture models or hierarchical Bayesian models. We call this approach latent Bayesian melding.

We present a detailed case study of latent Bayesian melding in the domain of energy disaggregation [11, 20], which is a particular type of blind source separation (BSS) problem. The goal of the electricity disaggregation problem is to separate the total electricity usage of a building into a sum of source signals that describe the energy usage of individual appliances.
This problem is hard because the source signals are not identifiable, which motivates work that adds additional prior information into the model [14, 15, 20, 25, 26, 8]. We show that the latent Bayesian melding approach allows incorporation of new types of constraints into standard models for this problem, yielding a strong improvement in performance, in some cases amounting to a 50% error reduction over a moment matching approach.

2 The Bayesian melding approach

We briefly describe the Bayesian melding approach to integrating prior information in deterministic simulation models [21], which has seen wide application [1, 6, 23]. In the Bayesian modelling context, denote Y as the observation data, and suppose that the model includes unknown variables S, which could include model parameters and latent variables. We are then interested in the posterior

    p(S|Y) = p(Y)^{-1} p(Y|S) p_S(S).    (1)

However, in some situations, the variables S may be related to a new random variable τ by a deterministic simulation function f(·) such that τ = f(S). We call S and τ input and output variables. For example, in the energy disaggregation problem, the total energy consumption variable is τ = Σ_{t=1}^T S_t^T μ, where S_t are the state variables of a hidden Markov model (one-hot encoding) and μ is a vector containing the mean energy consumption of each state (see Section 5.2). Both τ and S are random variables, and so in the Bayesian context, the modellers usually choose appropriate priors p_τ(τ) and p_S(S) based on prior knowledge. However, given p_S(S), the map f naturally introduces another prior for τ, which is an induced prior denoted by p*_τ(τ). Therefore, there are two different priors for the same variable τ from different sources, which might not be consistent. In the energy disaggregation example, p*_τ is induced by the state variables S_t of the hidden Markov model, which is the individual model of a specific household, and p_τ could be modelled by using population information, e.g. from a national survey; we can think of this as a population model since it combines information from many households. The Bayesian melding approach combines the two priors into one by using the logarithmic pooling method, so that the logarithmically pooled prior is p̃_τ(τ) ∝ p*_τ(τ)^α p_τ(τ)^{1-α}, where 0 ≤ α ≤ 1. The prior p̃_τ melds the prior information of both S and τ. In the model (1), the prior p_S does not include information about τ. Thus it is required to derive a melded prior for S. If f is invertible, the prior for S can be obtained by using the change-of-variable technique. If f is not invertible, Poole and Raftery [21] heuristically derived a melded prior

    p̃_S(S) = c_α p_S(S) ( p_τ(f(S)) / p*_τ(f(S)) )^{1-α}    (2)

where c_α is a constant given α such that ∫ p̃_S(S) dS = 1. This gives a new posterior p̃(S|Y) = p̃(Y)^{-1} p(Y|S) p̃_S(S). Note that it would be interesting to infer α [22, 7]; however, we use a fixed value in this paper. So far we have been assuming there are no latent variables in p_τ.
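The melded prior (2) can be sketched numerically. Below is a minimal toy example (all numbers invented, not from the paper): the individual model is a small HMM prior over states, the induced prior p*_τ is moment-matched to a Gaussian from Monte Carlo samples of f(S) (as done later in Section 6), and the population prior p_τ is a Gaussian; only the melding factor of eq. (2) is shown, with log p_S(S) passed in as a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy individual model: a 2-state HMM prior p_S over T steps,
# with per-state mean consumptions mu (invented numbers).
T, mu = 50, np.array([0.0, 1.2])
pi, P = np.array([0.9, 0.1]), np.array([[0.95, 0.05], [0.10, 0.90]])

def sample_states(rng):
    z = [rng.choice(2, p=pi)]
    for _ in range(T - 1):
        z.append(rng.choice(2, p=P[z[-1]]))
    return np.array(z)

def tau(z):
    # deterministic map tau = f(S): total energy of the state path
    return mu[z].sum()

# Induced prior p*_tau: moment-match a Gaussian to Monte Carlo samples of f(S).
samples = np.array([tau(sample_states(rng)) for _ in range(2000)])
mu_star, sd_star = samples.mean(), samples.std()

def log_gauss(x, m, s):
    return -0.5 * ((x - m) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))

# Population prior p_tau on total energy (invented survey numbers).
mu_pop, sd_pop = 20.0, 5.0

def melded_log_prior(z, log_pS_z, alpha=0.5):
    """log of eq. (2), up to the constant c_alpha."""
    t = tau(z)
    return log_pS_z + (1 - alpha) * (log_gauss(t, mu_pop, sd_pop)
                                     - log_gauss(t, mu_star, sd_star))

z = sample_states(rng)
print(float(tau(z)), float(melded_log_prior(z, log_pS_z=0.0)))
```

Note that with α = 1 the melding factor vanishes and the melded prior reduces to p_S, matching the pooling formula above.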
We now consider the situation when τ is generated by some latent variables.

3 The latent Bayesian melding approach

It is common that the variable τ is modelled by a latent variable ξ; see the examples in Section 5.2. So we could assume that we have a conditional distribution p(τ|ξ) and a prior distribution p_ξ(ξ). This defines a marginal distribution p_τ(τ) = ∫ p_ξ(ξ) p(τ|ξ) dξ. This could be used to produce the melded prior (2) of the Bayesian melding approach

    p̃_S(S) = c_α p_S(S) ( ∫ p_τ(f(S)|ξ) p_ξ(ξ) dξ / p*_τ(f(S)) )^{1-α}.    (3)

The integration in (3) is generally intractable. We could employ the Monte Carlo method to approximate it for a fixed τ. However, importantly, we are also interested in inferring the latent variable ξ, which is meaningful, for example, in the energy disaggregation problem. When we are interested in finding the maximum a posteriori (MAP) value of the posterior where p̃_S(S) is used as the prior, we propose to use a rough approximation ∫ p_ξ(ξ) p_τ(τ|ξ) dξ ≈ max_ξ p_ξ(ξ) p_τ(τ|ξ). This leads to an approximate prior

    p̃_S(S) ≈ max_ξ p̃_{S,ξ}(S, ξ) = max_ξ c_α p_S(S) ( p_τ(f(S)|ξ) p_ξ(ξ) / p*_τ(f(S)) )^{1-α}.    (4)

To obtain this approximate prior for S, the joint prior p̃_{S,ξ}(S, ξ) has to exist, and so we show that it does exist under certain conditions by the following theorem. We assume that S and ξ are continuous random variables, and that both p*_τ and p_τ are positive and share the same support. Also, E_{p_S(S)}[·] denotes the expectation with respect to p_S.

Theorem 1. If E_{p_S(S)}[ p_τ(f(S)) / p*_τ(f(S)) ] < ∞, then a constant c_α < ∞ exists such that ∫∫ p̃_{S,ξ}(S, ξ) dξ dS = 1, for any fixed α ∈ [0, 1].

The proof can be found in the supplementary materials. In (4) we heuristically derived an approximate joint prior p̃_{S,ξ}. Interestingly, if ξ and S are independent conditional on τ, we can show as follows that p̃_{S,ξ} is a limit distribution derived from a joint distribution of ξ and S induced by τ. To see this, we derive a joint prior for S and ξ,

    p_{S,ξ}(S, ξ) = ∫ p(S, ξ|τ) p_τ(τ) dτ = ∫ p(S|τ) p(ξ|τ) p_τ(τ) dτ
                  = ∫ ( p(τ|S) p_S(S) / p*_τ(τ) ) ( p(τ|ξ) p_ξ(ξ) / p_τ(τ) ) p_τ(τ) dτ
                  = p_S(S) p_ξ(ξ) ∫ p(τ|S) ( p(τ|ξ) / p*_τ(τ) ) dτ.

For a deterministic simulation τ = f(S), the distribution p(τ|S) = p(τ|S, τ = f(S)) is ill-defined due to Borel's paradox [24]. The distribution p(τ|S) depends on the parameterization. We assume that τ is uniform on [f(S) − δ, f(S) + δ] conditional on S, with δ > 0, and the distribution is then denoted by p_δ(τ|S). The marginal distribution is p_δ(τ) = ∫ p_δ(τ|S) p_S(S) dS. Denote g(τ) = p(τ|ξ)/p*_τ(τ) and g_δ(τ) = p(τ|ξ)/p_δ(τ). Then we have the following theorem.

Theorem 2.
If lim_{δ→0} p_δ(τ) = p*_τ(τ), and g_δ(τ) has bounded derivatives of any order, then lim_{δ→0} ∫ p_δ(τ|S) g_δ(τ) dτ = g(f(S)).

See the supplementary materials for the proof. Under this parameterization, we denote p̂_{S,ξ}(S, ξ) = p_S(S) p_ξ(ξ) lim_{δ→0} ∫ p_δ(τ|S) g_δ(τ) dτ = p_S(S) p_ξ(ξ) p(f(S)|ξ) / p*_τ(f(S)). By applying the logarithmic pooling method, we have a joint prior

    p̃_{S,ξ}(S, ξ) = c_α (p_S(S))^α (p̂_{S,ξ}(S, ξ))^{1-α} = c_α p_S(S) ( p_τ(f(S)|ξ) p_ξ(ξ) / p*_τ(f(S)) )^{1-α}.

Since the joint prior blends the variable S and the latent variable ξ, we call this approximation the latent Bayesian melding (LBM) approach, which gives the posterior p̃(S, ξ|Y) = p̃(Y)^{-1} p(Y|S) p̃_{S,ξ}(S, ξ). Note that if there are no latent variables, then latent Bayesian melding collapses to the Bayesian melding approach. In Section 6 we will apply this method to an energy disaggregation problem for integrating population information with an individual model.

4 Related methods

We now discuss possible connections between Bayesian melding (BM) and other related methods. Recently in machine learning, moment matching methods have been proposed, e.g., posterior regularization (PR) [9], learning with measurements [16] and the generalized expectation criterion [18]. These methods share the common idea that the Bayesian models (or posterior distributions) are constrained by some observations or measurements to obtain a least-biased distribution. The idea is that the system we are modelling is too complex and unobservable, and thus we have limited prior information. To alleviate this problem, we assume we can obtain some observations of the system in some way, e.g., by experiments; for example, those observations could be the mean values of functions of the variables. Those observations could then guide the modelling of the system. Interestingly, a very similar idea has been employed in the bias correction method in information theory and statistics [12, 10, 19], where the least-biased distribution is obtained by optimizing the Kullback-Leibler divergence subject to the moment constraints. Note that the bias correction method of [17] differs from the others, in that the bias of a consistent estimator is corrected when the bias function can be estimated.

We now consider the posteriors derived by PR and BM. In general, given a function f(S) and values b_i, PR solves the constrained problem

    minimize_{p̃}  KL( p̃(S) || p(S|Y) )
    subject to  E_{p̃}[ m_i(f(S)) ] − b_i ≤ δ_i,  ||δ_i|| ≤ ε;  i = 1, 2, ..., I,

where m_i could be any function, such as a power function. This gives an optimal posterior p̃_PR(S) = Z(λ)^{-1} p(Y|S) p(S) Π_{i=1}^I exp(−λ_i m_i(f(S))), where Z(λ) is the normalizing constant. BM has a deterministic simulation f(S) = τ where τ ∼ p_τ. The posterior is then p̃_BM(S) = Z(α)^{-1} p(Y|S) p(S) ( p_τ(f(S)) / p*_τ(f(S)) )^{1-α}. They have a similar form, and the key difference is the last factor, which is derived from the constraints or the deterministic simulation. p̃_PR and p̃_BM are identical if −Σ_{i=1}^I λ_i m_i(f(S)) = (1 − α) log [ p_τ(f(S)) / p*_τ(f(S)) ].

The difference between BM and LBM is the latent variable ξ.
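The PR/BM correspondence above can be checked numerically: with a single constraint function m(f(S)) = −log[p_τ(f(S))/p*_τ(f(S))] and λ = 1 − α, the two extra log-factors coincide by construction. A small sketch with invented toy densities (not from the paper):

```python
import math

# Toy densities for the population prior p_tau and induced prior p*_tau.
def log_p_tau(t):       # population prior: N(20, 5^2), invented
    return -0.5 * ((t - 20.0) / 5.0) ** 2 - math.log(5.0 * math.sqrt(2 * math.pi))

def log_p_star(t):      # induced prior: N(15, 8^2), invented
    return -0.5 * ((t - 15.0) / 8.0) ** 2 - math.log(8.0 * math.sqrt(2 * math.pi))

alpha = 0.3
tau_val = 18.0          # tau = f(S) for some state configuration S

# BM's extra log-factor: (1 - alpha) * log(p_tau / p*_tau)
bm_factor = (1 - alpha) * (log_p_tau(tau_val) - log_p_star(tau_val))

# PR's extra log-factor: -lambda * m(f(S)) with
# m(t) = -log(p_tau(t) / p*_tau(t)) and lambda = 1 - alpha
def m(t):
    return -(log_p_tau(t) - log_p_star(t))

pr_factor = -(1 - alpha) * m(tau_val)

print(bm_factor, pr_factor)   # identical by construction
```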
We could perform BM by integrating out ξ in (3), but this is computationally expensive. Instead, LBM jointly models S and ξ, allowing possibly joint inference, which is an advantage over BM.

5 The energy disaggregation problem

In energy disaggregation, we are given a time series of energy consumption readings from a sensor. We consider the energy measured in watt hours as read from a household's electricity meter, which is denoted by Y = (Y_1, Y_2, ..., Y_T) where Y_t ∈ R_+. The recorded energy signal Y is assumed to be the aggregation of the consumption of individual appliances in the household. Suppose there are I appliances, and the energy consumption of each appliance is denoted by X_i = (X_{i1}, X_{i2}, ..., X_{iT}) where X_{it} ∈ R_+. The observed aggregate signal is assumed to be the sum of the component signals, so that Y_t = Σ_{i=1}^I X_{it} + ε_t where ε_t ∼ N(0, σ²). Given Y, the task is to infer the unknown component signals X_i. This is essentially the single-channel BSS problem, for which there is no unique solution. It can also be useful to add an extra component U = (U_1, U_2, ..., U_T) to model the unknown appliances, to make the model more robust, as proposed in [15]. The prior of U is defined as

    p(U) = (1 / v^{2(T−1)}) exp{ −(1/(2v²)) Σ_{t=1}^{T−1} |U_{t+1} − U_t| }.

The model then has a new form Y_t = Σ_{i=1}^I X_{it} + U_t + ε_t. A natural way to represent this model is as an additive factorial hidden Markov model (AFHMM) where the appliances are treated as HMMs [15, 20, 26]; this is now described.

5.1 The additive factorial hidden Markov model

In the AFHMM, each component signal X_i is represented by an HMM. We suppose there are K_i states for each X_{it}, and so the state variable is denoted by Z_{it} ∈ {1, 2, ..., K_i}. Since X_i is an HMM, the initial probabilities are π_{ik} = P(Z_{i1} = k) (k = 1, 2, ..., K_i), where Σ_{k=1}^{K_i} π_{ik} = 1; the mean values are μ_i = {μ_{i1}, μ_{i2}, ..., μ_{iK_i}} such that X_{it} ∈ μ_i; the transition probabilities are P^{(i)} = (p^{(i)}_{jk}), where p^{(i)}_{jk} = P(Z_{it} = j | Z_{i,t−1} = k) and Σ_{j=1}^{K_i} p^{(i)}_{jk} = 1. We denote all these parameters {π_i, μ_i, P^{(i)}} by θ. We assume they are known and can be learned from the training data. Instead of using Z, we could use a binary vector S_{it} = (S_{it1}, S_{it2}, ..., S_{itK_i})^T to represent the variable Z, such that S_{itk} = 1 when Z_{it} = k, and S_{itj} = 0 for all j ≠ k. Then we are interested in inferring the states S_{it} instead of inferring X_{it} directly, since X_{it} = S_{it}^T μ_i. Therefore we want to make inference over the posterior distribution

    P(S, U, σ²|Y, θ) ∝ p(Y|S, U, σ²) P(S|θ) p(U) p(σ²),

where the HMM defines the prior of the states

    P(S|θ) ∝ Π_{i=1}^I Π_{k=1}^{K_i} π_{ik}^{S_{i1k}} × Π_{i=1}^I Π_{t=2}^T Π_{k,j} ( p^{(i)}_{kj} )^{S_{itk} S_{i,t−1,j}},

the inverse noise variance is assumed to follow a Gamma distribution p(σ^{−2}) ∝ (σ^{−2})^{α−1} exp{−β σ^{−2}}, and the data likelihood has the Gaussian form

    p(Y|S, U, σ², θ) = |2πσ²|^{−T/2} exp{ −(1/(2σ²)) Σ_{t=1}^T ( Y_t − Σ_{i=1}^I S_{it}^T μ_i − U_t )² }.

To make the MAP inference over S, we relax the binary variable S_{itk} to be continuous in the range [0, 1], as in [15, 26]. It has been shown that incorporating domain knowledge into the AFHMM can help to reduce the identifiability problem [15, 20, 26].
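The AFHMM generative process can be sketched in a few lines. This is a toy illustration with invented parameters (I = 2 appliances, K = 2 states, and the unknown component U omitted), not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy AFHMM: 2 appliances, 2 states each, T steps (all parameters invented).
T = 100
pi = [np.array([0.8, 0.2]), np.array([0.9, 0.1])]        # initial probabilities
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),                 # transition matrices P^(i)
     np.array([[0.95, 0.05], [0.3, 0.7]])]
mu = [np.array([0.0, 1.5]), np.array([0.0, 0.6])]        # per-state means mu_i
sigma = 0.05

def sample_chain(pi_i, P_i):
    z = [rng.choice(2, p=pi_i)]
    for _ in range(T - 1):
        z.append(rng.choice(2, p=P_i[z[-1]]))
    return np.array(z)

# One-hot state matrices S_i (T x K); component signals X_it = S_it^T mu_i
Z = [sample_chain(pi[i], P[i]) for i in range(2)]
S = [np.eye(2)[z] for z in Z]
X = [S[i] @ mu[i] for i in range(2)]

# Aggregate observation Y_t = sum_i X_it + noise
Y = sum(X) + rng.normal(0.0, sigma, size=T)
print(Y.shape)
```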
The domain knowledge we will incorporate using LBM is the summary statistics.

5.2 Population modelling of summary statistics

In energy disaggregation, it is useful to provide summaries of energy consumption to the users. For example, it would be useful to show the householders the total energy they had consumed in one day for their appliances, the duration that each appliance was in use, and the number of times that they had used these appliances. Since there already exist data about typical usage of different appliances [4], we can employ these data to model the distributions of those summary statistics. We denote those desired statistics by τ = {τ_i}_{i=1}^I, where i indexes the appliances. For appliance i, we assume we have measured some time series from different houses for many days. This is always possible because we can collect them from public data sets, e.g., the data reviewed in [4]. We can then empirically obtain the distributions of those statistics. The distribution is represented by p_m(τ_{im}|Γ_{im}, η_{im}), where Γ_{im} represents the empirical quantities of the statistic m of appliance i, which can be obtained from data, and η_{im} are latent variables, which might not be known. Since η_{im} are variables, we can employ a prior distribution p(η_{im}).

We now give some examples of those statistics. Total energy consumption: The total energy consumption of an appliance can be represented as a function of the states of the HMM, such that τ_i = Σ_{t=1}^T S_{it}^T μ_i. Duration of appliance usage: The duration of using appliance i can also be represented as a function of the states, τ_i = Δt Σ_{t=1}^T Σ_{k=2}^{K_i} S_{itk}, where Δt represents the sampling duration for a data point of the appliances, and we assume that S_{it1} represents the off state, which means the appliance was turned off. Number of cycles: The number of cycles (the number of times an appliance is used) can be counted by computing the number of alternations from the OFF state to an ON state, such that τ_i = Σ_{t=2}^T Σ_{k=2}^{K_i} I(S_{itk} = 1, S_{i,t−1,1} = 1).

Let the binary vector ξ_i = (ξ_{i1}, ξ_{i2}, ..., ξ_{ic}, ..., ξ_{iC_i}) represent the number of cycles, where ξ_{ic} = 1 means that appliance i had been used for c cycles, and Σ_{c=1}^{C_i} ξ_{ic} = 1. (Note ξ_i is an example of η_i in this case.) To model these statistics in our LBM framework, the latent variable that we use is the number of cycles ξ. The distributions of τ_i could be empirically modelled by using the observation data. One approach is to assume a Gaussian mixture density, such that p(τ_i|ξ_i) = Σ_{c=1}^{C_i} p(ξ_{ic} = 1) p_c(τ_i|Γ_i), where Σ_{c=1}^{C_i} p(ξ_{ic} = 1) = 1 and p_c is the Gaussian component density. Using the Gaussian mixture, we basically assume that, for an appliance, given the number of cycles, the total energy consumption is modelled by a Gaussian with mean μ_{ic} and variance σ²_{ic}. A simpler model would be a linear regression model, such that τ_i = Σ_{c=1}^{C_i} ξ_{ic} μ_{ic} + ε_i, where ε_i ∼ N(0, σ²_i). This model assumes that, given the number of cycles, the total energy consumption is close to the mean μ_{ic}. The mixture model is more appropriate than the regression model, but the inference is more difficult.

When τ_i represents the number of cycles for appliance i, we can use τ_i = Σ_{c=1}^{C_i} c_{ic} ξ_{ic}, where c_{ic} represents the number of cycles. When the state variables S_i are relaxed to [0, 1], we can then employ a noise model, such that τ_i = Σ_{c=1}^{C_i} c_{ic} ξ_{ic} + ε_i, where ε_i ∼ N(0, σ²_i). We model ξ_i with a discrete distribution, such that P(ξ_i) = Π_{c=1}^{C_i} p_{ic}^{ξ_{ic}}, where p_{ic} represents the prior probability of the number of cycles for appliance i, which can be obtained from the training data. We now show how to use the LBM to integrate the AFHMM with these population distributions.

6 The latent Bayesian melding approach to energy disaggregation

We have shown that the summary statistics τ can be represented as a deterministic function of the state variables of the HMMs, S, such that τ = f(S), which means that τ itself can be represented as a latent variable model. We could then straightforwardly employ the LBM to produce a joint prior over S and ξ, such that p̃_{S,ξ}(S, ξ) = c_α p_S(S) ( p_τ(f(S)|ξ) p(ξ) / p*_τ(f(S)) )^{1−α}. Since in our model f is not invertible, we need to generate a proper density for p*_τ. One possible way is to generate N random samples {S^{(n)}}_{n=1}^N from the prior p_S(S), which is an HMM, and then p*_τ can be modelled by using kernel density estimation. However, this will make the inference difficult. Instead, we employ a Gaussian density p*_{τ_{im}}(τ_{im}) = N(μ̂_{im}, σ̂²_{im}), where μ̂_{im} and σ̂²_{im} are computed from {S^{(n)}}_{n=1}^N. The new posterior distribution of LBM thus has the form

    p(S, U, Σ|Y, θ) ∝ p(Σ) p(U) p̃_{S,ξ}(S, ξ) p(Y|S, U, σ²)
                    = p(Σ) p(U) c_α p_S(S) ( p_τ(f(S)|ξ) p(ξ) / p*_τ(f(S)) )^{1−α} p(Y|S, U, σ²),

where Σ represents the collection of all the noise variances. All the inverse noise variances employ the Gamma distribution as the prior.
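The three summary statistics of Section 5.2 are simple functions of the one-hot state matrix. A small sketch with an invented state sequence (state 1, stored in column 0, is the OFF state; units and parameters are made up):

```python
import numpy as np

# Invented example: one appliance, K = 2 states, column 0 is the OFF state.
z = np.array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1])    # 0 = OFF, 1 = ON
S_i = np.eye(2)[z]                               # one-hot state matrix, T x K
mu_i = np.array([0.0, 1.5])                      # per-state mean consumption
dt = 2.0 / 60.0                                  # 2-minute sampling period, in hours

# Total energy: tau_i = sum_t S_it^T mu_i
total_energy = float((S_i @ mu_i).sum())

# Duration of usage: tau_i = dt * sum_t sum_{k>=2} S_itk
duration = float(dt * S_i[:, 1:].sum())

# Number of cycles: count OFF -> ON transitions
# (OFF at t-1, i.e. column 0 set, and any ON column set at t)
cycles = int((S_i[:-1, 0] * S_i[1:, 1:].sum(axis=1)).sum())

print(total_energy, duration, cycles)
```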
We are interested in inferring the MAP values. Since the variables S and ξ are binary, we would have to solve a combinatorial optimization problem, which is intractable, so we solve a relaxed problem, as in [15, 26]. Since log p_S(S) is not convex, we employ the relaxation method of [15]. A new K_i × K_i variable matrix H^{it} = (h^{it}_{jk}) is introduced, such that h^{it}_{jk} = 1 when S_{i,t−1,k} = 1 and S_{itj} = 1, and otherwise h^{it}_{jk} = 0. Under these constraints, we then obtain log p_S(S) = log p(S, H) = Σ_{i=1}^I S_{i1}^T log π_i + Σ_{i,t,k,j} h^{it}_{jk} log p^{(i)}_{jk}; this is now linear. We optimize the log-posterior, which is denoted by L(S, H, U, Σ, ξ). The constraints for those variables are represented as the sets

    Q_S = { Σ_{k=1}^{K_i} S_{itk} = 1, S_{itk} ∈ [0, 1], ∀i, t },
    Q_ξ = { Σ_{c=1}^{C_i} ξ_{ic} = 1, ξ_{ic} ∈ [0, 1], ∀i },
    Q_{H,S} = { Σ_{l=1}^{K_i} H^{it}_{·l} = S_{it}, Σ_{l=1}^{K_i} H^{it}_{l·} = S^T_{i,t−1}, h^{it}_{jk} ∈ [0, 1], ∀i, t },
    Q_{U,Σ} = { U ≥ 0, Σ ≥ 0, σ²_{im} < σ̂²_{im}, ∀i, m }.

Denote Q = Q_S ∪ Q_ξ ∪ Q_{H,S} ∪ Q_{U,Σ}. The relaxed optimization problem is then

    maximize_{S,H,U,Σ,ξ} L(S, H, U, Σ, ξ)  subject to  Q.

We observed that every term in L is either quadratic or linear when Σ is fixed, and the solutions for Σ are deterministic when the other variables are fixed. The constraints are all linear. Therefore, we optimize Σ while fixing all the other variables, and then optimize all the other variables simultaneously while fixing Σ. This optimization problem is then a convex quadratic program (CQP), for which we use MOSEK [2].
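The H-matrix device can be illustrated directly: for a binary state path, setting h^{it}_{jk} = S_{itj} S_{i,t−1,k} makes the transition term linear in h while reproducing the HMM transition log-probability exactly. A toy check (invented 2-state chain, my own sketch rather than the paper's code):

```python
import numpy as np

# Invented 2-state transition matrix, p_jk = P(z_t = j | z_{t-1} = k)
logP = np.log(np.array([[0.9, 0.2],
                        [0.1, 0.8]]))

z = [0, 0, 1, 1, 0]                 # a binary state path
S = np.eye(2)[z]                    # one-hot, T x 2

# Direct HMM transition log-probability
direct = sum(logP[z[t], z[t - 1]] for t in range(1, len(z)))

# Linearized form: H_t = S_t S_{t-1}^T (outer product), then sum H_t * logP
linear = 0.0
for t in range(1, len(z)):
    H_t = np.outer(S[t], S[t - 1])  # h_jk = 1 iff S_{t,j} = 1 and S_{t-1,k} = 1
    linear += float((H_t * logP).sum())

print(direct, linear)               # equal
```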
We denote this method by AFHMM+LBM.

7 Experimental results

We have incorporated population information into the AFHMM by employing the latent Bayesian melding approach. In this section, we apply the proposed model to the disaggregation problem. We will compare the new approach with AFHMM+PR [26] using the set of statistics τ described in Section 5.2. The key difference between our method AFHMM+LBM and AFHMM+PR is that AFHMM+LBM models the statistics τ conditional on the number of cycles ξ.

7.1 The HES data

We apply AFHMM, AFHMM+PR and AFHMM+LBM to the Household Electricity Survey (HES) data1. This data set was gathered in a recent study commissioned by the UK Department of Food and Rural Affairs. The study monitored 251 households, selected to be representative of the population, across England from May 2010 to July 2011 [27]. Individual appliances were monitored, and in some households the overall electricity consumption was also monitored. The data were monitored every 2 or 10 minutes for different houses. We used only the 2-minute data. We then used the individual appliances to train the model parameters θ of the AFHMM, which will be used as the input to the models for disaggregation. Note that we assumed the HMMs have 3 states for all the appliances. This number of states is widely applied in energy disaggregation problems, though our method could easily be applied to larger state spaces. In the HES data, in some houses the overall electricity consumption (the mains) was monitored. However, in most houses, only a subset of individual appliances were monitored, and the total electricity readings were not recorded.

1The HES dataset and information on how the raw data was cleaned can be found at https://www.gov.uk/government/publications/household-electricity-survey.

Table 1: Normalized disaggregation error (NDE), signal aggregate error (SAE), duration aggregate error (DAE), and cycle aggregate error (CAE) by AFHMM, AFHMM+PR and AFHMM+LBM on synthetic mains in HES data.

    METHODS     NDE         SAE         DAE         CAE         TIME (S)
    AFHMM       1.45±0.88   1.42±0.39   1.56±0.23   1.41±0.31   179.3±1.9
    AFHMM+PR    0.87±0.21   0.86±0.39   0.83±0.53   1.57±0.66   195.4±3.2
    AFHMM+LBM   0.89±0.49   0.87±0.37   0.76±0.32   0.79±0.35   198.1±3.1

Table 2: Normalized disaggregation error (NDE), signal aggregate error (SAE), duration aggregate error (DAE), and cycle aggregate error (CAE) by AFHMM, AFHMM+PR and AFHMM+LBM on mains in HES data.

    METHODS     NDE         SAE         DAE         CAE         TIME (S)
    AFHMM       1.90±1.16   2.26±0.86   1.91±0.67   1.12±0.17   170.8±33.3
    AFHMM+PR    0.91±0.11   0.67±0.07   0.68±0.18   1.65±0.49   214.2±38.1
    AFHMM+LBM   0.77±0.23   0.68±0.19   0.61±0.22   0.98±0.32   224.8±34.8

Generating the population information: Most of the houses in HES did not monitor the mains readings. They all recorded the individual appliances' consumption. We used a subset of the houses to generate the population information of the individual appliances. We used the population information of total energy consumption, duration of appliance usage and the number of cycles in a time period. In our experiments, the time period was one day. We modelled the distributions of these summary statistics by using the methods described in Section 5.2, where the distributions were Gaussian. All the required quantities for modelling these distributions were generated by using the samples of the individual appliances.

Houses without mains readings: In this experiment, we randomly selected one hundred households, and one day's usage was used as test data for each household. Since no mains readings were monitored in these houses, we added up the appliance readings to generate synthetic mains readings.
We then applied the AFHMM, AFHMM+PR and AFHMM+LBM to these synthetic mains to predict the individual appliance usage. To compare these three methods, we employed four error measures. Denote x̂_i as the inferred signal for the appliance usage x_i. One measure is the normalized disaggregation error (NDE):

    NDE = Σ_{it} (x_{it} − x̂_{it})² / Σ_{it} x²_{it}.

This measures how well the method predicts the energy consumption at every time point. However, the householders might be more interested in summaries of the appliance usage. For example, in a particular time period, e.g., one day, people are interested in the total energy consumption of the appliances, the total time they have been using those appliances, and how many times they have used them. We thus employ

    (1/I) Σ_{i=1}^I |r̂_i − r_i| / Σ_i r_i

as the signal aggregate error (SAE), the duration aggregate error (DAE) or the cycle aggregate error (CAE), where r_i represents the total energy consumption, the duration or the number of cycles, respectively, and r̂_i represents the predicted summary statistic.

All the methods were applied to the synthetic data. Table 1 shows the overall error computed by these methods. We see that both of the methods using prior information improved over the baseline method AFHMM. AFHMM+PR and AFHMM+LBM performed similarly in terms of NDE and SAE, but AFHMM+LBM improved over AFHMM+PR in terms of DAE (8%) and CAE (50%).

Houses with mains readings: We also applied these methods to 6 houses which have mains readings. We used 10 days of data for each house, and the recorded mains readings were used as the input to the models. All the methods were used to predict the appliance consumption.
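The error measures can be sketched as functions of the true and inferred component signals. This is my own implementation of the formulas above (invented toy signals), not the authors' evaluation code:

```python
import numpy as np

def nde(x, x_hat):
    """Normalized disaggregation error over appliances i and times t."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return ((x - x_hat) ** 2).sum() / (x ** 2).sum()

def aggregate_error(r, r_hat):
    """Shared form of SAE/DAE/CAE: r_i is the per-appliance summary
    statistic (total energy, duration, or cycle count)."""
    r, r_hat = np.asarray(r, float), np.asarray(r_hat, float)
    return np.abs(r_hat - r).sum() / (len(r) * r.sum())

# Invented toy signals: 2 appliances x 4 time steps
x = np.array([[1.0, 0.0, 1.0, 1.0],
              [0.0, 2.0, 2.0, 0.0]])
x_hat = np.array([[1.0, 0.0, 0.0, 1.0],
                  [0.0, 2.0, 2.0, 1.0]])
print(nde(x, x_hat))
print(aggregate_error(x.sum(axis=1), x_hat.sum(axis=1)))   # SAE on total energy
```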
Table 2 shows the error of each house and also the overall errors. This experiment is more realistic than the one with synthetic mains, since real mains readings were used as the input. We see that both methods incorporating prior information improved over the AFHMM in terms of NDE, SAE and DAE. AFHMM+PR and AFHMM+LBM have similar results for SAE. AFHMM+LBM improved over AFHMM+PR for NDE (15%), DAE (10%) and CAE (40%).

Table 3: Normalized disaggregation error (NDE), signal aggregate error (SAE), duration aggregate error (DAE), and cycle aggregate error (CAE) by AFHMM, AFHMM+PR and AFHMM+LBM on UK-DALE data.

METHODS        NDE          SAE          DAE          CAE          TIME (S)
AFHMM          1.57±1.16    1.99±0.52    2.81±0.79    1.37±0.28    118.6±23.1
AFHMM+PR       0.83±0.27    0.82±0.38    1.68±1.21    1.90±0.52    120.4±25.3
AFHMM+LBM      0.84±0.25    0.89±0.38    0.49±0.33    0.59±0.21    123.1±25.8

7.2 UK-DALE data

In the previous section we trained the model using the HES data and applied it to different houses from the same data set. A more realistic scenario is to train the model on one data set and apply it to another, because it is unrealistic to expect appliance-level data from every household on which the system will be deployed. In this section, we use the HES data to train the model parameters of the AFHMM and to model the distributions of the summary statistics. We then apply the models to the UK-DALE dataset [13], which was also gathered from UK households. There are five houses in UK-DALE, all of which have mains readings as well as individual appliance readings.
All the mains meters were sampled every 6 seconds, and some were also sampled at a higher rate; details of the data and how to access it can be found in [13]. We employed three of the houses for analysis in our experiments (houses 1, 2 & 5 in the data). The other two houses were excluded because the correlation between the sum of the submeters and the mains is very low, which suggests that there might be recording errors in the meters. We selected 7 appliances for disaggregation, based on those that typically use the most energy. Since the sample rate of the submeters in the HES data is 2 minutes, we downsampled the UK-DALE signals from 6 seconds to 2 minutes. For each house, we randomly selected a month for analysis. All the methods were applied to the mains readings. For comparison purposes, we computed the NDE, SAE, DAE and CAE errors of all three methods, averaged over 30 days. Table 3 shows the results, which are consistent with those on the HES data. Both AFHMM+PR and AFHMM+LBM improve over the basic AFHMM, except that AFHMM+PR did not improve the CAE. As on the HES test data, AFHMM+PR and AFHMM+LBM have similar results on NDE and SAE, and AFHMM+LBM again improved over AFHMM+PR in DAE (70%) and CAE (68%). These results consistently suggest that incorporating population information into the model can help to reduce the identifiability problem in single-channel BSS problems.

8 Conclusions

We have proposed a latent Bayesian melding approach for incorporating population information with latent variables into individual models, and have applied the approach to energy disaggregation problems. The new approach has been evaluated on two real-world electricity data sets. The latent Bayesian melding approach has been compared to the posterior regularization approach (a case of the Bayesian melding approach) and to the AFHMM.
Both LBM and PR have significantly lower error than the baseline method. LBM improves over PR in predicting the duration and the number of cycles, while both methods give similar NDE and SAE errors.

Acknowledgments

This work is supported by the Engineering and Physical Sciences Research Council, UK (grant numbers EP/K002732/1 and EP/M008223/1).

References

[1] Leontine Alkema, Adrian E. Raftery, and Samuel J. Clark. Probabilistic projections of HIV prevalence using Bayesian melding. The Annals of Applied Statistics, pages 229–248, 2007.
[2] MOSEK ApS. The MOSEK optimization toolbox for Python manual. Version 7.1 (Revision 28), 2015.
[3] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[4] N. Batra et al. NILMTK: An open source toolkit for non-intrusive load monitoring. In Proceedings of the 5th International Conference on Future Energy Systems, pages 265–276, New York, NY, USA, 2014.
[5] Robert F. Bordley. A multiplicative formula for aggregating probability assessments. Management Science, 28(10):1137–1148, 1982.
[6] Grace S. Chiu and Joshua M. Gould. Statistical inference for food webs with emphasis on ecological networks via Bayesian melding. Environmetrics, 21(7-8):728–740, 2010.
[7] Luiz Max F. de Carvalho, Daniel A. M. Villela, Flavio Coelho, and Leonardo S. Bastos. On the choice of the weights for the logarithmic pooling of probability distributions. Preprint, September 24, 2015.
[8] E. Elhamifar and S. Sastry. Energy disaggregation via learning powerlets and sparse coding. In Proceedings of the Twenty-Ninth Conference on Artificial Intelligence (AAAI), pages 629–635, 2015.
[9] K. Ganchev, J. Graça, J. Gillenwater, and B. Taskar. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11:2001–2049, 2010.
[10] A. Giffin and A.
Caticha. Updating probabilities with data and moments. In The 27th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, NY, July 8–13, 2007.
[11] G. W. Hart. Nonintrusive appliance load monitoring. Proceedings of the IEEE, 80(12):1870–1891, 1992.
[12] Edwin T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.
[13] Jack Kelly and William Knottenbelt. The UK-DALE dataset, domestic appliance-level electricity demand and whole-house demand from five UK homes. Scientific Data, 2(150007), 2015.
[14] H. Kim, M. Marwah, M. Arlitt, G. Lyon, and J. Han. Unsupervised disaggregation of low frequency power measurements. In Proceedings of the SIAM Conference on Data Mining, pages 747–758, 2011.
[15] J. Z. Kolter and T. Jaakkola. Approximate inference in additive factorial HMMs with application to energy disaggregation. In Proceedings of AISTATS, volume 22, pages 1472–1482, 2012.
[16] P. Liang, M. I. Jordan, and D. Klein. Learning from measurements in exponential families. In The 26th Annual International Conference on Machine Learning, pages 641–648, 2009.
[17] James G. MacKinnon and Anthony A. Smith. Approximate bias correction in econometrics. Journal of Econometrics, 85(2):205–230, 1998.
[18] G. Mann and A. McCallum. Generalized expectation criteria for semi-supervised learning of conditional random fields. In Proceedings of ACL, pages 870–878, Columbus, Ohio, June 2008.
[19] Keith Myerscough, Jason Frank, and Benedict Leimkuhler. Least-biased correction of extended dynamical systems using observational data. arXiv preprint arXiv:1411.6011, 2014.
[20] O. Parson, S. Ghosh, M. Weal, and A. Rogers. Non-intrusive load monitoring using prior models of general appliance types. In Proceedings of AAAI, pages 356–362, July 2012.
[21] David Poole and Adrian E. Raftery.
Inference for deterministic simulation models: The Bayesian melding approach. Journal of the American Statistical Association, pages 1244–1255, 2000.
[22] M. J. Rufo, J. Martín, C. J. Pérez, et al. Log-linear pool to combine prior distributions: A suggestion for a calibration-based approach. Bayesian Analysis, 7(2):411–438, 2012.
[23] H. Ševčíková, A. Raftery, and P. Waddell. Uncertain benefits: Application of Bayesian melding to the Alaskan Way Viaduct in Seattle. Transportation Research Part A: Policy and Practice, 45:540–553, 2011.
[24] Robert L. Wolpert. Comment on "Inference from a deterministic population dynamics model for bowhead whales". Journal of the American Statistical Association, 90(430):426–427, 1995.
[25] M. Wytock and J. Zico Kolter. Contextually supervised source separation with application to energy disaggregation. In Proceedings of AAAI, pages 486–492, 2014.
[26] M. Zhong, N. Goddard, and C. Sutton. Signal aggregate constraints in additive factorial HMMs, with application to energy disaggregation. In NIPS, pages 3590–3598, 2014.
[27] J.-P. Zimmermann, M. Evans, J. Griggs, N. King, L. Harding, P. Roberts, and C. Evans. Household electricity survey, 2012.