{"title": "Energy Disaggregation via Discriminative Sparse Coding", "book": "Advances in Neural Information Processing Systems", "page_first": 1153, "page_last": 1161, "abstract": "Energy disaggregation is the task of taking a whole-home energy signal and separating it into its component appliances. Studies have shown that having device-level energy information can cause users to conserve significant amounts of energy, but current electricity meters only report whole-home data. Thus, developing algorithmic methods for disaggregation presents a key technical challenge in the effort to maximize energy conservation. In this paper, we examine a large scale energy disaggregation task, and apply a novel extension of sparse coding to this problem. In particular, we develop a method, based upon structured prediction, for discriminatively training sparse coding algorithms specifically to maximize disaggregation performance. We show that this significantly improves the performance of sparse coding algorithms on the energy task and illustrate how these disaggregation results can provide useful information about energy usage.", "full_text": "Energy Disaggregation via Discriminative\n\nSparse Coding\n\nJ. Zico Kolter\n\nComputer Science and\n\nArti\ufb01cial Intelligence Laboratory\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\nkolter@csail.mit.edu\n\nSiddarth Batra, Andrew Y. Ng\nComputer Science Department\n\nStanford University\nStanford, CA 94305\n\n{sidbatra,ang}@cs.stanford.edu\n\nAbstract\n\nEnergy disaggregation is the task of taking a whole-home energy signal and sep-\narating it into its component appliances. Studies have shown that having device-\nlevel energy information can cause users to conserve signi\ufb01cant amounts of en-\nergy, but current electricity meters only report whole-home data. Thus, developing\nalgorithmic methods for disaggregation presents a key technical challenge in the\neffort to maximize energy conservation. 
In this paper, we examine a large scale energy disaggregation task, and apply a novel extension of sparse coding to this problem. In particular, we develop a method, based upon structured prediction, for discriminatively training sparse coding algorithms specifically to maximize disaggregation performance. We show that this significantly improves the performance of sparse coding algorithms on the energy task and illustrate how these disaggregation results can provide useful information about energy usage.

1 Introduction

Energy issues present one of the largest challenges facing our society. The world currently consumes an average of 16 terawatts of power, 86% of which comes from fossil fuels [28]; without any effort to curb energy consumption or use different sources of energy, most climate models predict that the earth's temperature will increase by at least 5 degrees Fahrenheit in the next 90 years [1], a change that could cause ecological disasters on a global scale. While there are of course numerous facets to the energy problem, there is a growing consensus that many energy and sustainability problems are fundamentally informatics problems, areas where machine learning can play a significant role.

This paper looks specifically at the task of energy disaggregation, an informatics task relating to energy efficiency. Energy disaggregation, also called non-intrusive load monitoring [11], involves taking an aggregated energy signal, for example the total power consumption of a house as read by an electricity meter, and separating it into the different electrical appliances being used. 
Numerous studies have shown that receiving information about one's energy usage can automatically induce energy-conserving behaviors [6, 19], and these studies also clearly indicate that receiving appliance-specific information leads to much larger gains than whole-home data alone ([19] estimates that appliance-level data could reduce consumption by an average of 12% in the residential sector). In the United States, electricity constitutes 38% of all energy used, and residential and commercial buildings together use 75% of this electricity [28]; thus, this 12% figure accounts for a sizable amount of energy that could potentially be saved. However, the widely-available sensors that provide electricity consumption information, namely the so-called "Smart Meters" that are already becoming ubiquitous, collect energy information only at the whole-home level and at a very low resolution (typically every hour or 15 minutes). Thus, energy disaggregation methods that can take this whole-home data and use it to predict individual appliance usage present an algorithmic challenge where advances can have a significant impact on large-scale energy efficiency issues.

Energy disaggregation methods do have a long history in the engineering community, including some which have applied machine learning techniques: early algorithms [11, 26] typically looked for "edges" in the power signal to indicate whether a known device was turned on or off; later work focused on computing harmonics of steady-state power or current draw to determine more complex device signatures [16, 14, 25, 2]; recently, researchers have analyzed the transient noise of an electrical circuit that occurs when a device changes state [15, 21]. 
However, these and all other studies\nwe are aware of were either conducted in arti\ufb01cial laboratory environments, contained a relatively\nsmall number of devices, trained and tested on the same set of devices in a house, and/or used cus-\ntom hardware for very high frequency electrical monitoring with an algorithmic focus on \u201cevent\ndetection\u201d (detecting when different appliances were turned on and off). In contrast, in this paper\nwe focus on disaggregating electricity using low-resolution, hourly data of the type that is readily\navailable via smart meters (but where most single-device \u201cevents\u201d are not apparent); we speci\ufb01cally\nlook at the generalization ability of our algorithms for devices and homes unseen at training time;\nand we consider a data set that is substantially larger than those previously considered, with 590\nhomes, 10,165 unique devices, and energy usage spanning a time period of over two years.\n\nThe algorithmic approach we present in this paper builds upon sparse coding methods and recent\nwork in single-channel source separation [24, 23, 22]. Speci\ufb01cally, we use a sparse coding algorithm\nto learn a model of each device\u2019s power consumption over a typical week, then combine these\nlearned models to predict the power consumption of different devices in previously unseen homes,\nusing their aggregate signal alone. While energy disaggregation can naturally be formulated as such\na single-channel source separation problem, we know of no previous application of these methods\nto the energy disaggregation task. 
Indeed, the most common application of such algorithms is audio signal separation, which typically has very high temporal resolution; thus, the low-resolution energy disaggregation task we consider here poses a new set of challenges for such methods, and existing approaches alone perform quite poorly.

As a second major contribution of the paper, we develop a novel approach for discriminatively training sparse coding dictionaries for disaggregation tasks, and show that this significantly improves performance on our energy domain. Specifically, we formulate the task of maximizing disaggregation performance as a structured prediction problem, which leads to a simple and effective algorithm for discriminatively training such sparse representations for disaggregation tasks. The algorithm is similar in spirit to a number of recent approaches to discriminative training of sparse representations [12, 17, 18]. However, these past works were interested in discriminatively training sparse coding representations specifically for classification tasks, whereas we focus here on discriminatively training the representation for disaggregation tasks, which naturally leads to substantially different algorithmic approaches.

2 Discriminative Disaggregation via Sparse Coding

We begin by reviewing sparse coding methods and their application to disaggregation tasks. For concreteness we use the terminology of our energy disaggregation domain throughout this description, but the algorithms can apply equally to other domains. Formally, assume we are given k different classes, which in our setting correspond to device categories such as televisions, refrigerators, heaters, etc. For every i = 1, . . . , k, we have a matrix Xi ∈ R^{T×m} where each column of Xi contains a week of energy usage (measured every hour) for a particular house and for this particular type of device. 
Thus, for example, the jth column of X1, which we denote x(j)_1, may contain weekly energy consumption for a refrigerator (for a single week in a single house) and x(j)_2 could contain weekly energy consumption of a heater (for this same week in the same house). We denote the aggregate power consumption over all device types as $\bar{X} \equiv \sum_{i=1}^{k} X_i$, so that the jth column of X̄, x̄(j), contains a week of aggregated energy consumption for all devices in a given house. At training time, we assume we have access to the individual device energy readings X1, . . . , Xk (obtained for example from plug-level monitors in a small number of instrumented homes). At test time, however, we assume that we have access only to the aggregate signal of a new set of data points X̄′ (as would be reported by a smart meter), and the goal is to separate this signal into its components, X′1, . . . , X′k.

The sparse coding approach to source separation (e.g., [24, 23]), which forms the basis for our disaggregation approach, is to train separate models for each individual class Xi, then use these models to separate an aggregate signal. Formally, sparse coding models the ith data matrix using the approximation Xi ≈ BiAi, where the columns of Bi ∈ R^{T×n} contain a set of n basis functions, also called the dictionary, and the columns of Ai ∈ R^{n×m} contain the activations of these basis functions [20]. Sparse coding additionally imposes the constraint that the activations Ai be sparse, i.e., that they contain mostly zero entries, which allows us to learn overcomplete representations of the data (more basis functions than the dimensionality of the data). 
A common approach for achieving this sparsity is to add an ℓ1 regularization penalty to the activations.

Since energy usage is an inherently non-negative quantity, we impose the further constraint that the activations and bases be non-negative, an extension known as non-negative sparse coding [13, 7]. Specifically, in this paper we will consider the non-negative sparse coding objective

$$\min_{A_i \geq 0,\, B_i \geq 0} \; \frac{1}{2}\|X_i - B_i A_i\|_F^2 + \lambda \sum_{p,q} (A_i)_{pq} \quad \text{subject to} \quad \|b_i^{(j)}\|_2 \leq 1, \; j = 1, \ldots, n \qquad (1)$$

where Xi, Ai, and Bi are defined as above, λ ∈ R+ is a regularization parameter, $\|Y\|_F \equiv (\sum_{p,q} Y_{pq}^2)^{1/2}$ is the Frobenius norm, and $\|y\|_2 \equiv (\sum_p y_p^2)^{1/2}$ is the ℓ2 norm. This optimization problem is not jointly convex in Ai and Bi, but it is convex in each optimization variable when holding the other fixed, so a common strategy for optimizing (1) is to alternate between minimizing the objective over Ai and Bi.

After using the above procedure to find representations Ai and Bi for each of the classes i = 1, . . . , k, we can disaggregate a new aggregate signal X̄ ∈ R^{T×m′} (without providing the algorithm its individual components), using the following procedure (used by, e.g., [23], amongst others). We concatenate the bases to form a single joint set of basis functions and solve the optimization problem

$$\hat{A}_{1:k} = \arg\min_{A_{1:k} \geq 0} \; \frac{1}{2}\left\| \bar{X} - [B_1 \cdots B_k] \begin{bmatrix} A_1 \\ \vdots \\ A_k \end{bmatrix} \right\|_F^2 + \lambda \sum_{i,p,q} (A_i)_{pq} \;\equiv\; \arg\min_{A_{1:k} \geq 0} F(\bar{X}, B_{1:k}, A_{1:k}) \qquad (2)$$

where for ease of notation we use A1:k as shorthand for A1, . . . , Ak, and we abbreviate the optimization objective as F(X̄, B1:k, A1:k). 
We then predict the ith component of the signal to be

$$\hat{X}_i = B_i \hat{A}_i. \qquad (3)$$

The intuition behind this approach is that if Bi is trained to reconstruct the ith class with small activations, then it should be better at reconstructing the ith portion of the aggregate signal (i.e., require smaller activations) than all other bases Bj for j ≠ i. We can evaluate the quality of the resulting disaggregation by what we refer to as the disaggregation error,

$$E(X_{1:k}, B_{1:k}) \equiv \sum_{i=1}^{k} \frac{1}{2}\|X_i - B_i \hat{A}_i\|_F^2 \quad \text{subject to} \quad \hat{A}_{1:k} = \arg\min_{A_{1:k} \geq 0} F\!\left(\sum_{i=1}^{k} X_i, \, B_{1:k}, A_{1:k}\right), \qquad (4)$$

which quantifies how accurately we reconstruct each individual class when using the activations obtained only via the aggregated signal.

2.1 Structured Prediction for Discriminative Disaggregation Sparse Coding

An issue with using sparse coding alone for disaggregation tasks is that the bases are not trained to minimize the disaggregation error. Instead, the method relies on the hope that learning basis functions for each class individually will produce bases that are distinct enough to also produce small disaggregation error. Furthermore, it is very difficult to optimize the disaggregation error directly over B1:k, due to the non-differentiability (and discontinuity) of the argmin operator with a non-negativity constraint. One could imagine an alternating procedure where we iteratively optimize over B1:k, ignoring the dependence of Â1:k on B1:k, then re-solve for the activations Â1:k; but ignoring how Â1:k depends on B1:k loses much of the problem's structure and this approach performs very poorly in practice. 
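The pipeline in (1)–(4) can be made concrete in a few lines of NumPy. The sketch below is our own illustration, not the authors' implementation: it solves the non-negative ℓ1-regularized activation problems by simple projected gradient descent (the paper itself uses coordinate descent, discussed in Section 2.3), and all function names are ours.

```python
import numpy as np

def nn_lasso(X, B, lam, iters=500):
    """Solve min_{A>=0} 0.5*||X - B A||_F^2 + lam*sum(A) by projected gradient."""
    A = np.zeros((B.shape[1], X.shape[1]))
    lr = 1.0 / (np.linalg.norm(B, 2) ** 2 + 1e-12)   # 1/L step, L = ||B||_2^2
    for _ in range(iters):
        G = B.T @ (B @ A - X) + lam                   # gradient + l1 subgradient
        A = np.maximum(0.0, A - lr * G)               # projection onto A >= 0
    return A

def disaggregate(Xbar, bases, lam):
    """Eqs. (2)-(3): joint activations over concatenated bases, then
    reconstruct each class with its own basis block."""
    A = nn_lasso(Xbar, np.hstack(bases), lam)
    preds, start = [], 0
    for Bi in bases:
        n = Bi.shape[1]
        preds.append(Bi @ A[start:start + n])
        start += n
    return preds

def disaggregation_error(Xs, bases, lam):
    """Eq. (4): per-class reconstruction error using aggregate-only activations."""
    preds = disaggregate(sum(Xs), bases, lam)
    return 0.5 * sum(np.sum((Xi - Pi) ** 2) for Xi, Pi in zip(Xs, preds))
```

With well-separated (here, orthogonal) bases and a small λ, the joint optimization assigns each portion of the aggregate signal to the correct class, which is exactly the intuition described above.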
Alternatively, other methods (though in a different context from disaggregation) have been proposed that use a differentiable objective function and implicit differentiation to explicitly model the derivative of the activations with respect to the basis functions [4]; however, this formulation loses some of the benefits of the standard sparse coding formulation, and computing these derivatives is a computationally expensive procedure.

Instead, we propose in this paper a method for optimizing disaggregation performance based upon structured prediction methods [27]. To describe our approach, we first define the regularized disaggregation error, which is simply the disaggregation error plus a regularization penalty on Â1:k,

$$E_{\text{reg}}(X_{1:k}, B_{1:k}) \equiv E(X_{1:k}, B_{1:k}) + \lambda \sum_{i,p,q} (\hat{A}_i)_{pq} \qquad (5)$$

where Â is defined as in (2). This criterion provides a better optimization objective for our algorithm, as we wish to obtain a sparse set of coefficients that can achieve low disaggregation error. Clearly, the best possible value of Âi for this objective function is given by

$$A_i^{\star} = \arg\min_{A_i \geq 0} \; \frac{1}{2}\|X_i - B_i A_i\|_F^2 + \lambda \sum_{p,q} (A_i)_{pq}, \qquad (6)$$

which is precisely the activations obtained after an iteration of sparse coding on the data matrix Xi. Motivated by this fact, the first intuition of our algorithm is that in order to minimize disaggregation error, we can discriminatively optimize the bases B1:k such that performing the optimization (2) produces activations that are as close to A⋆1:k as possible. Of course, changing the bases B1:k to optimize this criterion would also change the resulting optimal coefficients A⋆1:k. Thus, the second intuition of our method is that the bases used in the optimization (2) need not be the same as the bases used to reconstruct the signals. 
We define an augmented regularized disaggregation error objective

$$\tilde{E}_{\text{reg}}(X_{1:k}, B_{1:k}, \tilde{B}_{1:k}) \equiv \sum_{i=1}^{k} \left( \frac{1}{2}\|X_i - B_i \hat{A}_i\|_F^2 + \lambda \sum_{p,q} (\hat{A}_i)_{pq} \right) \quad \text{subject to} \quad \hat{A}_{1:k} = \arg\min_{A_{1:k} \geq 0} F\!\left(\sum_{i=1}^{k} X_i, \, \tilde{B}_{1:k}, A_{1:k}\right), \qquad (7)$$

where the B1:k bases (referred to as the reconstruction bases) are the same as those learned from sparse coding, while the B̃1:k bases (referred to as the disaggregation bases) are discriminatively optimized in order to move Â1:k closer to A⋆1:k, without changing these targets.

Discriminatively training the disaggregation bases B̃1:k is naturally framed as a structured prediction task: the input is X̄, the multi-variate desired output is A⋆1:k, the model parameters are B̃1:k, and the discriminant function is F(X̄, B̃1:k, A1:k).1 In other words, we seek bases B̃1:k such that (ideally)

$$A^{\star}_{1:k} = \arg\min_{A_{1:k} \geq 0} F(\bar{X}, \tilde{B}_{1:k}, A_{1:k}). \qquad (8)$$

While there are many potential methods for optimizing such a prediction task, we use a simple method based on the structured perceptron algorithm [5]. Given some value of the parameters B̃1:k, we first compute Â using (2). 
We then perform the perceptron update with a step size α,

$$\tilde{B}_{1:k} \leftarrow \tilde{B}_{1:k} - \alpha \left( \nabla_{\tilde{B}_{1:k}} F(\bar{X}, \tilde{B}_{1:k}, A^{\star}_{1:k}) - \nabla_{\tilde{B}_{1:k}} F(\bar{X}, \tilde{B}_{1:k}, \hat{A}_{1:k}) \right) \qquad (9)$$

or more explicitly, defining $\tilde{B} = [\tilde{B}_1 \cdots \tilde{B}_k]$, $A^{\star} = [A_1^{\star T} \cdots A_k^{\star T}]^T$ (and similarly for Â),

$$\tilde{B} \leftarrow \tilde{B} - \alpha \left( (\bar{X} - \tilde{B}\hat{A})\hat{A}^T - (\bar{X} - \tilde{B}A^{\star})A^{\star T} \right). \qquad (10)$$

To keep B̃1:k in a similar form to B1:k, we keep only the positive part of B̃1:k and we re-normalize each column to have unit norm. One item to note is that, unlike typical structured prediction where the discriminant is a linear function in the parameters (which guarantees convexity of the problem), here our discriminant is a quadratic function of the parameters, and so we no longer expect to necessarily reach a global optimum of the prediction problem; however, since sparse coding itself is a non-convex problem, this is not overly concerning for our setting. Our complete method for discriminative disaggregation sparse coding, which we call DDSC, is shown in Algorithm 1.

1The structured prediction task actually involves m examples (where m is the number of columns of X̄), and the goal is to output the desired activations (a⋆1:k)(j), for the jth example x̄(j). However, since the function F decomposes across the columns of X and A, the above notation is equivalent to the more explicit formulation.

Algorithm 1 Discriminative disaggregation sparse coding
Input: data points for each individual source Xi ∈ R^{T×m}, i = 1, . . . , k, regularization parameter λ ∈ R+, gradient step size α ∈ R+.

Sparse coding pre-training:
1. Initialize Bi and Ai with positive values and scale the columns of Bi such that ‖b(j)_i‖2 = 1.
2. For each i = 1, . . . , k, iterate until convergence:
   (a) Ai ← argmin_{A≥0} ‖Xi − BiA‖²_F + λ Σ_{p,q} A_pq
   (b) Bi ← argmin_{B≥0, ‖b(j)‖2≤1} ‖Xi − BAi‖²_F

Discriminative disaggregation training:
3. Set A⋆1:k ← A1:k, B̃1:k ← B1:k.
4. Iterate until convergence:
   (a) Â1:k ← argmin_{A1:k≥0} F(X̄, B̃1:k, A1:k)
   (b) B̃ ← [B̃ − α((X̄ − B̃Â)ÂT − (X̄ − B̃A⋆)(A⋆)T)]+
   (c) For all i, j: b̃(j)_i ← b̃(j)_i / ‖b̃(j)_i‖2.

Given aggregated test examples X̄′:
5. Â′1:k ← argmin_{A1:k≥0} F(X̄′, B̃1:k, A1:k)
6. Predict X̂′i = Bi Â′i.

2.2 Extensions

Although, as we show shortly, the discriminative training procedure has made the largest difference in terms of improving disaggregation performance in our domain, a number of other modifications to the standard sparse coding formulation have also proven useful. Since these are typically trivial extensions or well-known algorithms, we mention them only briefly here.

Total Energy Priors. One deficiency of the sparse coding framework for energy disaggregation is that the optimization objective does not take into consideration the size of an energy signal for determining which class it belongs to, just its shape. Since total energy used is obviously a discriminating factor for different device types, we consider an extension that penalizes the ℓ2 deviation between a device and its mean total energy. 
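Returning briefly to Algorithm 1: the discriminative update in steps 4b–4c (the gradient step of (10), the positive-part projection, and the column renormalization) can be sketched in a few lines. This is a minimal NumPy illustration of our own, not the authors' implementation; `ddsc_update` is a name we introduce here.

```python
import numpy as np

def ddsc_update(Bt, Xbar, A_hat, A_star, alpha):
    """One discriminative update (steps 4b-4c of Algorithm 1).

    Bt:     concatenated disaggregation bases B-tilde, shape (T, n_total)
    A_hat:  activations from the current bases, eq. (2)
    A_star: target activations from per-class sparse coding, eq. (6)
    """
    # Gradient step of eq. (10), then keep only the positive part.
    grad = (Xbar - Bt @ A_hat) @ A_hat.T - (Xbar - Bt @ A_star) @ A_star.T
    Bt = np.maximum(0.0, Bt - alpha * grad)
    # Renormalize each column to unit l2 norm (step 4c).
    norms = np.linalg.norm(Bt, axis=0)
    norms[norms == 0] = 1.0   # leave any all-zero column untouched
    return Bt / norms
```

Note that the update pushes the bases so that the (wrong) activations Â become more expensive and the target activations A⋆ become cheaper under the objective F, which is exactly the structured perceptron intuition.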
Formally, we augment the objective F with the penalty

$$F_{TEP}(\bar{X}, B_{1:k}, A_{1:k}) = F(\bar{X}, B_{1:k}, A_{1:k}) + \lambda_{TEP} \sum_{i=1}^{k} \|\mu_i 1^T - 1^T B_i A_i\|_2^2 \qquad (11)$$

where 1 denotes a vector of ones of the appropriate size, and $\mu_i = \frac{1}{m} 1^T X_i 1$ denotes the average total energy of device class i.

Group Lasso. Since the data set we consider exhibits some amount of sparsity at the device level (i.e., several examples have zero energy consumed by certain device types, as there is either no such device in the home or it was not being monitored), we also would like to encourage a grouping effect in the activations. That is, we would like a certain coefficient being active for a particular class to encourage other coefficients to also be active in that class. To achieve this, we employ the group Lasso algorithm [29], which adds an ℓ2 norm penalty to the activations of each device,

$$F_{GL}(\bar{X}, B_{1:k}, A_{1:k}) = F(\bar{X}, B_{1:k}, A_{1:k}) + \lambda_{GL} \sum_{i=1}^{k} \sum_{j=1}^{m} \|a^{(j)}_i\|_2. \qquad (12)$$

Shift Invariant Sparse Coding. Shift invariant, or convolutional, sparse coding is an extension to the standard sparse coding framework where each basis is convolved over the input data, with a separate activation for each shift position [3, 10]. Such a scheme may intuitively seem to be beneficial for the energy disaggregation task, where a given device might exhibit the same energy signature at different times. However, as we will show in the next section, this extension actually performs worse in our domain; this is likely due to the fact that, since we have ample training data and a relatively low-dimensional domain (each energy signal has 168 dimensions, 24 hours per day times 7 days in the week), the standard sparse coding bases are able to cover all possible shift positions for typical device usage. 
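Both penalty terms above are cheap to evaluate. The following is a rough NumPy sketch of our own (function names ours), assuming μi is the scalar average total energy of class i as in (11):

```python
import numpy as np

def tep_penalty(X_list, B_list, A_list, lam_tep):
    """Eq. (11): squared l2 deviation between each class's predicted total
    energy per example and its mean total energy mu_i."""
    total = 0.0
    for Xi, Bi, Ai in zip(X_list, B_list, A_list):
        mu = Xi.sum() / Xi.shape[1]            # mu_i: average total energy
        pred_totals = (Bi @ Ai).sum(axis=0)    # 1^T B_i A_i, one entry per example
        total += np.sum((mu - pred_totals) ** 2)
    return lam_tep * total

def group_lasso_penalty(A_list, lam_gl):
    """Eq. (12): sum of column-wise l2 norms of each class's activations."""
    return lam_gl * sum(np.linalg.norm(Ai, axis=0).sum() for Ai in A_list)
```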
However, pure shift invariant bases cannot capture information about when in the week or day each device is typically used, and such information has proven crucial for disaggregation performance.

2.3 Implementation

Space constraints preclude a full discussion of the implementation details of our algorithms, but for the most part we rely on standard methods for solving the optimization problems. In particular, most of the time spent by the algorithm involves solving sparse optimization problems to find the activation coefficients, namely steps 2a and 4a in Algorithm 1. We use a coordinate descent approach here, both for the standard and group Lasso versions of the optimization problems, as these have been recently shown to be efficient algorithms for ℓ1-type optimization problems [8, 9], and have the added benefit that we can warm-start the optimization with the solution from previous iterations. To solve the optimization over Bi in step 2b, we use the multiplicative non-negative matrix factorization update from [7].

3 Experimental Results

3.1 The Plugwise Energy Data Set and Experimental Setup

We conducted this work using a data set provided by Plugwise, a European manufacturer of plug-level monitoring devices. The data set contains hourly energy readings from 10,165 different devices in 590 homes, collected over more than two years. Each device is labeled with one of 52 device types, which we further reduce to ten broad categories of electrical devices: lighting, TV, computer, other electronics, kitchen appliances, washing machine and dryer, refrigerator and freezer, dishwasher, heating/cooling, and a miscellaneous category. 
We look at time periods in blocks of one week, and try to predict the individual device consumption over this week given only the whole-home signal (since the data set does not currently contain true whole-home energy readings, we approximate the home's overall energy usage by aggregating the individual devices). Crucially, we focus on disaggregating data from homes that are absent from the training set (we assigned 70% of the homes to the training set, and 30% to the test set, resulting in 17,133 total training weeks and 6846 testing weeks); thus, we are attempting to generalize over the basic category of devices, not just over different uses of the same device in a single house. We fit the hyper-parameters of the algorithms (number of bases and regularization parameters) using grid search over each parameter independently on a cross validation set consisting of 20% of the training homes.

3.2 Qualitative Evaluation of the Disaggregation Algorithms

We first look qualitatively at the results obtained by the method. Figure 1 shows the true energy consumed by two different houses in the test set for two different weeks, along with the energy consumption predicted by our algorithms. The figure shows both the predicted energy of several devices over the whole week, as well as a pie chart that shows the relative energy consumption of different device types over the whole week (a more intuitive display of energy consumed over the week). In many cases, certain devices like the refrigerator, washer/dryer, and computer are predicted quite accurately, both in terms of the total predicted percentage and in terms of the signals themselves. There are also cases where certain devices are not predicted well, such as underestimating the heating component in the example on the left, and predicting a spike in computer usage in the example on the right when it was in fact a dishwasher. 
Nonetheless, despite some poor predictions at the hourly device level, the breakdown of electric consumption is still quite informative, determining the approximate percentage of many device types and demonstrating the promise of such feedback.

In addition to the disaggregation results themselves, sparse coding representations of the different device types are interesting in their own right, as they give a good intuition about how the different devices are typically used. Figure 2 shows a graphical representation of the learned basis functions. In each plot, the grayscale image on the right shows an intensity map of all basis functions learned for that device category, where each column in the image corresponds to a learned basis. The plot on the left shows examples of seven basis functions for the different device types. Notice, for example, that the bases learned for the washer/dryer devices are nearly all heavily peaked, while the refrigerator bases are much lower in maximum magnitude. 
Additionally, in the basis images, devices like lighting demonstrate a clear "band" pattern, indicating that these devices are likely to be on and off during certain times of the day (each basis covers a week of energy usage, so the seven bands represent the seven days).

[Figure 1: Example predicted energy profiles and total energy percentages (best viewed in color). Blue lines show the true energy usage, and red the predicted usage, both in units of kWh. For two test homes, panels show whole-home, computer, washer/dryer, dishwasher, refrigerator, and heating/cooling signals over a week, alongside pie charts of true and predicted usage across the ten device categories.]

[Figure 2: Example basis functions learned from three device categories (lighting, refrigerator, washer/dryer; best viewed in color). The plot on the left shows seven example bases, while the image on the right shows all learned basis functions (one basis per column).]

The plots also suggest why the standard implementation of shift invariance is not helpful here. There is sufficient training data such that, for devices like washers and dryers, we learn a separate basis for all possible shifts. In contrast, for devices like lighting, where the time of usage is an important factor, simple shift-invariant bases miss key information.

3.3 Quantitative Evaluation of the Disaggregation Methods

There are a number of components to the final algorithm we have proposed, and in this section we present quantitative results that evaluate the performance of each of these different components. While many of the algorithmic elements improve the disaggregation performance, the results in this section show that the discriminative training in particular is crucial for optimizing disaggregation performance. The most natural metric for evaluating disaggregation performance is the disaggregation error in (4). However, average disaggregation error is not a particularly intuitive metric, and so we also evaluate a total-week accuracy of the prediction system, defined formally as

$$\text{Accuracy} \equiv \frac{\sum_{i,q} \min\left\{ \sum_p (X_i)_{pq}, \; \sum_p (B_i \hat{A}_i)_{pq} \right\}}{\sum_{p,q} \bar{X}_{pq}}. \qquad (13)$$

Despite the complex definition, this quantity simply captures the average amount of energy predicted correctly over the week (i.e., the overlap between the true and predicted energy pie charts).

Method                      Training Set           Test Set
                            Disagg. Err.  Acc.     Disagg. Err.  Acc.
Predict Mean Energy         20.98         45.78%   21.72         47.41%
SISC                        20.84         41.87%   24.08         41.79%
Sparse Coding               10.54         56.96%   18.69         48.00%
Sparse Coding + TEP         11.27         55.52%   16.86         50.62%
Sparse Coding + GL          10.55         54.98%   17.18         46.46%
Sparse Coding + TEP + GL     9.24         58.03%   14.05         52.52%
DDSC                         7.20         64.42%   15.59         53.70%
DDSC + TEP                   8.99         59.61%   15.61         53.23%
DDSC + GL                    7.59         63.09%   14.58         52.20%
DDSC + TEP + GL              7.92         61.64%   13.20         55.05%

Table 1: Disaggregation results of algorithms (TEP = Total Energy Prior, GL = Group Lasso, SISC = Shift Invariant Sparse Coding, DDSC = Discriminative Disaggregation Sparse Coding).

[Figure 3: Evolution of training and testing errors for iterations of the discriminative DDSC updates; panels show disaggregation error and accuracy on the training and test sets versus DDSC iteration.]

Table 1 shows the disaggregation performance obtained by many different prediction methods. The advantage of the discriminative training procedure is clear: all the methods employing discriminative training perform nearly as well or better than all the methods without discriminative training; furthermore, the system with all the extensions, discriminative training, a total energy prior, and the group Lasso, outperforms all competing methods on both metrics. 
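The accuracy metric in (13) reduces to an overlap between true and predicted per-class weekly energy totals; a small NumPy sketch of our own (function name ours):

```python
import numpy as np

def total_week_accuracy(X_list, pred_list):
    """Eq. (13): fraction of total energy assigned to the correct class,
    i.e. the overlap of true and predicted per-class weekly totals."""
    overlap = 0.0
    for Xi, Pi in zip(X_list, pred_list):
        # Column sums give total weekly energy per example for this class.
        overlap += np.minimum(Xi.sum(axis=0), Pi.sum(axis=0)).sum()
    denom = sum(Xi.sum() for Xi in X_list)   # total energy in the aggregate
    return overlap / denom
```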
To put these accuracies in context, we note that separately from the results presented here we trained an SVM, using a variety of hand-engineered features, to classify individual energy signals into their device category, and were able to achieve at most 59% classification accuracy. It therefore seems unlikely that we could disaggregate a signal to above this accuracy and so, informally speaking, we expect the achievable performance on this particular data set to range between 47% for the baseline of predicting mean energy (which in fact is a very reasonable method, as devices often follow their average usage patterns) and 59% for the individual classification accuracy. It is clear, then, that the discriminative training is crucial to improving the performance of the sparse coding disaggregation procedure within this range, and does provide a significant improvement over the baseline. Finally, as shown in Figure 3, both the training and testing error decrease reliably with iterations of DDSC, and we have found that this result holds for a wide range of parameter choices and step sizes (though, as with all gradient methods, some care must be taken to choose a step size that is not prohibitively large).

4 Conclusion

Energy disaggregation is a domain where advances in machine learning can have a significant impact on energy use. In this paper we presented an application of sparse coding algorithms to this task, focusing on a large data set that contains the type of low-resolution data readily available from smart meters. We developed the discriminative disaggregation sparse coding (DDSC) algorithm, a novel discriminative training procedure, and showed that this algorithm significantly improves the accuracy of sparse coding for the energy disaggregation task.

Acknowledgments This work was supported by ARPA-E (Advanced Research Projects Agency-Energy) under grant number DE-AR0000018.
We are very grateful to Plugwise for providing us with their plug-level energy data set, and in particular we thank Willem Houck for his assistance with this data. We also thank Carrie Armel and Adrian Albert for helpful discussions.

References

[1] D. Archer. Global Warming: Understanding the Forecast. Blackwell Publishing, 2008.
[2] M. Berges, E. Goldman, H. S. Matthews, and L. Soibelman. Learning systems for electric consumption of buildings. In ASCE International Workshop on Computing in Civil Engineering, 2009.
[3] T. Blumensath and M. Davies. On shift-invariant sparse coding. Lecture Notes in Computer Science, 3195(1):1205-1212, 2004.
[4] D. Bradley and J. A. Bagnell. Differentiable sparse coding. In Advances in Neural Information Processing Systems, 2008.
[5] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2002.
[6] S. Darby. The effectiveness of feedback on energy consumption. Technical report, Environmental Change Institute, University of Oxford, 2006.
[7] J. Eggert and E. Korner. Sparse coding and NMF. In IEEE International Joint Conference on Neural Networks, 2004.
[8] J. Friedman, T. Hastie, H. Hoefling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 2(1):302-332, 2007.
[9] J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical report, Stanford University, 2010.
[10] R. Grosse, R. Raina, H. Kwong, and A. Y. Ng. Shift-invariant sparse coding for audio classification. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2007.
[11] G. Hart. Nonintrusive appliance load monitoring. Proceedings of the IEEE, 80(12), 1992.
[12] S. Hasler, H. Wersin, and E. Korner. Combining reconstruction and discrimination with class-specific sparse coding. Neural Computation, 19(7):1897-1918, 2007.
[13] P. O. Hoyer. Non-negative sparse coding. In IEEE Workshop on Neural Networks for Signal Processing, 2002.
[14] C. Laughman, K. Lee, R. Cox, S. Shaw, S. Leeb, L. Norford, and P. Armstrong. Power signature analysis. IEEE Power & Energy Magazine, 2003.
[15] C. Laughman, S. Leeb, and Lee. Advanced non-intrusive monitoring of electric loads. IEEE Power and Energy, 2003.
[16] W. Lee, G. Fung, H. Lam, F. Chan, and M. Lucente. Exploration on load signatures. In International Conference on Electrical Engineering (ICEE), 2004.
[17] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In Advances in Neural Information Processing Systems, 2008.
[18] J. Mairal, M. Leordeanu, F. Bach, M. Hebert, and J. Ponce. Discriminative sparse image models for class-specific edge detection and image interpretation. In European Conference on Computer Vision, 2008.
[19] B. Neenan and J. Robinson. Residential electricity use feedback: A research synthesis and economic framework. Technical report, Electric Power Research Institute, 2009.
[20] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607-609, 1996.
[21] S. N. Patel, T. Robertson, J. A. Kientz, M. S. Reynolds, and G. D. Abowd. At the flick of a switch: Detecting and classifying unique electrical events on the residential power line. In 9th International Conference on Ubiquitous Computing (UbiComp 2007), 2007.
[22] S. T. Roweis. One microphone source separation. In Advances in Neural Information Processing Systems, 2000.
[23] M. N. Schmidt, J. Larsen, and F. Hsiao. Wind noise reduction using non-negative sparse coding. In IEEE Workshop on Machine Learning for Signal Processing, 2007.
[24] M. N. Schmidt and R. K. Olsson. Single-channel speech separation using sparse non-negative matrix factorization. In International Conference on Spoken Language Processing, 2006.
[25] S. R. Shaw, C. B. Abler, R. F. Lepard, D. Luo, S. B. Leeb, and L. K. Norford. Instrumentation for high performance nonintrusive electrical load monitoring. ASME, 120(224), 1998.
[26] F. Sultanem. Using appliance signatures for monitoring residential loads at meter panel level. IEEE Transactions on Power Delivery, 6(4), 1991.
[27] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: A large margin approach. In International Conference on Machine Learning, 2005.
[28] Various. Annual Energy Review 2009. U.S. Energy Information Administration, 2009.
[29] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49-67, 2007.