{"title": "Wavelets on Graphs via Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 998, "page_last": 1006, "abstract": "An increasing number of applications require processing of signals defined on weighted graphs. While wavelets provide a flexible tool for signal processing in the classical setting of regular domains, the existing graph wavelet constructions are less flexible -- they are guided solely by the structure of the underlying graph and do not take directly into consideration the particular class of signals to be processed. This paper introduces a machine learning framework for constructing graph wavelets that can sparsely represent a given class of signals. Our construction uses the lifting scheme, and is based on the observation that the recurrent nature of the lifting scheme gives rise to a structure resembling a deep auto-encoder network. Particular properties that the resulting wavelets must satisfy determine the training objective and the structure of the involved neural networks. The training is unsupervised, and is conducted similarly to the greedy pre-training of a stack of auto-encoders. After training is completed, we obtain a linear wavelet transform that can be applied to any graph signal in time and memory linear in the size of the graph. Improved sparsity of our wavelet transform for the test signals is confirmed via experiments both on synthetic and real data.", "full_text": "Wavelets on Graphs via Deep Learning\n\nComputer Science Department, Stanford University\n\nRaif M. Rustamov & Leonidas Guibas\n{rustamov,guibas}@stanford.edu\n\nAbstract\n\nAn increasing number of applications require processing of signals de\ufb01ned on\nweighted graphs. 
While wavelets provide a \ufb02exible tool for signal processing in\nthe classical setting of regular domains, the existing graph wavelet constructions\nare less \ufb02exible \u2013 they are guided solely by the structure of the underlying graph\nand do not take directly into consideration the particular class of signals to be\nprocessed. This paper introduces a machine learning framework for constructing\ngraph wavelets that can sparsely represent a given class of signals. Our construction\nuses the lifting scheme, and is based on the observation that the recurrent nature\nof the lifting scheme gives rise to a structure resembling a deep auto-encoder\nnetwork. Particular properties that the resulting wavelets must satisfy determine the\ntraining objective and the structure of the involved neural networks. The training is\nunsupervised, and is conducted similarly to the greedy pre-training of a stack of\nauto-encoders. After training is completed, we obtain a linear wavelet transform\nthat can be applied to any graph signal in time and memory linear in the size of the\ngraph. Improved sparsity of our wavelet transform for the test signals is con\ufb01rmed\nvia experiments both on synthetic and real data.\n\n1\n\nIntroduction\n\nProcessing of signals on graphs is emerging as a fundamental problem in an increasing number of\napplications [22]. Indeed, in addition to providing a direct representation of a variety of networks\narising in practice, graphs serve as an overarching abstraction for many other types of data. High-\ndimensional data clouds such as a collection of handwritten digit images, volumetric and connectivity\ndata in medical imaging, laser scanner acquired point clouds and triangle meshes in computer graphics\n\u2013 all can be abstracted using weighted graphs. 
Given this generality, it is desirable to extend the\n\ufb02exibility of classical tools such as wavelets to the processing of signals de\ufb01ned on weighted graphs.\nA number of approaches for constructing wavelets on graphs have been proposed, including, but\nnot limited to the CKWT [7], Haar-like wavelets [24, 10], diffusion wavelets [6], spectral wavelets\n[12], tree-based wavelets [19], average-interpolating wavelets [21], and separable \ufb01lterbank wavelets\n[17]. However, all of these constructions are guided solely by the structure of the underlying graph,\nand do not take directly into consideration the particular class of signals to be processed. While\nthis information can be incorporated indirectly when building the underlying graph (e.g. [19, 17]),\nsuch an approach does not fully exploit the degrees of freedom inherent in wavelet design. In\ncontrast, a variety of signal class speci\ufb01c and adaptive wavelet constructions exist on images and\nmultidimensional regular domains, see [9] and references therein. Bridging this gap is challenging\nbecause obtaining graph wavelets, let alone adaptive ones, is complicated by the irregularity of the\nunderlying space. In addition, theoretical guidance for such adaptive constructions is lacking as it\nremains largely unknown how the properties of the graph wavelet transforms, such as sparsity, relate\nto the structural properties of graph signals and their underlying graphs [22].\nThe goal of our work is to provide a machine learning framework for constructing wavelets on\nweighted graphs that can sparsely represent a given class of signals. Our construction uses the lifting\n\n1\n\n\fscheme as applied to the Haar wavelets, and is based on the observation that the update and predict\nsteps of the lifting scheme are similar to the encode and decode steps of an auto-encoder. 
From this\npoint of view, the recurrent nature of the lifting scheme gives rise to a structure resembling a deep\nauto-encoder network.\nParticular properties that the resulting wavelets must satisfy, such as sparse representation of signals,\nlocal support, and vanishing moments, determine the training objective and the structure of the\ninvolved neural networks. The goal of achieving sparsity translates into minimizing a sparsity\nsurrogate of the auto-encoder reconstruction error. Vanishing moments and locality can be satis\ufb01ed\nby tying the weights of the auto-encoder in a special way and by restricting receptive \ufb01elds of neurons\nin a manner that incorporates the structure of the underlying graph. The training is unsupervised, and\nis conducted similarly to the greedy (pre-)training [13, 14, 2, 20] of a stack of auto-encoders.\nThe advantages of our construction are three-fold. First, when no training functions are speci\ufb01ed\nby the application, we can impose a smoothness prior and obtain a novel general-purpose wavelet\nconstruction on graphs. Second, our wavelets are adaptive to a class of signals and after training\nwe obtain a linear transform; this is in contrast to adapting to the input signal (e.g. by modifying\nthe underlying graph [19, 17]) which effectively renders those transforms non-linear. Third, our\nconstruction provides ef\ufb01cient and exact analysis and synthesis operators and results in a critically\nsampled basis that respects the multiscale structure imposed on the underlying graph.\nThe paper is organized as follows: in \u00a72 we brie\ufb02y overview the lifting scheme. Next, in \u00a73 we\nprovide a general overview of our approach, and \ufb01ll in the details in \u00a74. Finally, we present a number\nof experiments in \u00a75.\n\n2 Lifting scheme\n\nThe goal of wavelet design is to obtain a multiresolution [16] of L2(G) \u2013 the set of all functions/signals\non graph G. 
Namely, a nested sequence of approximation spaces from coarse to fine of the form V_1 ⊂ V_2 ⊂ ... ⊂ V_{ℓmax} = L2(G) is constructed. Projecting a signal onto the spaces V_ℓ provides better and better approximations with increasing level ℓ. Associated wavelet/detail spaces W_ℓ satisfying V_{ℓ+1} = V_ℓ ⊕ W_ℓ are also obtained.
Scaling functions {φ_{ℓ,k}} provide a basis for the approximation space V_ℓ, and similarly wavelet functions {ψ_{ℓ,k}} for W_ℓ. As a result, for any signal f ∈ L2(G) on the graph and any level ℓ0 < ℓmax, we have the wavelet decomposition

f = Σ_k a_{ℓ0,k} φ_{ℓ0,k} + Σ_{ℓ=ℓ0}^{ℓmax−1} Σ_k d_{ℓ,k} ψ_{ℓ,k}.    (1)

The coefficients a_{ℓ,k} and d_{ℓ,k} appearing in this decomposition are called approximation (also, scaling) and detail (also, wavelet) coefficients respectively. For simplicity, we use a_ℓ and d_ℓ to denote the vectors of all approximation and detail coefficients at level ℓ.
Our construction of wavelets is based on the lifting scheme [23]. Starting with a given wavelet transform, which in our case is the Haar transform (HT), one can obtain lifted wavelets by applying the process illustrated in Figure 1(left), starting with ℓ = ℓmax − 1, a_{ℓmax} = f and iterating down until ℓ = 1. At every level the lifted coefficients a_ℓ and d_ℓ are computed by augmenting the Haar coefficients ā_ℓ and d̄_ℓ (of the lifted approximation coefficients a_{ℓ+1}) as follows:

a_ℓ ← ā_ℓ + U d̄_ℓ
d_ℓ ← d̄_ℓ − P a_ℓ,

where update (U) and predict (P) are linear operators (matrices). Note that in adaptive wavelet designs the update and predict operators will vary from level to level, but for simplicity of notation we do not indicate this explicitly.

Figure 1: Lifting scheme: one step of forward (left) and backward (right) transform. Here, a_ℓ and d_ℓ denote the vectors of all approximation and detail coefficients of the lifted transform at level ℓ. U and P are linear update and predict operators. HT and IHT are the Haar transform and its inverse.

This process is always invertible – the backward transform, with IHT being the inverse Haar transform, is depicted in Figure 1(right) and allows perfect reconstruction of the original signal. While the wavelets and scaling functions are not explicitly computed during either the forward or the backward transform, it is possible to recover them using the expansion of Eq. (1). For example, to obtain a specific scaling function φ_{ℓ,k}, one simply sets all of the approximation and detail coefficients to zero, except for a_{ℓ,k} = 1, and runs the backward transform.

3 Approach

For a given class of signals, our objective is to design wavelets that yield approximately sparse expansions in Eq. (1) – i.e. the detail coefficients are mostly small with a tiny fraction of large coefficients. Therefore, we learn the update and predict operators that minimize some sparsity surrogate of the detail (wavelet) coefficients of the given training functions {f^n}_{n=1}^{nmax}.
For a fixed multiresolution level ℓ and a training function f^n, let ā^n_ℓ and d̄^n_ℓ be the Haar approximation and detail coefficient vectors of f^n received at level ℓ (i.e. HT applied to a^n_{ℓ+1} as in Figure 1(left)). Consider the minimization problem

{U, P} = arg min_{U,P} Σ_n s(d^n_ℓ) = arg min_{U,P} Σ_n s(d̄^n_ℓ − P(ā^n_ℓ + U d̄^n_ℓ)),    (2)

where s is some sparse penalty function. This can be seen as optimizing a linear auto-encoder with the encoding step given by ā^n_ℓ + U d̄^n_ℓ, and the decoding step given by multiplication with the matrix P. Since we would like to obtain a linear wavelet transform, the linearity of the encode and decode steps is of crucial importance. In addition to linearity and the special form of the bias terms, our auto-encoders differ from commonly used ones in that we enforce sparsity on the reconstruction error, rather than on the hidden representation – in our setting, the reconstruction errors correspond to detail coefficients.
The optimization problem of Eq. (2) suffers from a trivial solution: by choosing the update matrix to have large norm (e.g. a large coefficient times the identity matrix), and the predict operator equal to the inverse of the update, one can practically cancel the contribution of the bias terms, obtaining almost perfect reconstruction. Trivial solutions are a well-known problem in the context of auto-encoders, and an effective remedy is to tie the weights of the encode and decode steps by setting U = P^t. This also has the benefit of decreasing the number of parameters to learn. We follow a similar strategy and tie the weights of the update and predict steps, but the specific form of tying is dictated by the wavelet properties and will be discussed in §4.2.
The training is conducted in a manner similar to the greedy pre-training of a stack of auto-encoders [13, 14, 2, 20]. 
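One lifting step and the sparsity objective of Eq. (2) can be sketched in a few lines of numpy. This is our illustration, not the paper's code: it uses random coefficients and the simple tying U = P^t (the volume-weighted tying of §4.2 is not applied here), and all function names are ours.

```python
import numpy as np

def lift_forward(a_bar, d_bar, U, P):
    """One lifting step: update then predict (Figure 1, left)."""
    a = a_bar + U @ d_bar          # update: refine approximation coefficients
    d = d_bar - P @ a              # predict: detail = prediction residual
    return a, d

def lift_backward(a, d, U, P):
    """Inverse lifting step (Figure 1, right): undo predict, then update."""
    d_bar = d + P @ a
    a_bar = a - U @ d_bar
    return a_bar, d_bar

def sparsity_objective(d, eps=1e-4):
    """Smooth L1 surrogate: sum of s(x) = sqrt(eps + x^2) over details."""
    return float(np.sum(np.sqrt(eps + d**2)))

rng = np.random.default_rng(0)
P = rng.normal(size=(3, 4)) * 0.1      # predict operator (details x approximations)
U = P.T                                 # simple tying U = P^t, for illustration only
a_bar, d_bar = rng.normal(size=4), rng.normal(size=3)

a, d = lift_forward(a_bar, d_bar, U, P)
a_rec, d_rec = lift_backward(a, d, U, P)
obj = sparsity_objective(d)             # value minimized over U, P during training
assert np.allclose(a_rec, a_bar) and np.allclose(d_rec, d_bar)  # exact inversion
```

Note that the backward pass recovers ā_ℓ and d̄_ℓ for any choice of U and P, which is why the lifted transform stays perfectly invertible no matter what the training produces.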
Namely, we first train the update and predict operators at the finest level: here the inputs to the lifting step are the original training functions – this corresponds to ℓ = ℓmax − 1 and ∀n, a^n_{ℓ+1} = f^n in Figure 1(left). After the training of this finest level is completed, we obtain new approximation coefficients a^n_ℓ which are passed to the next level as the training functions, and this process is repeated until one reaches the coarsest level.
The use of tied auto-encoders is motivated by their success in deep learning, revealing their capability to learn useful features from the data under a variety of circumstances. The choice of the lifting scheme as the backbone of our construction is motivated by several observations. First, every invertible 1D discrete wavelet transform can be factored into lifting steps [8], which makes lifting a universal tool for constructing multiresolutions. Second, the lifting scheme is always invertible, and provides exact reconstruction of signals. Third, it affords a fast (linear time) and memory efficient (in-place) implementation after the update and predict operators are specified. We choose to apply lifting to Haar wavelets specifically because Haar wavelets are easy to define on any underlying space provided that it can be hierarchically partitioned [24, 10]. Our use of the update-first scheme mirrors its common use for adaptive wavelet constructions in the image processing literature, which is motivated by its stability; see [4] for a thorough discussion.

4 Construction details

We consider a simple connected weighted graph G with vertex set V of size N. A signal on the graph is represented by a vector f ∈ R^N. 
Let W be the N × N edge weight matrix (since there are no self-loops, W_ii = 0), and let S be the diagonal N × N matrix of vertex weights; if no vertex weights are given, we set S_ii = Σ_j W_ij. For a graph signal f, we define its integral over the graph as a weighted sum, ∫_G f = Σ_i S_ii f(i). We define the volume of a subset R of vertices of the graph by Vol(R) = ∫_R 1 = Σ_{i∈R} S_ii.
We assume that a hierarchical partitioning (not necessarily dyadic) of the underlying graph into connected regions is provided. We denote the regions at level ℓ = 1, ..., ℓmax by R_{ℓ,k}; see the inset where the three coarsest partition levels of a dataset are shown. For each region at levels ℓ = 1, ..., ℓmax − 1, we designate arbitrarily all except one of its children (i.e. regions at level ℓ + 1) as active regions. As will become clear, our wavelet construction yields one approximation coefficient a_{ℓ,k} for each region R_{ℓ,k}, and one detail coefficient d_{ℓ,k} for each active region R_{ℓ+1,k} at level ℓ + 1. Note that if the partition is not dyadic, at a given level ℓ the number of scaling coefficients (equal to the number of regions at level ℓ) will not be the same as the number of detail coefficients (equal to the number of active regions at level ℓ + 1). We collect all of the coefficients at the same level into vectors denoted by a_ℓ and d_ℓ; to keep our notation lightweight, we refrain from using boldface for vectors.

4.1 Haar wavelets

Usually, the (unnormalized) Haar approximation and detail coefficients of a signal f are computed as follows. The coefficient ā_{ℓ,k} corresponding to region R_{ℓ,k} equals the average of the function f on that region: ā_{ℓ,k} = Vol(R_{ℓ,k})^{-1} ∫_{R_{ℓ,k}} f. 
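Under the convention above (S_ii = Σ_j W_ij when no vertex weights are given), the integral, volume, and Haar region average can be sketched as follows. The toy 4-cycle graph and all function names are ours, for illustration only.

```python
import numpy as np

def vertex_weights(W):
    """Default vertex weights S_ii = sum_j W_ij (no vertex weights supplied)."""
    return W.sum(axis=1)

def integral(f, s):
    """Integral of a graph signal: weighted sum over vertices, sum_i S_ii f(i)."""
    return float(np.dot(s, f))

def volume(region, s):
    """Vol(R): sum of vertex weights over the region's vertices."""
    return float(s[list(region)].sum())

def haar_average(f, region, s):
    """Unnormalized Haar approximation coefficient: weighted average of f on R."""
    idx = list(region)
    return float(np.dot(s[idx], f[idx]) / s[idx].sum())

# Toy 4-cycle graph with unit edge weights; every vertex weight is 2.
W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
s = vertex_weights(W)
f = np.array([1.0, 2.0, 3.0, 4.0])
assert integral(f, s) == 20.0          # 2*(1+2+3+4)
assert volume({0, 1}, s) == 4.0
assert haar_average(f, {0, 1}, s) == 1.5
```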
The detail coefficient d̄_{ℓ,k} corresponding to an active region R_{ℓ+1,k} is the difference between the averages at the region R_{ℓ+1,k} and its parent region R_{ℓ,par(k)}, namely d̄_{ℓ,k} = ā_{ℓ+1,k} − ā_{ℓ,par(k)}. For perfect reconstruction there is no need to keep detail coefficients for inactive regions, because these can be recovered from the scaling coefficient of the parent region and the detail coefficients of the sibling regions.
In our setting, Haar wavelets are a part of the lifting scheme, and so the coefficient vectors ā_ℓ and d̄_ℓ at level ℓ need to be computed from the augmented coefficient vector a_{ℓ+1} at level ℓ + 1 (cf. Figure 1(left)). This is equivalent to computing a function's average at a given region from its averages at the children regions. As a result, we obtain the following formula:

ā_{ℓ,k} = Vol(R_{ℓ,k})^{-1} Σ_{j : par(j)=k} a_{ℓ+1,j} Vol(R_{ℓ+1,j}),

where the summation is over all the children regions of R_{ℓ,k}. As before, the detail coefficient corresponding to an active region R_{ℓ+1,k} is given by d̄_{ℓ,k} = a_{ℓ+1,k} − ā_{ℓ,par(k)}. The resulting Haar wavelets are not normalized; when sorting wavelet/scaling coefficients we will multiply coefficients coming from level ℓ by 2^{−ℓ/2}.

4.2 Auto-encoder setup

The choice of the update and predict operators and their tying scheme is guided by a number of properties that wavelets need to satisfy. We discuss these requirements under separate headings.

Vanishing moments: The wavelets should have vanishing dual and primal moments – two independent conditions due to the biorthogonality of our wavelets. 
In terms of the approximation and detail coefficients these can be expressed as follows: a) all of the detail coefficients of a constant function should be zero, and b) the integral of the approximation at any level of the multiresolution should be the same as the integral of the original function.
Since these conditions are already satisfied by the Haar wavelets, we need to ensure that the update and predict operators preserve them. To be more precise, if a_{ℓ+1} is a constant vector, then we have for the Haar coefficients that ā_ℓ = c1 and d̄_ℓ = 0; here c is some constant, 1 is the column-vector of all ones, and 0 is the zero vector. To satisfy a) after lifting, we need to ensure that d_ℓ = d̄_ℓ − P(ā_ℓ + U d̄_ℓ) = −P ā_ℓ = −cP1 = 0. Therefore, the rows of the predict operator should sum to zero: P1 = 0.
To satisfy b), we need to preserve the first order moment at every level ℓ by requiring Σ_k a_{ℓ+1,k} Vol(R_{ℓ+1,k}) = Σ_k ā_{ℓ,k} Vol(R_{ℓ,k}) = Σ_k a_{ℓ,k} Vol(R_{ℓ,k}). The first equality is already satisfied (due to the use of Haar wavelets), so we need to constrain our update operator. Introducing the diagonal matrix A_c of the region volumes at level ℓ, we can write 0 = Σ_k a_{ℓ,k} Vol(R_{ℓ,k}) − Σ_k ā_{ℓ,k} Vol(R_{ℓ,k}) = Σ_k (U d̄_ℓ)_k Vol(R_{ℓ,k}) = 1^t A_c U d̄_ℓ. Since this should be satisfied for all d̄_ℓ, we must have 1^t A_c U = 0^t.
Taking these two requirements into consideration, we impose the following constraints on the predict and update weights:

P1 = 0 and U = A_c^{-1} P^t A_f,

where A_f is the diagonal matrix of the active region volumes at level ℓ + 1. 
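These constraints are easy to verify numerically. The sketch below builds a random P with zero row sums, ties U = A_c^{-1} P^t A_f, and checks that a constant signal produces zero details and that 1^t A_c U = 0^t; the dimensions and region volumes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_approx, n_detail = 5, 3          # regions at level l, active regions at level l+1

# Hypothetical region volumes (diagonal matrices A_c and A_f).
vol_c = rng.uniform(1.0, 2.0, n_approx)
vol_f = rng.uniform(0.5, 1.0, n_detail)
Ac, Af = np.diag(vol_c), np.diag(vol_f)

# Random predict operator with rows summing to zero (P @ 1 = 0).
P = rng.normal(size=(n_detail, n_approx))
P -= P.mean(axis=1, keepdims=True)

# Tied update operator U = Ac^{-1} P^t Af.
U = np.linalg.inv(Ac) @ P.T @ Af

# a) constant input: Haar gives a_bar = c*1 and d_bar = 0, so lifted details vanish
a_bar, d_bar = 2.5 * np.ones(n_approx), np.zeros(n_detail)
a = a_bar + U @ d_bar
d = d_bar - P @ a
assert np.allclose(d, 0.0)

# b) 1^t Ac U = (P1)^t Af = 0^t, so lifting preserves the first moment
assert np.allclose(np.ones(n_approx) @ Ac @ U, 0.0)
```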
It is easy to check that 1^t A_c U = 1^t A_c A_c^{-1} P^t A_f = 1^t P^t A_f = (P1)^t A_f = 0^t A_f = 0^t as required. We have introduced the volume matrix A_f of the regions at the finer level to make the update/predict matrices dimensionless (i.e. insensitive to whether the volume is measured in any particular units).

Locality: To make our wavelets and scaling functions localized on the graph, we need to constrain the update and predict operators in a way that disallows distant regions from updating or predicting the approximation/detail coefficients of each other.
Since the update is tied to the predict operator, we can limit ourselves to the latter. For a detail coefficient d_{ℓ,k} corresponding to the active region R_{ℓ+1,k}, we only allow predictions that come from the parent region R_{ℓ,par(k)} and the immediate neighbors of this parent region. Two regions of a graph are considered neighboring if their union is a connected graph. This can be seen as enforcing a sparsity structure on the matrix P or as limiting the interconnections between the layers of neurons. As a result of this choice, it is not difficult to see that the resulting scaling functions φ_{ℓ,k} and wavelets ψ_{ℓ,k} will be supported in the vicinity of the region R_{ℓ,k}. Larger supports can be obtained by allowing the use of second and higher order neighbors of the parent for prediction.

4.3 Optimization

A variety of ways for optimizing auto-encoders are available; we refer the reader to the recent paper [15] and references therein. In our setting, due to the relatively small size of the training set and the sparse inter-connectivity between the layers, an off-the-shelf L-BFGS¹ unconstrained smooth optimization package works very well. 
In order to make our problem unconstrained, we do not impose the equation P1 = 0 as a hard constraint; instead, in each row of P (which corresponds to some active region), the weight corresponding to the parent is eliminated, as it is determined by the requirement that the row sum to zero. To obtain a smooth objective, we use the L1 norm with the soft absolute value s(x) = √(ε + x²) ≈ |x|, where we set ε = 10⁻⁴. The initialization is done by setting all of the weights equal to zero. This is meaningful, because it corresponds to no lifting at all, and would reproduce the original Haar wavelets.

¹ Mark Schmidt, http://www.di.ens.fr/~mschmidt/Software/minFunc.html

4.4 Training functions

When training functions are available we directly use them. However, our construction can be applied even if training functions are not specified. In this case we choose smoothness as our prior, and train the wavelets with a set of smooth functions on the graph – namely, we use scaled eigenvectors of the graph Laplacian corresponding to the smallest eigenvalues. More precisely, let D be the diagonal matrix with entries D_ii = Σ_j W_ij. The graph Laplacian L is defined as L = S^{-1}(D − W). We solve the symmetric generalized eigenvalue problem (D − W)ξ = λSξ to compute the smallest eigen-pairs {λ_n, ξ_n}_{n=0}^{nmax}. We discard the 0-th eigen-pair, which corresponds to the constant eigenvector, and use the functions {ξ_n/λ_n}_{n=1}^{nmax} as our training set. The inverse scaling by the eigenvalue is included because eigenvectors corresponding to larger eigenvalues are less smooth (cf. [1]), and so should be assigned smaller weights to achieve a smooth prior.

4.5 Partitioning

Since our construction is based on improving upon the Haar wavelets, their quality will have an effect on the final wavelets. As proved in [10], the quality of Haar wavelets depends on the quality (balance) of the graph partitioning. 
From a practical standpoint, it is hard to achieve high quality partitions on all types of graphs using a single algorithm. However, for the datasets presented in this paper we find that the following approach based on the spectral clustering algorithm of [18] works well. Namely, we first embed the graph vertices into R^{nmax} as follows: i → (ξ_1(i)/λ_1, ξ_2(i)/λ_2, ..., ξ_{nmax}(i)/λ_{nmax}), ∀i ∈ V, where {λ_n, ξ_n}_{n=0}^{nmax} are the eigen-pairs of the Laplacian as in §4.4, and ξ_n(i) is the value of the eigenvector at the i-th vertex of the graph. To obtain a hierarchical tree of partitions, we start with the graph itself as the root. At every step, a given region (a subset of the vertex set) of graph G is split into two children partitions by running the 2-means clustering algorithm (k-means with k = 2) on the above embedding restricted to the vertices of the given partition [24]. This process is continued in recursion at every obtained region. This results in a dyadic partitioning except at the finest level ℓmax.

4.6 Graph construction for point clouds

Our problem setup started with a weighted graph and arrived at the Laplacian matrix L in §4.4. It is also possible to reverse this process, whereby one starts with the Laplacian matrix L and infers from it the weighted graph. This is a natural way of dealing with point clouds sampled from low-dimensional manifolds, a setting common in manifold learning. 
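The recursive 2-means partitioning of §4.5 can be sketched as follows, assuming the spectral embedding of §4.4 has already been computed. This is a minimal Lloyd-iteration sketch with hypothetical names, not the paper's implementation.

```python
import numpy as np

def two_means_split(embed, verts, rng):
    """Split a region's vertices into two children via k-means with k = 2
    (a simple Lloyd iteration) on their spectral embedding."""
    pts = embed[verts]
    centers = pts[rng.choice(len(verts), 2, replace=False)]
    labels = np.zeros(len(verts), dtype=int)
    for _ in range(20):
        labels = np.argmin(((pts[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(2):
            if np.any(labels == c):
                centers[c] = pts[labels == c].mean(axis=0)
    return verts[labels == 0], verts[labels == 1]

def partition_tree(embed, verts, depth, rng):
    """Recursive dyadic partitioning down to a given depth."""
    if depth == 0 or len(verts) <= 1:
        return {"verts": verts, "children": []}
    left, right = two_means_split(embed, verts, rng)
    if len(left) == 0 or len(right) == 0:   # degenerate split: stop early
        return {"verts": verts, "children": []}
    return {"verts": verts,
            "children": [partition_tree(embed, c, depth - 1, rng)
                         for c in (left, right)]}

rng = np.random.default_rng(0)
# Hypothetical 2-D spectral embedding (xi_n(i)/lambda_n) of 8 vertices.
embed = rng.normal(size=(8, 2))
tree = partition_tree(embed, np.arange(8), depth=2, rng=rng)

# The children (if any) partition the parent's vertex set.
parts = tree["children"] or [tree]
assert sorted(np.concatenate([p["verts"] for p in parts]).tolist()) == list(range(8))
```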
There are a number of ways for computing Laplacians on point clouds, see [5]; almost all of them fit into the above form L = S^{-1}(D − W), and so they can be used to infer a weighted graph that can be plugged into our construction.

5 Experiments

Our goal is to experimentally investigate the constructed wavelets for multiscale behavior, meaningful adaptation to training signals, and sparse representation that generalizes to testing signals. For the first two objectives we visualize the scaling functions at different levels ℓ because they provide insight about the signal approximation spaces V_ℓ. The generalization performance can be deduced from comparison to Haar wavelets, because during training we modify Haar wavelets so as to achieve a sparser representation of training signals.
We start with the case of a periodic interval, which is discretized as a cycle graph; 32 scaled eigenvectors (sines and cosines) are used for training. Figure 2 shows the resulting scaling and wavelet functions at level ℓ = 4. Up to discretization errors, the wavelets and scaling functions at the same level are shifts of each other – showing that our construction is able to learn shift invariance from the training functions.
Figure 3(a) depicts a graph representing the road network of Minnesota, with edges showing the major roads and vertices being their intersections. In our construction we employ unit weights on edges and use 32 scaled eigenvectors of the graph Laplacian as training functions. The resulting scaling functions for regions containing the red vertex in Figure 3(a) are shown at different levels in Figure 3(b,c,d,e,f). The function values at graph vertices are color coded from smallest (dark blue) to largest (dark red). 
Note that the scaling functions are continuous and show multiscale spatial behavior.

Figure 2: Scaling (left) and wavelet (right) functions on periodic interval.

Figure 3: Our construction trained with smooth prior on the network (a) yields the scaling functions (b,c,d,e,f). Panels: (a) Road network; (b) Scaling ℓ = 2; (c) Scaling ℓ = 4; (d) Scaling ℓ = 6; (e) Scaling ℓ = 8; (f) Scaling ℓ = 10; (g) Sample function; (h) Reconstruction error. A sample continuous function (g) out of 100 total test functions. Better average reconstruction results (h) for our wavelets (Wav-smooth) indicate a good generalization performance.

To test whether the learned wavelets provide a sparse representation of smooth signals, we synthetically generated 100 continuous functions using the xy-coordinates (the coordinates have not been seen by the algorithm so far) of the vertices; Figure 3(g) shows one such function. Figure 3(h) shows the average error of reconstruction from the expansion of Eq. (1) with ℓ0 = 1 by keeping a specified fraction of the largest detail coefficients. The improvement over the Haar wavelets shows that our model generalizes well to unseen signals.
Next, we apply our approach to real-world graph signals. We use a dataset of average daily temperature measurements² from meteorological stations located on the mainland US. The longitudes and latitudes of stations are treated as coordinates of a point cloud, from which a weighted Laplacian is constructed using [5] with 5-nearest neighbors; the resulting graph is shown in Figure 4(a).
The daily temperature data for the year of 2012 gives us 366 signals on the graph; Figure 4(b) depicts one such signal. We use the signals from the first half of the year to train the wavelets, and test for sparse reconstruction quality on the second half of the year (and vice versa). 
Figure 4(c,d,e,f) depicts some of the scaling functions at a number of levels; note that the depicted scaling function at level ℓ = 2 captures the rough temperature distribution pattern of the US. The average reconstruction error from a specified fraction of the largest detail coefficients is shown in Figure 4(g).

Figure 4: Our construction on the station network (a) trained with daily temperature data (e.g. (b)) yields the scaling functions (c,d,e,f). Panels: (a) GSOD network; (b) April 9, 2012; (c) Scaling ℓ = 2; (d) Scaling ℓ = 4; (e) Scaling ℓ = 6; (f) Scaling ℓ = 8; (g) Reconstruction error; (h) Learning error. Reconstruction results (g) using our wavelets trained on data (Wav-data) and with smooth prior (Wav-smooth). Results of semi-supervised learning (h).

Figure 5: The scaling functions (a) resulting from training on a face images dataset. These wavelets (Wav-data) provide better sparse reconstruction quality than the CDF 9/7 wavelet filterbanks, as measured by PSNR (b) and SSIM (c).

² National Climatic Data Center, ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2012/

As an application, we employ our wavelets for semi-supervised learning of the temperature distribution for a day from the temperatures at a subset of labeled graph vertices. The sought temperature distribution is expanded as in Eq. (1) with ℓ0 = 1, and the coefficients are found by solving a least squares problem using temperature values at labeled vertices. Since we expect the detail coefficients to be sparse, we impose a lasso penalty on them; to make the problem smaller, all detail coefficients for levels ℓ ≥ 7 are set to zero. We compare to the Laplacian regularized least squares [1] and the harmonic interpolation approach [26]. A hold-out set of 25 random vertices is used to assign all the regularization parameters. 
The experiment is repeated for each of the days (not used to learn the wavelets) with the number of labeled vertices ranging from 10 to 200. Figure 4(h) shows the errors averaged over all days; our approach achieves lower error rates than the competitors.
Our final example serves two purposes – showing the benefits of our construction in a standard image processing application and better demonstrating the nature of the learned scaling functions. Images can be seen as signals on a graph – pixels are the vertices and each pixel is connected to its 8 nearest neighbors. We consider all of the Extended Yale Face Database B [11] images (cropped and down-sampled to 32 × 32) as a collection of signals on a single underlying graph. We randomly split the collection into halves for training our wavelets, and test their reconstruction quality on the remaining half. Figure 5(a) depicts a number of the obtained scaling functions at different levels (the rows correspond to levels ℓ = 4, 5, 6, 7, 8) in various locations (columns). The scaling functions have a face-like appearance at coarser levels, and capture more detailed facial features at finer levels. Note that the scaling functions show controllable multiscale spatial behavior.
The quality of reconstruction from a sparse set of detail coefficients is plotted in Figure 5(b,c). Here again we consider the expansion of Eq. (1) with ℓ0 = 1, and reconstruct using a specified proportion of the largest detail coefficients. We also make a comparison to reconstruction using the standard separable CDF 9/7 wavelet filterbanks from the bottom-most level; for both quality metrics, our wavelets trained on data perform better than CDF 9/7. 
The wavelets trained with the smoothness prior (Wav-smooth) do not improve over the Haar wavelets, because the smoothness assumption does not hold for face images.

6 Conclusion

We have introduced an approach to constructing wavelets that take into consideration structural properties of both graph signals and their underlying graphs. An interesting direction for future research would be to randomize the graph partitioning process, or to use bagging over training functions, in order to obtain a family of wavelet constructions on the same graph, leading to overcomplete dictionaries as in [25]. One can also introduce multiple lifting steps at each level, or even add non-linearities, as is common with neural networks. Our wavelets are obtained by training a structure similar to a deep neural network; interestingly, the recent work of Mallat and collaborators (e.g. [3]) goes in the other direction and provides a wavelet interpretation of deep neural networks. Therefore, we believe that there are ample opportunities for future work at the interface between wavelets and deep neural networks.

Acknowledgments: We thank Jonathan Huang for discussions and especially for his advice regarding the experimental section. The authors acknowledge the support of NSF grants FODAVA 808515 and DMS 1228304, AFOSR grant FA9550-12-1-0372, ONR grant N00014-13-1-0341, a Google research award, and the Max Planck Center for Visual Computing and Communications.

References

[1] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56(1-3):209–239, 2004.
[2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press, Cambridge, MA, 2007.
[3] J. Bruna and S. Mallat. Invariant scattering convolution networks.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.
[4] R. L. Claypoole, G. Davis, W. Sweldens, and R. G. Baraniuk. Nonlinear wavelet transforms for image coding via lifting. IEEE Transactions on Image Processing, 12(12):1449–1459, Dec. 2003.
[5] R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, July 2006.
[6] R. R. Coifman and M. Maggioni. Diffusion wavelets. Appl. Comput. Harmon. Anal., 21(1):53–94, 2006.
[7] M. Crovella and E. D. Kolaczyk. Graph wavelets for spatial traffic analysis. In INFOCOM, 2003.
[8] I. Daubechies and W. Sweldens. Factoring wavelet transforms into lifting steps. J. Fourier Anal. Appl., 4(3):245–267, 1998.
[9] M. N. Do and Y. M. Lu. Multidimensional filter banks and multiscale geometric representations. Foundations and Trends in Signal Processing, 5(3):157–264, 2012.
[10] M. Gavish, B. Nadler, and R. R. Coifman. Multiscale wavelets on trees, graphs and high dimensional data: Theory and applications to semi-supervised learning. In ICML, pages 367–374, 2010.
[11] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643–660, 2001.
[12] D. K. Hammond, P. Vandergheynst, and R. Gribonval. Wavelets on graphs via spectral graph theory. Appl. Comput. Harmon. Anal., 30(2):129–150, 2011.
[13] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554, 2006.
[14] G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504–507, July 2006.
[15] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng.
On optimization methods for deep learning. In ICML, pages 265–272, 2011.
[16] S. Mallat. A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way. Academic Press, 3rd edition, 2008.
[17] S. K. Narang and A. Ortega. Multi-dimensional separable critically sampled wavelet filterbanks on arbitrary graphs. In ICASSP, pages 3501–3504, 2012.
[18] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849–856, 2001.
[19] I. Ram, M. Elad, and I. Cohen. Generalized tree-based wavelet transform. IEEE Transactions on Signal Processing, 59(9):4199–4209, 2011.
[20] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 1137–1144. MIT Press, Cambridge, MA, 2007.
[21] R. M. Rustamov. Average interpolating wavelets on point clouds and graphs. CoRR, abs/1110.2227, 2011.
[22] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag., 30(3):83–98, 2013.
[23] W. Sweldens. The lifting scheme: A construction of second generation wavelets. SIAM Journal on Mathematical Analysis, 29(2):511–546, 1998.
[24] A. D. Szlam, M. Maggioni, R. R. Coifman, and J. C. Bremer. Diffusion-driven multiscale analysis on manifolds and graphs: top-down and bottom-up constructions. In SPIE, volume 5914, 2005.
[25] X. Zhang, X. Dong, and P. Frossard. Learning of structured graph dictionaries. In ICASSP, pages 3373–3376, 2012.
[26] X. Zhu, Z. Ghahramani, and J. D. Lafferty.
Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, pages 912–919, 2003.