{"title": "Total Variation Classes Beyond 1d: Minimax Rates, and the Limitations of Linear Smoothers", "book": "Advances in Neural Information Processing Systems", "page_first": 3513, "page_last": 3521, "abstract": "We consider the problem of estimating a function defined over $n$ locations on a $d$-dimensional grid (having all side lengths equal to $n^{1/d}$). When the function is constrained to have discrete total variation bounded by $C_n$, we derive the minimax optimal (squared) $\\ell_2$ estimation error rate, parametrized by $n, C_n$. Total variation denoising, also known as the fused lasso, is seen to be rate optimal. Several simpler estimators exist, such as Laplacian smoothing and Laplacian eigenmaps. A natural question is: can these simpler estimators perform just as well? We prove that these estimators, and more broadly all estimators given by linear transformations of the input data, are suboptimal over the class of functions with bounded variation. This extends fundamental findings of Donoho and Johnstone (1998) on 1-dimensional total variation spaces to higher dimensions. The implication is that the computationally simpler methods cannot be used for such sophisticated denoising tasks, without sacrificing statistical accuracy. We also derive minimax rates for discrete Sobolev spaces over $d$-dimensional grids, which are, in some sense, smaller than the total variation function spaces. Indeed, these are small enough spaces that linear estimators can be optimal---and a few well-known ones are, such as Laplacian smoothing and Laplacian eigenmaps, as we show. Lastly, we investigate the adaptivity of the total variation denoiser to these smaller Sobolev function spaces.", "full_text": "Total Variation Classes Beyond 1d: Minimax Rates,\nand the Limitations of Linear Smoothers\n\nVeeranjaneyulu Sadhanala\nMachine Learning Department\nCarnegie Mellon University\nPittsburgh, PA 15213\nvsadhana@cs.cmu.edu\n\nYu-Xiang Wang\nMachine Learning Department\nCarnegie Mellon University\nPittsburgh, PA 15213\nyuxiangw@cs.cmu.edu\n\nRyan J. Tibshirani\nDepartment of Statistics\nCarnegie Mellon University\nPittsburgh, PA 15213\nryantibs@stat.cmu.edu\n\nAbstract\n\nWe consider the problem of estimating a function defined over n locations on a d-dimensional grid (having all side lengths equal to n^(1/d)). When the function is constrained to have discrete total variation bounded by Cn, we derive the minimax optimal (squared) ℓ2 estimation error rate, parametrized by n, Cn. Total variation denoising, also known as the fused lasso, is seen to be rate optimal. Several simpler estimators exist, such as Laplacian smoothing and Laplacian eigenmaps. A natural question is: can these simpler estimators perform just as well? We prove that these estimators, and more broadly all estimators given by linear transformations of the input data, are suboptimal over the class of functions with bounded variation. This extends fundamental findings of Donoho and Johnstone [12] on 1-dimensional total variation spaces to higher dimensions. The implication is that the computationally simpler methods cannot be used for such sophisticated denoising tasks, without sacrificing statistical accuracy. We also derive minimax rates for discrete Sobolev spaces over d-dimensional grids, which are, in some sense, smaller than the total variation function spaces. Indeed, these are small enough spaces that linear estimators can be optimal—and a few well-known ones are, such as Laplacian smoothing and Laplacian eigenmaps, as we show. Lastly, we investigate the adaptivity of the total variation denoiser to these smaller Sobolev function spaces.\n\n1 Introduction\n\nLet G = (V, E) be a d-dimensional grid graph, i.e., lattice graph, with equal side lengths. Label the nodes as V = {1, . . . , n}, and edges as E = {e1, . . . , em}. Consider data y = (y1, . . . 
, yn) ∈ Rn observed over the nodes, from a model\n\nyi ∼ N(θ0,i, σ²), i.i.d., for i = 1, . . . , n,   (1)\n\nwhere θ0 = (θ0,1, . . . , θ0,n) ∈ Rn is an unknown mean parameter to be estimated, and σ² > 0 is the marginal noise variance. It is assumed that θ0 displays some kind of regularity over the grid G, e.g., θ0 ∈ Td(Cn) for some Cn > 0, where\n\nTd(Cn) = {θ : ‖Dθ‖₁ ≤ Cn},   (2)\n\nand D ∈ Rm×n is the edge incidence matrix of G. This has ℓth row Dℓ = (0, . . . , −1, . . . , 1, . . . , 0), with a −1 in the ith location, and 1 in the jth location, provided that the ℓth edge is eℓ = (i, j) with i < j. Equivalently, L = D^T D is the graph Laplacian matrix of G, and thus\n\n‖Dθ‖₁ = Σ_{(i,j)∈E} |θi − θj|,   and   ‖Dθ‖₂² = θ^T L θ = Σ_{(i,j)∈E} (θi − θj)².\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nWe will refer to the class in (2) as a discrete total variation (TV) class, and to the quantity ‖Dθ0‖₁ as the discrete total variation of θ0, though for simplicity we will often drop the word “discrete”.\nThe problem of estimating θ0 given a total variation bound as in (2) is of great importance in both nonparametric statistics and signal processing, and has many applications, e.g., changepoint detection for 1d grids, and image denoising for 2d and 3d grids. There has been much methodological and computational work devoted to this problem, resulting in practically efficient estimators in dimensions 1, 2, 3, and beyond. However, theoretical performance, and in particular optimality, is only really well-understood in the 1-dimensional setting. 
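To make these objects concrete, the following sketch (our own illustrative code, not from the paper) builds the edge incidence matrix D and the Laplacian L = DᵀD of a d-dimensional grid, and evaluates the discrete TV seminorm ‖Dθ‖₁ and the Sobolev seminorm ‖Dθ‖₂² = θᵀLθ:

```python
import numpy as np
from itertools import product

def grid_incidence(side, d):
    """Edge incidence matrix D of a d-dimensional grid with the given side
    length. Nodes are ordered lexicographically by multi-index; each edge
    (i, j) with i < j contributes a row with -1 in column i and +1 in column j."""
    n = side ** d
    idx = {p: k for k, p in enumerate(product(range(side), repeat=d))}
    rows = []
    for p, i in idx.items():
        for axis in range(d):              # forward neighbor along each axis
            q = list(p)
            q[axis] += 1
            if q[axis] < side:
                row = np.zeros(n)
                row[i], row[idx[tuple(q)]] = -1.0, 1.0
                rows.append(row)
    return np.array(rows)

D = grid_incidence(4, 2)        # 2d grid: n = 16 nodes, m = 24 edges
L = D.T @ D                     # graph Laplacian
theta = np.random.randn(16)
tv = np.abs(D @ theta).sum()    # discrete total variation, ||D theta||_1
sobolev = theta @ L @ theta     # ||D theta||_2^2 = theta^T L theta
```

The diagonal of L recovers the node degrees (2 at the corners, up to 2d in the interior), and L annihilates constant vectors, which is why constant signals have zero TV.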
This paper seeks to change that, and offers theory in d dimensions that parallels more classical results known in the 1-dimensional case.\n\nEstimators under consideration. Central to our work is the total variation (TV) denoising or fused lasso estimator (e.g., [21, 25, 7, 15, 27, 23, 2]), defined by the convex optimization problem\n\nθ̂TV = argmin_{θ∈Rn} ‖y − θ‖₂² + λ‖Dθ‖₁,   (3)\n\nwhere λ ≥ 0 is a tuning parameter. Another pair of methods that we study carefully are Laplacian smoothing and Laplacian eigenmaps, which are most commonly seen in the context of clustering, dimensionality reduction, and semi-supervised learning, but are also useful tools for estimation in a regression setting like ours (e.g., [3, 4, 24, 30, 5, 22]). The Laplacian smoothing estimator is given by\n\nθ̂LS = argmin_{θ∈Rn} ‖y − θ‖₂² + λ‖Dθ‖₂²,   i.e.,   θ̂LS = (I + λL)⁻¹y,   (4)\n\nfor a tuning parameter λ ≥ 0, where in the second expression we have written θ̂LS in closed form (this is possible since it is the minimizer of a convex quadratic). For Laplacian eigenmaps, we must introduce the eigendecomposition of the graph Laplacian, L = VΣV^T, where Σ = diag(ρ1, . . . , ρn) with 0 = ρ1 < ρ2 ≤ . . . ≤ ρn, and where V = [V1, V2, . . . , Vn] ∈ Rn×n has orthonormal columns. The Laplacian eigenmaps estimator is\n\nθ̂LE = V[k]V[k]^T y,   where V[k] = [V1, V2, . . . , Vk] ∈ Rn×k,   (5)\n\nwhere now k ∈ {1, . . . , n} acts as a tuning parameter.\nLaplacian smoothing and Laplacian eigenmaps are appealing because they are (relatively) simple: they are just linear transformations of the data y. 
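Indeed, both can be written explicitly as θ̂ = Sy for a fixed matrix S; a minimal numpy sketch (ours, for illustration only) on a 1d chain makes this transparent:

```python
import numpy as np

# Small 1d chain (path graph): D is its edge incidence matrix.
n = 50
D = np.diff(np.eye(n), axis=0)           # rows of the form (..., -1, 1, ...)
L = D.T @ D                              # path-graph Laplacian

rng = np.random.default_rng(0)
y = np.linspace(0, 1, n) + 0.1 * rng.standard_normal(n)

# Laplacian smoothing (4), in closed form: theta = (I + lam*L)^{-1} y
lam = 5.0
S_ls = np.linalg.inv(np.eye(n) + lam * L)    # the linear smoother matrix S
theta_ls = S_ls @ y

# Laplacian eigenmaps (5): project y onto the k lowest-frequency eigenvectors
evals, V = np.linalg.eigh(L)                 # ascending; evals[0] = 0 (constants)
k = 5
S_le = V[:, :k] @ V[:, :k].T                 # an orthogonal projection matrix
theta_le = S_le @ y
```

Both smoother matrices fix the constant vector (since L1 = 0), so neither estimator biases the overall level of the signal.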
Indeed, as we are considering G to be a grid, both estimators in (4), (5) can be computed very quickly, in nearly O(n) time, since the columns of V here are discrete cosine transform (DCT) basis vectors when d = 1, or Kronecker products thereof, when d ≥ 2 (e.g., [9, 17, 20, 28]). The TV denoising estimator in (3), on the other hand, cannot be expressed in closed form, and is much more difficult to compute, especially when d ≥ 2, though several advances have been made over the years (see the references above, and in particular [2] for an efficient operator-splitting algorithm and nice literature survey). Importantly, these computational difficulties are often worth it: TV denoising often practically outperforms ℓ2-regularized estimators like Laplacian smoothing (and also Laplacian eigenmaps) in image denoising tasks, as it is able to better preserve sharp edges and object boundaries (this is now widely accepted; early references are, e.g., [1, 10, 8]). See Figure 1 for an example, using the often-studied “cameraman” image.\nIn the 1d setting, classical theory from nonparametric statistics draws a clear distinction between the performance of TV denoising and estimators like Laplacian smoothing and Laplacian eigenmaps. Perhaps surprisingly, this theory has not yet been fully developed in dimensions d ≥ 2. Arguably, the comparison between TV denoising and Laplacian smoothing and Laplacian eigenmaps is even more interesting in higher dimensions, because the computational gap between the methods is even larger (the former method being much more expensive, say in 2d and 3d, than the latter two). Shortly, we review the 1d theory, and what is known in d dimensions, for d ≥ 2. First, we introduce notation.\n\nNotation. 
For deterministic (nonrandom) sequences an, bn we write an = O(bn) to denote that an/bn is upper bounded for all n large enough, and an ≍ bn to denote that both an = O(bn) and an^(−1) = O(bn^(−1)). Also, for random sequences An, Bn, we write An = OP(Bn) to denote that An/Bn is bounded in probability. We abbreviate a ∧ b = min{a, b} and a ∨ b = max{a, b}. For an estimator θ̂ of the parameter θ0 in (1), we define its mean squared error (MSE) to be\n\nMSE(θ̂, θ0) = (1/n)‖θ̂ − θ0‖₂².\n\nNoisy image | Laplacian smoothing | TV denoising\nFigure 1: Comparison of Laplacian smoothing and TV denoising for the common “cameraman” image. TV denoising provides a more visually appealing result, and also achieves about a 35% reduction in MSE compared to Laplacian smoothing (MSE being measured to the original image). Both methods were tuned optimally.\n\nThe risk of θ̂ is the expectation of its MSE, and for a set K ⊆ Rn, we define the minimax risk and minimax linear risk to be\n\nR(K) = inf_{θ̂} sup_{θ0∈K} E[MSE(θ̂, θ0)]   and   RL(K) = inf_{θ̂ linear} sup_{θ0∈K} E[MSE(θ̂, θ0)],\n\nrespectively, where the infimum in the first expression is over all estimators θ̂, and in the second expression over all linear estimators θ̂, meaning that θ̂ = Sy for a matrix S ∈ Rn×n. We will also refer to linear estimators as linear smoothers. Note that both Laplacian smoothing in (4) and Laplacian eigenmaps in (5) are linear smoothers, but TV denoising in (3) is not. 
Lastly, in somewhat of an abuse of nomenclature, we will often call the parameter θ0 in (1) a function, and a set of possible values for θ0 as in (2) a function space; this comes from thinking of the components of θ0 as the evaluations of an underlying function over n locations on the grid. This embedding has no formal importance, but it is convenient notationally, and matches the notation in nonparametric statistics.\n\nReview: TV denoising in 1d. The classical nonparametric statistics literature [13, 12, 18] provides a more or less complete story for estimation under total variation constraints in 1d. See also [26] for a translation of these results to a setting more consistent (notationally) to that in the current paper. Assume that d = 1 and Cn = C > 0, a constant (not growing with n). The results in [12] imply that\n\nR(T1(C)) ≍ n^(−2/3).   (6)\n\nFurthermore, [18] proved that the TV denoiser θ̂TV in (3), with λ ≍ n^(1/3), satisfies\n\nMSE(θ̂TV, θ0) = OP(n^(−2/3)),   (7)\n\nfor all θ0 ∈ T1(C), and is thus minimax rate optimal over T1(C). (In assessing rates here and throughout, we do not distinguish between convergence in expectation versus convergence in probability.) Wavelet denoising, under various choices of wavelet bases, also achieves the minimax rate. However, many simpler estimators do not. To be more precise, it is shown in [12] that\n\nRL(T1(C)) ≍ n^(−1/2).   (8)\n\nTherefore, a substantial number of commonly used nonparametric estimators—such as running mean estimators, smoothing splines, kernel smoothing, Laplacian smoothing, and Laplacian eigenmaps, which are all linear smoothers—have a major deficiency when it comes to estimating functions of bounded variation. 
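For concreteness, the TV denoiser in (3) can be computed by a simple projected-gradient method on its dual (a minimal sketch of our own, not one of the specialized algorithms cited above): writing the problem as minimizing ½‖y − θ‖₂² + λ‖Dθ‖₁ (a rescaling of (3), absorbed into λ), the dual is to minimize ½‖y − Dᵀu‖₂² subject to ‖u‖∞ ≤ λ, with primal recovery θ̂ = y − Dᵀû.

```python
import numpy as np

def tv_denoise_dual_pg(y, D, lam, step, n_iter=5000):
    """Projected gradient descent on the dual of TV denoising:
    minimize 0.5*||y - D^T u||^2 subject to ||u||_inf <= lam,
    then recover the primal solution theta = y - D^T u."""
    u = np.zeros(D.shape[0])
    for _ in range(n_iter):
        grad = D @ (D.T @ u - y)                 # gradient of the dual objective
        u = np.clip(u - step * grad, -lam, lam)  # project onto the l_inf box
    return y - D.T @ u

# 1d chain example: noisy piecewise-constant signal
n = 20
D = np.diff(np.eye(n), axis=0)                   # edge incidence matrix of a path
rng = np.random.default_rng(1)
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)]) + 0.1 * rng.standard_normal(n)
theta_hat = tv_denoise_dual_pg(y, D, lam=0.5, step=0.25)
```

A step size of 1/(2 dmax) or smaller is safe, since the dual gradient is Lipschitz with constant ‖DDᵀ‖ = λmax(L) ≤ 2 dmax. Two sanity checks: with λ = 0 the method returns y itself, and as λ → ∞ the solution collapses to the constant signal ȳ1.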
Roughly speaking, they will require many more samples to estimate θ0 within the same degree of accuracy as an optimal method like TV or wavelet denoising (on the order of ε^(−1/2) times more samples to achieve an MSE of ε). Further theory and empirical examples (e.g., [11, 12, 26]) offer the following perspective: linear smoothers cannot cope with functions in T1(C) that have spatially inhomogeneous smoothness, i.e., that vary smoothly at some locations and vary wildly at others. Linear smoothers can only produce estimates that are smooth throughout, or wiggly throughout, but not a mix of the two. They can hence perform well over smaller, more homogeneous function classes like Sobolev or Hölder classes, but not larger ones like total variation classes (or more generally, Besov and Triebel classes), and for these, one must use more sophisticated, nonlinear techniques. A motivating question: does such a gap persist in higher dimensions, between optimal nonlinear and linear estimators, and if so, how big is it?\n\nReview: TV denoising in multiple dimensions. Recently, [29] established rates for TV denoising over various graph models, including grids, and [16] made improvements, particularly in the case of d-dimensional grids with d ≥ 2. We can combine Propositions 4 and 6 of [16] with Theorem 3 of [29] to give the following result: if d ≥ 2, and Cn is an arbitrary sequence (potentially unbounded with n), then the TV denoiser θ̂TV in (3) satisfies, over all θ0 ∈ Td(Cn),\n\nMSE(θ̂TV, θ0) = OP(Cn log n / n) for d = 2,   and   MSE(θ̂TV, θ0) = OP(Cn √(log n) / n) for d ≥ 3,   (9)\n\nwith λ ≍ log n for d = 2, and λ ≍ √(log n) for d ≥ 3. Note that, at first glance, this is a very different result from the 1d case. We expand on this next.\n\n2 Summary of results\n\nA gap in multiple dimensions. For estimation of θ0 in (1) when d ≥ 2, consider, e.g., the simplest possible linear smoother: the mean estimator, θ̂mean = ȳ1 (where 1 = (1, . . . , 1) ∈ Rn, the vector of all 1s). Lemma 4, given below, implies that over θ0 ∈ Td(Cn), the MSE of the mean estimator is bounded in probability by Cn² log n / n for d = 2, and Cn²/n for d ≥ 3. Compare this to (9). When Cn = C > 0 is a constant, i.e., when the TV of θ0 is assumed to be bounded (which is assumed for the 1d results in (6), (7), (8)), this means that the TV denoiser and the mean estimator converge to θ0 at the same rate, basically (ignoring log terms), the “parametric rate” of 1/n, for estimating a finite-dimensional parameter! That TV denoising and such a trivial linear smoother perform comparably over 2d and 3d grids could not be farther from the story in 1d, where TV denoising is separated by an unbridgeable gap from all linear smoothers, as shown in (6), (7), (8).\nOur results in Section 3 clarify this conundrum, and can be summarized by three points.\n\n• We argue in Section 3.1 that there is a proper “canonical” scaling for the TV class defined in (2). E.g., when d = 1, this yields Cn ≍ 1, a constant, but when d = 2, this yields Cn ≍ √n, and Cn also diverges with n for all d ≥ 3. Sticking with d = 2 as an interesting example, we see that under such a scaling, the MSE rates achieved by TV denoising and the mean estimator, respectively, are drastically different; ignoring log terms, these are\n\nCn/n ≍ 1/√n   and   Cn²/n ≍ 1,   (10)\n\nrespectively. Hence, TV denoising has an MSE rate of 1/√n, in a setting where the mean estimator has a constant rate, i.e., a setting where it is not even known to be consistent.\n\n• We show in Section 3.3 that our choice to study the mean estimator here is not somehow “unlucky” (it is not a particularly bad linear smoother, nor is the upper bound on its MSE loose): the minimax linear risk over Td(Cn) is on the order Cn²/n, for all d ≥ 2. Thus, even the best linear smoothers have the same poor performance as the mean over Td(Cn).\n\n• We show in Section 3.2 that the TV estimator is (essentially) minimax optimal over Td(Cn), as the minimax risk over this class scales as Cn/n (ignoring log terms).\n\nTo summarize, these results reveal a significant gap between linear smoothers and optimal estimators like TV denoising, for estimation over Td(Cn) in d dimensions, with d ≥ 2, as long as Cn scales appropriately. Roughly speaking, the TV classes encompass a challenging setting for estimation because they are very broad, containing a wide array of functions—both globally smooth functions, said to have homogeneous smoothness, and functions with vastly different levels of smoothness at different grid locations, said to have heterogeneous smoothness. Linear smoothers cannot handle heterogeneous smoothness, and only nonlinear methods can enjoy good estimation properties over the entirety of Td(Cn). To reiterate, a telling example is d = 2 with the canonical scaling Cn ≍ √n, where we see that TV denoising achieves the optimal 1/√n rate (up to log factors); meanwhile, the best linear smoothers have max risk that is constant over T2(√n). See Figure 2 for an illustration.\n\nMinimax rates over smaller function spaces, and adaptivity. 
Sections 4 and 5 are focused on different function spaces, discrete Sobolev spaces, which are ℓ2 analogs of discrete TV spaces as we have defined them in (2). Under the canonical scaling of Section 3.1, Sobolev spaces are contained in TV spaces, and the former can be roughly thought of as containing functions of more homogeneous smoothness. The story now is more optimistic for linear smoothers, and the following is a summary.\n\nTrivial scaling, Cn ≍ 1 | Canonical scaling, Cn ≍ √n\nFigure 2: MSE curves for estimation over a 2d grid, under two very different scalings of Cn: constant and √n. The parameter θ0 was a “one-hot” signal, with all but one component equal to 0. For each n, the results were averaged over 5 repetitions, and Laplacian smoothing and TV denoising were tuned for optimal average MSE.\n\n• In Section 4, we derive minimax rates for Sobolev spaces, and prove that linear smoothers—in particular, Laplacian smoothing and Laplacian eigenmaps—are optimal over these spaces.\n\n• In Section 5, we discuss an interesting phenomenon, a phase transition of sorts, at d = 3 dimensions. When d = 1 or 2, the minimax rates for a TV space and its inscribed Sobolev space match; when d ≥ 3, they do not, and the inscribed Sobolev space has a faster minimax rate. Aside from being an interesting statement about the TV and Sobolev function spaces in high dimensions, this raises an important question of adaptivity over the smaller Sobolev function spaces. As the minimax rates match for d = 1 and 2, any method optimal over TV spaces in these dimensions, such as TV denoising, is automatically optimal over the inscribed Sobolev spaces. But the question remains open for d ≥ 3—does, e.g., TV denoising adapt to the faster minimax rate over Sobolev spaces? 
We present empirical evidence to suggest that this may be true, and leave a formal study to future work.\n\nOther considerations and extensions. There are many problems related to the one that we study in this paper. Clearly, minimax rates for the TV and Sobolev classes over general graphs, not just d-dimensional grids, are of interest. Our minimax lower bounds for TV classes actually apply to generic graphs with bounded max degree, though it is unclear to what extent they are sharp beyond grids; a detailed study will be left to future work. Another related topic is that of higher-order smoothness classes, e.g., classes containing functions whose derivatives are of bounded variation. The natural extension of TV denoising here is called trend filtering, defined via the regularization of discrete higher-order derivatives. In the 1d setting, minimax rates, the optimality of trend filtering, and the suboptimality of linear smoothers are already well-understood [26]. Trend filtering has been defined and studied to some extent on general graphs [29], but no notions of optimality have been investigated beyond 1d. This will also be left to future work. Lastly, it is worth mentioning that there are other estimators (i.e., other than the ones we study in detail) that attain or nearly attain minimax rates over various classes we consider in this paper. 
E.g., wavelet denoising is known to be optimal over TV classes in 1d [12]; and comparing recent upper bounds from [19, 16] with the lower bounds in this work, we see that wavelet denoising is also nearly minimax in 2d (ignoring log terms).\n\n3 Analysis over TV classes\n\n3.1 Canonical scalings for TV and Sobolev classes\n\nWe start by establishing what we call a “canonical” scaling for the radius Cn of the TV ball Td(Cn) in (2), as well as the radius C'n of the Sobolev ball Sd(C'n), defined as\n\nSd(C'n) = {θ : ‖Dθ‖₂ ≤ C'n}.   (11)\n\n[Figure 2 legends: under the trivial scaling, fitted slopes −0.88 (TV denoising), −0.99 (Laplacian smoothing), −1.01 (mean estimator), against the trivial rate n^(−1); under the canonical scaling, fitted slopes −0.84, −0.01, 0.00, against the minimax rate n^(−1/2).]\n\nProper scalings for Cn, C'n will be critical for properly interpreting our new results in d dimensions, in a way that is comparable to known results for d = 1 (which are usually stated in terms of the 1d scalings Cn ≍ 1, C'n ≍ 1/√n). To study (2), (11), it helps to introduce a third function space,\n\nHd(1) = {θ : θi = f(i1/ℓ, . . . , id/ℓ), i = 1, . . . , n, for some f ∈ Hd^cont(1)}.   (12)\n\nAbove, we have mapped each location i on the grid to a multi-index (i1, . . . , id) ∈ {1, . . . , ℓ}^d, where ℓ = n^(1/d), and Hd^cont(1) denotes the (usual) continuous Hölder space on [0, 1]^d, i.e., functions that are 1-Lipschitz with respect to the ℓ∞ norm. 
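The scalings that Lemma 1 below attaches to these spaces can be checked numerically. For d = 2 (helper code ours, purely illustrative): a 1-Lipschitz function sampled on the grid has ‖Dθ‖₂ = O(1) = O(n^(1/2−1/d)), and Cauchy–Schwarz gives ‖Dθ‖₁ ≤ √m ‖Dθ‖₂ with m ≍ dn edges, hence ‖Dθ‖₁ = O(√n) = O(n^(1−1/d)):

```python
import numpy as np

def incidence_2d(side):
    """Edge incidence matrix of a side-by-side 2d grid (lexicographic order)."""
    n = side * side
    rows = []
    for i in range(side):
        for j in range(side):
            for a, b in [(i + 1, j), (i, j + 1)]:   # forward neighbors
                if a < side and b < side:
                    r = np.zeros(n)
                    r[i * side + j], r[a * side + b] = -1.0, 1.0
                    rows.append(r)
    return np.array(rows)

side = 32
n = side * side
D = incidence_2d(side)
m = D.shape[0]                       # m = 2*side*(side-1), i.e. m ~ d*n edges

# f(x1, x2) = (x1 + x2)/2 is 1-Lipschitz in the l_inf norm on [0, 1]^2
theta0 = np.array([(i + j) / (2.0 * side) for i in range(side) for j in range(side)])
tv = np.abs(D @ theta0).sum()        # ~ n^(1-1/d) = sqrt(n) when d = 2
sob = np.linalg.norm(D @ theta0)     # ~ n^(1/2-1/d) = O(1) when d = 2
```

For this particular signal every edge difference has the same magnitude, so the Cauchy–Schwarz step is tight, which is exactly the regime that forces the TV radius to grow by the extra √n factor.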
We seek an embedding that is analogous to the embedding of continuous Hölder, Sobolev, and total variation spaces in 1d functional analysis, namely,\n\nHd(1) ⊆ Sd(C'n) ⊆ Td(Cn).   (13)\n\nOur first lemma provides a choice of Cn, C'n that makes the above true. Its proof, as with all proofs in this paper, can be found in the supplementary document.\n\nLemma 1. For d ≥ 1, the embedding in (13) holds with choices Cn ≍ n^(1−1/d) and C'n ≍ n^(1/2−1/d). Such choices are called the canonical scalings for the function classes in (2), (11).\n\nAs a sanity check, both the (usual) continuous Hölder and Sobolev function spaces in d dimensions are known to have minimax risks that scale as n^(−2/(2+d)), in a standard nonparametric regression setup (e.g., [14]). Under the canonical scaling C'n ≍ n^(1/2−1/d), our results in Section 4 show that the discrete Sobolev class Sd(n^(1/2−1/d)) also admits a minimax rate of n^(−2/(2+d)).\n\n3.2 Minimax rates over TV classes\n\nThe following is a lower bound for the minimax risk of the TV class Td(Cn) in (2).\n\nTheorem 2. Assume n ≥ 2, and denote dmax = 2d. Then, for constants c > 0, ρ1 ∈ (2.34, 2.35),\n\nR(Td(Cn)) ≥ c · (σCn/(dmax n)) √(1 + log(σ dmax n/Cn))   if Cn ∈ [σ dmax √(log n), σ dmax n/√ρ1],\nR(Td(Cn)) ≥ c · (Cn²/(dmax² n) ∨ σ²/n)   if Cn < σ dmax √(log n),\nR(Td(Cn)) ≥ c · σ²/ρ1   if Cn > σ dmax n/√ρ1.   (14)\n\nThe proof uses a simplifying reduction of the TV class, via Td(Cn) ⊇ B1(Cn/dmax), the latter set denoting the ℓ1 ball of radius Cn/dmax in Rn. 
It then invokes a sharp characterization of the minimax risk in normal means problems over ℓp balls due to [6]. Several remarks are in order.\n\nRemark 1. The first line on the right-hand side in (14) often provides the most useful lower bound. To see this, recall that under the canonical scaling for TV classes, we have Cn = n^(1−1/d). For all d ≥ 2, this certainly implies Cn ∈ [σ dmax √(log n), σ dmax n/√ρ1], for large n.\n\nRemark 2. Even though its construction is very simple, the lower bound on the minimax risk in (14) is sharp or nearly sharp in many interesting cases. Assume that Cn ∈ [σ dmax √(log n), σ dmax n/√ρ1]. The lower bound rate is Cn √(log(n/Cn))/n. When d = 2, we see that this is very close to the upper bound rate of Cn log n / n achieved by the TV denoiser, as stated in (9). These two differ by at most a log n factor (achieved when Cn ≍ n). When d ≥ 3, we see that the lower bound rate is even closer to the upper bound rate of Cn √(log n)/n achieved by the TV denoiser, as in (9). These two now differ by at most a √(log n) factor (again achieved when Cn ≍ n). We hence conclude that the TV denoiser is essentially minimax optimal in all dimensions d ≥ 2.\n\nRemark 3. When d = 1, and (say) Cn ≍ 1, the lower bound rate of √(log n)/n given by Theorem 2 is not sharp; we know from [12] (recall (6)) that the minimax rate over T1(1) is n^(−2/3). The result in the theorem (and also Theorem 3) in fact holds more generally, beyond grids: for an arbitrary graph G, its edge incidence matrix D, and Td(Cn) as defined in (2), the result holds for dmax equal to the max degree of G. It is unclear to what extent this is sharp, for different graph models.\n\n3.3 Minimax linear rates over TV classes\n\nWe now turn to a lower bound on the minimax linear risk of the TV class Td(Cn) in (2).\n\nTheorem 3. Recall the notation dmax = 2d. 
Then\n\nRL(Td(Cn)) ≥ (σ²Cn²/(Cn² + σ²dmax²n)) ∨ σ²/n ≥ (1/2)(Cn²/(dmax²n) ∧ σ²) ∨ σ²/n.   (15)\n\nThe proof relies on an elegant meta-theorem on minimax rates from [13], which uses the concept of a “quadratically convex” set, whose minimax linear risk is the same as that of its hardest rectangular subproblem. An alternative proof can be given entirely from first principles.\n\nRemark 4. When Cn grows with n, but not too fast (scales as √n, at most), the lower bound rate in (15) will be Cn²/n. Compared to the Cn/n minimax rate from Theorem 2 (ignoring log terms), we see a clear gap between optimal nonlinear and linear estimators. In fact, under the canonical scaling Cn ≍ n^(1−1/d), for any d ≥ 2, this gap is seemingly huge: the lower bound for the minimax linear rate will be a constant, whereas the minimax rate from Theorem 2 (ignoring log terms) will be n^(−1/d).\n\nWe now show that the lower bound in Theorem 3 is essentially tight, and remarkably, it is certified by analyzing two trivial linear estimators: the mean estimator and the identity estimator.\n\nLemma 4. Let Mn denote the largest column norm of D†. For the mean estimator θ̂mean = ȳ1,\n\nsup_{θ0∈Td(Cn)} E[MSE(θ̂mean, θ0)] ≤ (σ² + Cn²Mn²)/n.\n\nFrom Proposition 4 in [16], we have Mn = O(√(log n)) when d = 2 and Mn = O(1) when d ≥ 3. The risk of the identity estimator θ̂id = y is clearly σ². Combining this logic with Lemma 4 gives the upper bound RL(Td(Cn)) ≤ ((σ² + Cn²Mn²)/n) ∧ σ². Comparing this with the lower bound described in Remark 4, we see that the two rates basically match, modulo the Mn² factor in the upper bound, which only provides an extra log n factor when d = 2. 
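The deterministic step behind Lemma 4 can also be verified numerically (a sketch on a small example of our own, not the paper's proof): for a connected graph, θ − θ̄1 = D†(Dθ), and ‖D†v‖₂ ≤ Mn‖v‖₁, so the bias of the mean estimator over Td(Cn) is at most MnCn:

```python
import numpy as np

n = 30
D = np.diff(np.eye(n), axis=0)               # 1d chain, for simplicity
D_pinv = np.linalg.pinv(D)                   # D^dagger
M_n = np.linalg.norm(D_pinv, axis=0).max()   # largest column norm of D^dagger

rng = np.random.default_rng(0)
theta = rng.standard_normal(n)
centered = theta - theta.mean()              # theta - theta_bar * 1
```

Here D†D is the orthogonal projection onto the row space of D, which for a connected graph is the orthogonal complement of the constant vector; the ℓ1-to-ℓ2 operator bound then follows from the triangle inequality over the columns of D†.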
The takeaway message: in the sense of max risk, the best linear smoother does not perform much better than the trivial estimators.\nAdditional empirical experiments, similar to those shown in Figure 2, are given in the supplement.\n\n4 Analysis over Sobolev classes\n\nOur first result here is a lower bound on the minimax risk of the Sobolev class Sd(C'n) in (11).\n\nTheorem 5. For a universal constant c > 0,\n\nR(Sd(C'n)) ≥ (c/n)((nσ²)^(2/(d+2)) (C'n)^(2d/(d+2)) ∧ nσ² ∧ n^(2/d)(C'n)²) + σ²/n.\n\nElegant tools for minimax analysis from [13], which leverage the fact that the ellipsoid Sd(C'n) is orthosymmetric and quadratically convex (after a rotation), are used to prove the result.\nThe next theorem gives upper bounds, certifying that the above lower bound is tight, and showing that Laplacian eigenmaps and Laplacian smoothing, both linear smoothers, are optimal over Sd(C'n).\n\nTheorem 6. For Laplacian eigenmaps, θ̂LE in (5), with k ≍ ((n(C'n)^d)^(2/(d+2)) ∨ 1) ∧ n, we have\n\nsup_{θ0∈Sd(C'n)} E[MSE(θ̂LE, θ0)] ≤ (c/n)((nσ²)^(2/(d+2)) (C'n)^(2d/(d+2)) ∧ nσ² ∧ n^(2/d)(C'n)²) + cσ²/n,\n\nfor a universal constant c > 0, and n large enough. When d = 1, 2, or 3, the same bound holds for Laplacian smoothing θ̂LS in (4), with λ ≍ (n/(C'n)²)^(2/(d+2)) (and a possibly different constant c).\n\n5 A phase transition, and adaptivity\n\nThe TV and Sobolev classes in (2) and (11), respectively, display a curious relationship. We reflect on Theorems 2 and 5, using, for concreteness, the canonical scalings Cn ≍ n^(1−1/d) and C'n ≍ n^(1/2−1/d) (that, recall, guarantee Sd(C'n) ⊆ Td(Cn)). 
When d = 1, both the TV and Sobolev classes have a minimax rate of n^(−2/3) (this TV result is actually due to [12], as stated in (6), not Theorem 2). When d = 2, both the TV and Sobolev classes again have the same minimax rate of n^(−1/2), the caveat being that the rate for the TV class has an extra √(log n) factor. But for all d ≥ 3, the rates for the canonical TV and Sobolev classes differ, and the smaller Sobolev spaces have faster rates than their inscribing TV spaces. This may be viewed as a phase transition at d = 3; see Table 1.\n\nFunction class | Dimension 1 | Dimension 2 | Dimension d ≥ 3\nTV ball Td(n^(1−1/d)) | n^(−2/3) | n^(−1/2) √(log n) | n^(−1/d) √(log n)\nSobolev ball Sd(n^(1/2−1/d)) | n^(−2/3) | n^(−1/2) | n^(−2/(2+d))\nTable 1: Summary of rates for canonically-scaled TV and Sobolev spaces.\n\nLinear signal in 2d | Linear signal in 3d\nFigure 3: MSE curves for estimating a “linear” signal, a very smooth signal, over 2d and 3d grids. For each n, the results were averaged over 5 repetitions, and Laplacian smoothing and TV denoising were tuned for best average MSE performance. The signal was set to satisfy ‖Dθ0‖₂ ≍ n^(1/2−1/d), matching the canonical scaling.\n\nWe may paraphrase to say that 2d is just like 1d, in that expanding the Sobolev ball into a larger TV ball does not hurt the minimax rate, and methods like TV denoising are automatically adaptive, i.e., optimal over both the bigger and smaller classes. However, as soon as we enter the 3d world, it is no longer clear whether TV denoising can adapt to the smaller, inscribed Sobolev ball, whose minimax rate is faster, n^(−2/5) versus n^(−1/3) (ignoring log factors). 
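The exponents in Table 1 can be tabulated directly; the following small helper (ours, with log factors dropped) makes the phase transition explicit:

```python
from fractions import Fraction

def minimax_exponents(d):
    """Rate exponents (powers of n, log factors ignored) for the canonically
    scaled TV ball and its inscribed Sobolev ball, per Table 1."""
    tv = Fraction(-2, 3) if d == 1 else Fraction(-1, d)
    sobolev = Fraction(-2, 2 + d)
    return tv, sobolev

# d = 1, 2: the two exponents coincide; d >= 3: the Sobolev exponent is
# strictly more negative, i.e. the inscribed Sobolev ball has a faster rate.
table = {d: minimax_exponents(d) for d in range(1, 6)}
```

For d = 3 this reproduces the n^(−2/5) versus n^(−1/3) comparison discussed above.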
Theoretically, this is an interesting open problem that we do not approach in this paper and leave to future work.

We do, however, investigate the matter empirically: see Figure 3, where we run Laplacian smoothing and TV denoising on a highly smooth "linear" signal $\theta_0$. This is constructed so that each component $\theta_i$ is proportional to $i_1 + i_2 + \cdots + i_d$ (using the multi-index notation $(i_1, \ldots, i_d)$ of (12) for grid location $i$), and the Sobolev norm is $\|D\theta_0\|_2 \asymp n^{1/2-1/d}$. Arguably, these are among the "hardest" types of functions for TV denoising to handle. The left panel, in 2d, is a case in which we know that TV denoising attains the minimax rate; the right panel, in 3d, is a case in which we do not, though empirically, TV denoising surely seems to be doing better than the slower minimax rate of $n^{-1/3}$ (ignoring log terms) that is associated with the larger TV ball.

Even if TV denoising is shown to be minimax optimal over the inscribed Sobolev balls when $d \geq 3$, note that this does not necessarily mean that we should scrap Laplacian smoothing in favor of TV denoising, in all problems. Laplacian smoothing is the unique Bayes estimator in a normal means model under a certain Markov random field prior (e.g., [22]); statistical decision theory therefore tells us that it is admissible, i.e., no other estimator, TV denoising included, can uniformly dominate it.

6 Discussion

We conclude with a quote from Albert Einstein: "Everything should be made as simple as possible, but no simpler". In characterizing the minimax rates for TV classes, defined over d-dimensional grids, we have shown that simple methods like Laplacian smoothing and Laplacian eigenmaps, and in fact all linear estimators, must be passed up in favor of more sophisticated, nonlinear estimators, like TV denoising, if one wants to attain the optimal max risk.
Such a result was previously known when $d = 1$; our work has extended it to all dimensions $d \geq 2$. We also characterized the minimax rates over discrete Sobolev classes, revealing an interesting phase transition where the optimal rates over TV and Sobolev spaces, suitably scaled, match when $d = 1$ and 2 but diverge for $d \geq 3$. It is an open question as to whether an estimator like TV denoising can be optimal over both spaces, for all $d$.

Acknowledgements. We thank Jan-Christian Hutter and Philippe Rigollet, whose paper [16] inspired us to think carefully about problem scalings (i.e., radii of TV and Sobolev classes) in the first place. YW was supported by NSF Award BCS-0941518 to CMU Statistics, a grant by Singapore NRF under its International Research Centre @ Singapore Funding Initiative, and a Baidu Scholarship. RT was supported by NSF Grants DMS-1309174 and DMS-1554123.

References

[1] Robert Acar and Curtis R. Vogel. Analysis of total variation penalty methods. Inverse Problems, 10:1217–1229, 1994.
[2] Alvaro Barbero and Suvrit Sra. Modular proximal optimization for multidimensional total-variation regularization. arXiv: 1411.0589, 2014.
[3] Mikhail Belkin and Partha Niyogi. Using manifold structure for partially labelled classification. Advances in Neural Information Processing Systems, 15, 2002.
[4] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[5] Mikhail Belkin and Partha Niyogi.
Towards a theoretical foundation for Laplacian-based manifold methods. Conference on Learning Theory (COLT-05), 18, 2005.
[6] Lucien Birge and Pascal Massart. Gaussian model selection. Journal of the European Mathematical Society, 3(3):203–268, 2001.
[7] Antonin Chambolle and Jerome Darbon. On total variation minimization and surface evolution using parametric maximum flows. International Journal of Computer Vision, 84:288–307, 2009.
[8] Antonin Chambolle and Pierre-Louis Lions. Image recovery via total variation minimization and related problems. Numerische Mathematik, 76(2):167–188, 1997.
[9] Samuel Conte and Carl de Boor. Elementary Numerical Analysis: An Algorithmic Approach. McGraw-Hill, New York, 1980. International Series in Pure and Applied Mathematics.
[10] David Dobson and Fadil Santosa. Recovery of blocky images from noisy and blurred data. SIAM Journal on Applied Mathematics, 56(4):1181–1198, 1996.
[11] David Donoho and Iain Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
[12] David Donoho and Iain Johnstone. Minimax estimation via wavelet shrinkage. Annals of Statistics, 26(8):879–921, 1998.
[13] David Donoho, Richard Liu, and Brenda MacGibbon. Minimax risk over hyperrectangles, and implications. Annals of Statistics, 18(3):1416–1437, 1990.
[14] Laszlo Gyorfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, New York, 2002.
[15] Holger Hoefling. A path algorithm for the fused lasso signal approximator. Journal of Computational and Graphical Statistics, 19(4):984–1006, 2010.
[16] Jan-Christian Hutter and Philippe Rigollet. Optimal rates for total variation denoising. In Conference on Learning Theory (COLT-16), 2016. To appear.
[17] Hans Kunsch. Robust priors for smoothing and image restoration. Annals of the Institute of Statistical Mathematics, 46(1):1–19, 1994.
[18] Enno Mammen and Sara van de Geer. Locally adaptive regression splines. Annals of Statistics, 25(1):387–413, 1997.
[19] Deanna Needell and Rachel Ward. Stable image reconstruction using total variation minimization. SIAM Journal on Imaging Sciences, 6(2):1035–1058, 2013.
[20] Michael Ng, Raymond Chan, and Wun-Cheung Tang. A fast algorithm for deblurring models with Neumann boundary conditions. SIAM Journal on Scientific Computing, 21(3):851–866, 1999.
[21] Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60:259–268, 1992.
[22] James Sharpnack and Aarti Singh. Identifying graph-structured activation patterns in networks. Advances in Neural Information Processing Systems, 13, 2010.
[23] James Sharpnack, Alessandro Rinaldo, and Aarti Singh. Sparsistency of the edge lasso over graphs. Proceedings of the International Conference on Artificial Intelligence and Statistics, 15:1028–1036, 2012.
[24] Alexander Smola and Risi Kondor. Kernels and regularization on graphs. Proceedings of the Annual Conference on Learning Theory, 16, 2003.
[25] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B, 67(1):91–108, 2005.
[26] Ryan J. Tibshirani. Adaptive piecewise polynomial estimation via trend filtering. Annals of Statistics, 42(1):285–323, 2014.
[27] Ryan J. Tibshirani and Jonathan Taylor. The solution path of the generalized lasso. Annals of Statistics, 39(3):1335–1371, 2011.
[28] Yilun Wang, Junfeng Yang, Wotao Yin, and Yin Zhang. A new alternating minimization algorithm for total variation image reconstruction.
SIAM Journal on Imaging Sciences, 1(3):248–272, 2008.
[29] Yu-Xiang Wang, James Sharpnack, Alex Smola, and Ryan J. Tibshirani. Trend filtering on graphs. Journal of Machine Learning Research, 2016. To appear.
[30] Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. International Conference on Machine Learning (ICML-03), 20, 2003.