{"title": "Time--Data Tradeoffs by Aggressive Smoothing", "book": "Advances in Neural Information Processing Systems", "page_first": 1664, "page_last": 1672, "abstract": "This paper proposes a tradeoff between sample complexity and computation time that applies to statistical estimators based on convex optimization. As the amount of data increases, we can smooth optimization problems more and more aggressively to achieve accurate estimates more quickly. This work provides theoretical and experimental evidence of this tradeoff for a class of regularized linear inverse problems.", "full_text": "Time\u2013Data Tradeo\ufb00s by Aggressive Smoothing\n\nJohn J. Bruer1,*\n\nJoel A. Tropp1\n\nVolkan Cevher2\n\nStephen R. Becker3\n\n1Dept. of Computing + Mathematical Sciences, California Institute of Technology\n\n2Laboratory for Information and Inference Systems, EPFL\n\n3Dept. of Applied Mathematics, University of Colorado at Boulder\n\n*jbruer@cms.caltech.edu\n\nAbstract\n\nThis paper proposes a tradeo\ufb00 between sample complexity and computation time\nthat applies to statistical estimators based on convex optimization. As the amount of\ndata increases, we can smooth optimization problems more and more aggressively\nto achieve accurate estimates more quickly. This work provides theoretical and\nexperimental evidence of this tradeo\ufb00 for a class of regularized linear inverse\nproblems.\n\n1\n\nIntroduction\n\nIt once seemed obvious that the running time of an algorithm should increase with the size of the input.\nBut recent work in machine learning has led us to question this dogma. In particular, Shalev-Shwartz\nand Srebro [1] showed that their algorithm for learning a support vector classi\ufb01er actually becomes\nfaster when they increase the amount of training data. Other researchers have identi\ufb01ed related\ntradeo\ufb00s [2, 3, 4, 5, 6, 7, 8, 9]. 
Together, these works support an emerging perspective in statistical computation that treats data as a computational resource that we can exploit to improve algorithms for estimation and learning.

In this paper, we consider statistical algorithms based on convex optimization. Our primary contribution is the following proposal:

As the amount of available data increases, we can smooth statistical optimization problems more and more aggressively. We can solve the smoothed problems significantly faster without any increase in statistical risk.

Indeed, many statistical estimation procedures balance the modeling error with the complexity of the model. When we have very little data, complexity regularization is essential to fit an accurate model. When we have a large amount of data, we can relax the regularization without compromising the quality of the model. In other words, excess data offers us an opportunity to accelerate the statistical optimization. We propose to use smoothing methods [10, 11, 12] to implement this tradeoff.

We develop this idea in the context of the regularized linear inverse problem (RLIP) with random data. Nevertheless, our ideas apply to a wide range of problems. We pursue a more sophisticated example in a longer version of this work [13].

JJB's and JAT's work was supported under ONR award N00014-11-1002, AFOSR award FA9550-09-1-0643, and a Sloan Research Fellowship. VC's work was supported in part by the European Commission under Grant MIRG-268398, ERC Future Proof, SNF 200021-132548, SNF 200021-146750 and SNF CRSII2-147633. SRB was previously with IBM Research, Yorktown Heights, NY 10598 during the completion of this work.

1.1 The regularized linear inverse problem

Let x♮ ∈ R^d be an unknown signal, and let A ∈ R^{m×d} be a known measurement matrix.
Assume that we have access to a vector b ∈ R^m of m linear samples of that signal given by

    b := A x♮.

Given the pair (A, b), we wish to recover the original signal x♮.

We consider the case where A is fat (m < d), so we cannot recover x♮ without additional information about its structure. Let us introduce a proper convex function f : R^d → R ∪ {+∞} that assigns small values to highly structured signals. Using the regularizer f, we construct the estimator

    x̂ := arg min_x f(x)  subject to  Ax = b.    (1)

We declare the estimator successful when x̂ = x♮, and we refer to this outcome as exact recovery.

While others have studied (1) in the statistical setting, our result is different in character from previous work. Agarwal, Negahban, and Wainwright [14] showed that gradient methods applied to problems like (1) converge in fewer iterations due to increasing restricted strong convexity and restricted smoothness as sample size increases. They did not, however, discuss a time–data tradeoff explicitly, nor did they recognize that the overall computational cost may rise as the problem sizes grow.

Lai and Yin [15], meanwhile, proposed relaxing the regularizer in (1) based solely on some norm of the underlying signal. Our relaxation, however, is based on the sample size as well. Our method results in better performance as sample size increases: a time–data tradeoff.

The RLIP (1) provides a good candidate for studying time–data tradeoffs because recent work in convex geometry [16] gives a precise characterization of the number of samples needed for exact recovery. Excess samples allow us to replace the optimization problem (1) with one that we can solve faster.
We do this for sparse vector and low-rank matrix recovery problems in Sections 4 and 5.

2 The geometry of the time–data tradeoff

In this section, we summarize the relevant results that describe the minimum sample size required to solve the regularized linear inverse problem (1) exactly in a statistical setting.

2.1 The exact recovery condition and statistical dimension

We can state the optimality condition for (1) in a geometric form; cf. [17, Prop. 2.1].

Fact 2.1 (Exact recovery condition). The descent cone of a proper convex function f : R^d → R ∪ {+∞} at the point x is the convex cone

    D(f; x) := ⋃_{τ>0} { y ∈ R^d : f(x + τy) ≤ f(x) }.

The regularized linear inverse problem (1) exactly recovers the unknown signal x♮ if and only if

    D(f; x♮) ∩ null(A) = {0}.    (2)

We illustrate this condition in Figure 1(a).

To determine the number of samples we need to ensure that the exact recovery condition (2) holds, we must quantify the "size" of the descent cones of the regularizer f.

Definition 2.2 (Statistical dimension [16, Def. 2.1]). Let C ⊆ R^d be a convex cone. Its statistical dimension δ(C) is defined as

    δ(C) := E[ ‖Π_C(g)‖² ],

where g ∈ R^d has independent standard Gaussian entries, and Π_C is the projection operator onto C.

When the measurement matrix A is sufficiently random, Amelunxen et al. [16] obtain a precise characterization of the number m of samples required to achieve exact recovery.

Figure 1: The geometric opportunity. Panel (a) illustrates the exact recovery condition (2). Panel (b) shows a relaxed regularizer f̃ with larger sublevel sets. The shaded area indicates the difference between the descent cones of f̃ and f at x♮.
When we have excess samples, Fact 2.3 tells us that the exact recovery condition holds with high probability, as in panel (a). A suitable relaxation will maintain exact recovery, as in panel (b), while allowing us to solve the problem faster.

Fact 2.3 (Exact recovery condition for the random RLIP [16, Thm. II]). Assume that the null space of the measurement matrix A ∈ R^{m×d} in the RLIP (1) is oriented uniformly at random. (In particular, a matrix with independent standard Gaussian entries has this property.) Then

    m ≥ δ(D(f; x♮)) + C_η √d  ⟹  exact recovery holds with probability ≥ 1 − η;
    m ≤ δ(D(f; x♮)) − C_η √d  ⟹  exact recovery holds with probability ≤ η,

where C_η := √(8 log(4/η)).

In words, the RLIP undergoes a phase transition when the number m of samples equals δ(D(f; x♮)). Any additional samples are redundant, so we can try to exploit them to identify x♮ more quickly.

2.2 A geometric opportunity

Chandrasekaran and Jordan [6] have identified a time–data tradeoff in the setting of denoising problems based on Euclidean projection onto a constraint set. They argue that, when they have a large number of samples, it is possible to enlarge the constraint set without increasing the statistical risk of the estimator. They propose to use a discrete sequence of relaxations based on algebraic hierarchies.

We have identified a related opportunity for a time–data tradeoff in the RLIP (1). When we have excess samples, we may replace the regularizer f with a relaxed regularizer f̃ that is easier to optimize. In contrast to [6], we propose to use a continuous sequence of relaxations based on smoothing.

Figure 1 illustrates the geometry of our time–data tradeoff.
When the number of samples exceeds δ(D(f; x♮)), Fact 2.3 tells us that the situation shown in Figure 1(a) holds with high probability. This allows us to enlarge the sublevel sets of the regularizer while still satisfying the exact recovery condition, as shown in Figure 1(b). A suitable relaxation allows us to solve the problem faster. Our geometric motivation is similar to that of [6], although our relaxation method is entirely different.

3 A time–data tradeoff via dual-smoothing

This section presents an algorithm that can exploit excess samples to solve the RLIP (1) faster.

3.1 The dual-smoothing procedure

The procedure we use applies Nesterov's primal-smoothing method from [11] to the dual problem; see [12]. Given a regularizer f, we introduce a family {f_μ : μ > 0} of strongly convex majorants:

    f_μ(x) := f(x) + (μ/2)‖x‖².

In particular, the sublevel sets of f_μ grow as μ increases.

Algorithm 3.1 Auslender–Teboulle applied to the dual-smoothed RLIP
Input: measurement matrix A, observed vector b
1: z_0 ← 0, z̄_0 ← z_0, θ_0 ← 1
2: for k = 0, 1, 2, . . . do
3:   y_k ← (1 − θ_k) z_k + θ_k z̄_k
4:   x_k ← arg min_x f(x) + (μ/2)‖x‖² − ⟨y_k, Ax − b⟩
5:   z̄_{k+1} ← z̄_k + (μ / (‖A‖² θ_k)) (b − A x_k)
6:   z_{k+1} ← (1 − θ_k) z_k + θ_k z̄_{k+1}
7:   θ_{k+1} ← 2 / (1 + (1 + 4/θ_k²)^{1/2})
8: end for
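As a concrete illustration, the iteration of Algorithm 3.1 can be sketched in NumPy. This is our own sketch, not the authors' implementation: the function name and the `prox` interface are illustrative choices. The argument `prox(v)` must return arg min_x f(x) + (μ/2)‖x‖² − ⟨v, x⟩, which is the minimizer in line 4 evaluated at v = Aᵀ y_k (the term involving b is constant there).

```python
import numpy as np

def auslender_teboulle(A, b, prox, mu, iters=500):
    """Sketch of Algorithm 3.1: fast dual gradient method for the
    dual-smoothed RLIP.  Returns the last primal iterate x_k."""
    z = np.zeros(A.shape[0])                  # line 1: z_0 <- 0
    z_bar = z.copy()                          # line 1: z-bar_0 <- z_0
    theta = 1.0                               # line 1: theta_0 <- 1
    step = mu / np.linalg.norm(A, 2) ** 2     # mu / ||A||^2
    for _ in range(iters):
        y = (1 - theta) * z + theta * z_bar             # line 3
        x = prox(A.T @ y)                               # line 4
        z_bar = z_bar + (step / theta) * (b - A @ x)    # line 5
        z = (1 - theta) * z + theta * z_bar             # line 6
        theta = 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta**2))  # line 7
    return x
```

For f ≡ 0 the prox map is simply v ↦ v/μ, and the iteration converges to the minimum-norm solution of Ax = b; for f = ℓ1 it is the scaled soft-thresholding map used in Section 4.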
We then replace f with f_μ in the original RLIP (1) to obtain new estimators of the form

    x̂_μ := arg min_x f_μ(x)  subject to  Ax = b.    (3)

The Lagrangian of the convex optimization problem (3) becomes

    L_μ(x, z) = f(x) + (μ/2)‖x‖² − ⟨z, Ax − b⟩,

where the Lagrange multiplier z is a vector in R^m. This gives a family of dual problems:

    maximize  g_μ(z) := min_x L_μ(x, z)  subject to  z ∈ R^m.    (4)

Since f_μ is strongly convex, the Lagrangian L_μ has a unique minimizer x_z for each dual point z:

    x_z := arg min_x L_μ(x, z).    (5)

Strong duality holds for (3) and (4) by Slater's condition [18, Sec. 5.2.3]. Therefore, if we solve the dual problem (4) to obtain an optimal dual point, (5) returns the unique optimal primal point.

The dual function is differentiable with ∇g_μ(z) = b − A x_z, and the gradient is Lipschitz-continuous with Lipschitz constant L_μ no larger than μ⁻¹‖A‖²; see [12, 11]. Note that L_μ is decreasing in μ, and so we call μ the smoothing parameter.

3.2 Solving the smoothed dual problem

In order to solve the smoothed dual problem (4), we apply the fast gradient method from Auslender and Teboulle [19]. We present the pseudocode in Algorithm 3.1.

The computational cost of the algorithm depends on two things: the number of iterations necessary for convergence and the cost of each iteration. The following result bounds the error of the primal iterates x_k with respect to the true signal x♮. The proof is in the supplemental material.

Proposition 3.1 (Primal convergence of Algorithm 3.1). Assume that the exact recovery condition holds for the primal problem (3). Algorithm 3.1 applied to the smoothed dual problem (4) converges to an optimal dual point z⋆_μ. Let x⋆_μ be the corresponding optimal primal point given by (5).
Then the sequence of primal iterates {x_k} satisfies

    ‖x♮ − x_k‖ ≤ 2 ‖A‖ ‖z⋆_μ‖ / (μ · k).

The chosen regularizer affects the cost of Algorithm 3.1, line 4. Fortunately, this step is inexpensive for many regularizers of interest. Since the matrix–vector product A x_k in line 5 dominates the other vector arithmetic, each iteration requires O(md) arithmetic operations.

3.3 The time–data tradeoff

Proposition 3.1 suggests that increasing the smoothing parameter μ leads to faster convergence of the primal iterates of the Auslender–Teboulle algorithm. The discussion in Section 2.2 suggests that, when we have excess samples, we can increase the smoothing parameter while maintaining exact recovery. Our main technical proposal combines these two observations:

As the number m of measurements in the RLIP (1) increases, we smooth the dual problem (4) more and more aggressively while maintaining exact recovery. The Auslender–Teboulle algorithm can solve these increasingly smoothed problems faster.

Figure 2: Statistical dimension and maximal smoothing for the dual-smoothed ℓ1 norm. Panel (a) shows upper bounds for the normalized statistical dimension d⁻¹ δ(D(f_μ; x♮)) of the dual-smoothed sparse vector recovery problem for several choices of μ. Panel (b) shows lower bounds for the maximal smoothing parameter μ(m) for several choices of the normalized sparsity ρ := s/d.

In order to balance the inherent tradeoff between smoothing and accuracy, we introduce the maximal smoothing parameter μ(m).
For a sample size m, μ(m) is the largest number satisfying

    δ(D(f_{μ(m)}; x♮)) ≤ m.    (6)

Choosing a smoothing parameter μ ≤ μ(m) ensures that we do not cross the phase transition of our RLIP. In practice, we need to be less aggressive in order to avoid the "transition region". The following two sections provide examples that use our proposal to achieve a clear time–data tradeoff.

4 Example: Sparse vector recovery

In this section, we apply the method outlined in Section 3 to the sparse vector recovery problem.

4.1 The optimization problem

Assume that x♮ is a sparse vector. The ℓ1 norm serves as a convex proxy for sparsity, so we choose it as the regularizer in the RLIP (1). This problem is known as basis pursuit, and it was proposed by Chen et al. [20]. It has roots in geophysics [21, 22].

We apply the dual-smoothing procedure from Section 3 to obtain the relaxed primal problem, which is equivalent to the elastic net of Zou and Hastie [23]. The smoothed dual is given by (4).

To determine the exact recovery condition, Fact 2.3, for the dual-smoothed RLIP (3), we must compute the statistical dimension of the descent cones of f_μ. We provide an accurate upper bound.

Proposition 4.1 (Statistical dimension bound for the dual-smoothed ℓ1 norm). Let x ∈ R^d with s nonzero entries, and define the normalized sparsity ρ := s/d.
Then

    (1/d) δ(D(f_μ; x)) ≤ inf_{τ≥0} { ρ [1 + τ²(1 + μ‖x‖_{ℓ∞})²] + (1 − ρ) √(2/π) ∫_τ^∞ (u − τ)² e^{−u²/2} du }.

The proof is provided in the supplemental material. Figure 2 shows the statistical dimension and maximal smoothing curves for sparse vectors with ±1 entries. In order to apply this result we only need estimates of the magnitude and sparsity of the signal.

Figure 3: Sparse vector recovery experiment. The average number of iterations (a) and the average computational cost (b) of 10 random trials of the dual-smoothed sparse vector recovery problem with ambient dimension d = 40 000 and normalized sparsity ρ = 5% for various sample sizes m. The red curve represents a fixed smoothing parameter μ = 0.1, while the blue curve uses μ = μ(m)/4. The error bars indicate the minimum and maximum observed values.

To apply Algorithm 3.1 to this problem, we must calculate an approximate primal solution x_z from a dual point z (Algorithm 3.1, line 4). This step can be written as

    x_z ← μ(m)⁻¹ · SoftThreshold(Aᵀz, 1),

where [SoftThreshold(x, t)]_i = sgn(x_i) · max{ |x_i| − t, 0 }.
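As a short sketch (our own notation, not the authors' code), this primal recovery step reads:

```python
import numpy as np

def soft_threshold(x, t):
    """Entrywise shrinkage: [SoftThreshold(x, t)]_i = sgn(x_i) * max(|x_i| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def primal_from_dual(A, z, mu_m):
    """Approximate primal solution for the dual-smoothed l1 problem
    (Algorithm 3.1, line 4): x_z = mu(m)^{-1} * SoftThreshold(A^T z, 1)."""
    return soft_threshold(A.T @ z, 1.0) / mu_m
```

The separability of the ℓ1 norm is what makes this step cheap: each coordinate of Aᵀz is shrunk independently, so the prox costs O(d) on top of the matrix–vector product.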
Algorithm 3.1, line 5 dominates the total cost of each iteration.

4.2 The time–data tradeoff

We can obtain theoretical support for the existence of a time–data tradeoff in the sparse recovery problem by adapting Proposition 3.1. See the supplemental material for the proof.

Proposition 4.2 (Error bound for dual-smoothed sparse vector recovery). Let x♮ ∈ R^d with s nonzero entries, m be the sample size, and μ(m) be the maximal smoothing parameter (6). Given a measurement matrix A ∈ R^{m×d}, assume the exact recovery condition (2) holds for the dual-smoothed sparse vector recovery problem. Then the sequence of primal iterates from Algorithm 3.1 satisfies

    ‖x♮ − x_k‖ ≤ (2 d^{1/2} κ(A) / (μ(m) · k)) · [ ρ (1 + μ(m)‖x♮‖_{ℓ∞})² + (1 − ρ) ]^{1/2},

where ρ := s/d is the normalized sparsity of x♮, and κ(A) is the condition number of the matrix A.

For a fixed number k of iterations, as the number m of samples increases, Proposition 4.2 suggests that the error decreases like 1/μ(m). This observation suggests that we can achieve a time–data tradeoff by smoothing.

4.3 Numerical experiment

Figure 3 shows the results of a numerical experiment that compares the performance difference between current numerical practice and our aggressive smoothing approach.

Most practitioners use a fixed smoothing parameter μ that depends on the ambient dimension or sparsity but not on the sample size. For the constant smoothing case, we choose μ = 0.1 based on the recommendation in [15].
It is common, however, to see much smaller choices of μ [24, 25].

In contrast, our method exploits excess samples by smoothing the dual problem more aggressively. We set the smoothing parameter μ = μ(m)/4. This heuristic choice is small enough to avoid the phase transition of the RLIP while large enough to reap performance benefits. Our forthcoming work [13] addressing the case of noisy samples provides a more principled way to select this parameter.

In the experiment, we fix both the ambient dimension d = 40 000 and the normalized sparsity ρ = 5%. To test each smoothing approach, we generate and solve 10 random sparse vector recovery models for each value of the sample size m = 12 000, 14 000, 16 000, . . . , 38 000. Each random model comprises a Gaussian measurement matrix A and a random sparse vector x♮ whose nonzero entries are ±1 with equal probability. We stop Algorithm 3.1 when the relative error ‖x♮ − x_k‖ / ‖x♮‖ is less than 10⁻³. This condition guarantees that both methods maintain the same level of accuracy.

In Figure 3(a), we see that for both choices of μ, the average number of iterations decreases as sample size increases. When we plot the total computational cost¹ in Figure 3(b), we see that the constant smoothing method cannot overcome the increase in cost per iteration. In fact, in this example, it would be better to throw away excess data when using constant smoothing. Meanwhile, our aggressive smoothing method manages to decrease total cost as sample size increases.
The maximal speedup achieved is roughly 2.5×.

We note that if the matrix A were orthonormal, the cost of both smoothing methods would decrease as sample sizes increase. In particular, the uptick seen at m = 38 000 in Figure 3 would disappear (but our method would maintain roughly the same relative advantage over constant smoothing). This suggests that the condition number κ(A) indeed plays an important role in determining the computational cost. We believe that using a Gaussian matrix A is warranted here, as statistical models often involve independent subjects.

Let us emphasize that we use the same algorithm to test both smoothing approaches, so the relative comparison between them is meaningful. The observed improvement shows that we have indeed achieved a time–data tradeoff by aggressive smoothing.

5 Example: Low-rank matrix recovery

In this section, we apply the method outlined in Section 3 to the low-rank matrix recovery problem.

5.1 The optimization problem

Assume that X♮ ∈ R^{d1×d2} is low-rank. Consider a known measurement matrix A ∈ R^{m×d}, where d := d1 d2. We are given linear measurements of the form b = A · vec(X♮), where vec returns the (column) vector obtained by stacking the columns of the input matrix. Fazel [26] proposed using the Schatten 1-norm ‖·‖_{S1}, the sum of the matrix's singular values, as a convex proxy for rank. Therefore, we follow Recht et al. [27] and select f = ‖·‖_{S1} as the regularizer in the RLIP (1). The low-rank matrix recovery problem has roots in control theory [28].

We apply the dual-smoothing procedure to obtain the approximate primal problem and the smoothed dual problem, replacing the squared Euclidean norm in (3) with the squared Frobenius norm.

As in the sparse vector case, we must compute the statistical dimension of the descent cones of the strongly convex regularizer f_μ.
In the case where the matrix X is square, the following is an accurate upper bound for this quantity. (The non-square case is addressed in the supplemental material.)

Proposition 5.1 (Statistical dimension bound for the dual-smoothed Schatten 1-norm). Let X ∈ R^{d1×d1} have rank r, and define the normalized rank ρ := r/d1. Then

    (1/d1²) δ(D(f_μ; X)) ≤ inf_{0≤τ≤2} { ρ [1 + τ²(1 + μ‖X‖)²] + (1 − ρ) [ ρ + ((1 − ρ)/(12π)) (24(1 + τ²) cos⁻¹(τ/2) − τ(26 + τ²) √(4 − τ²)) ] } + o(1),

as d1 → ∞ while keeping the normalized rank ρ constant.

The proof is provided in the supplemental material. The plots of the statistical dimension and maximal smoothing curves closely resemble those of the ℓ1 norm and are in the supplemental material as well.

Figure 4: Low-rank matrix recovery experiment. The average number of iterations (a) and the average cost (b) of 10 random trials of the dual-smoothed low-rank matrix recovery problem with ambient dimension d = 200 × 200 and normalized rank ρ = 5% for various sample sizes m. The red curve represents a fixed smoothing parameter μ = 0.1, while the blue curve uses μ = μ(m)/4. The error bars indicate the minimum and maximum observed values.

¹ We compute total cost as k · md, where k is the number of iterations taken, and md is the dominant cost of each iteration.

In this case, Algorithm 3.1, line 4 becomes [12, Sec. 4.3]

    X_z ← μ(m)⁻¹ · SoftThresholdSingVal(mat(Aᵀz), 1),

where mat is the inverse of the vec operator.
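A minimal NumPy sketch of the singular value soft-thresholding map (the names are our illustrative choices; the scaling by μ(m)⁻¹ is applied outside, as in the formula above):

```python
import numpy as np

def soft_threshold(x, t):
    """Entrywise shrinkage: sgn(x_i) * max(|x_i| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def soft_threshold_sing_val(X, t):
    """Shrink the singular values of X by t, keeping the singular vectors:
    U diag(SoftThreshold(sigma, t)) V^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # scale the columns of U by the shrunken singular values
    return (U * soft_threshold(s, t)) @ Vt
```

In contrast to the ℓ1 case, the per-iteration prox now costs a full SVD of a d1 × d2 matrix rather than an entrywise shrinkage.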
Given a matrix X with SVD X = U · diag(σ) · Vᵀ,

    SoftThresholdSingVal(X, t) = U · diag(SoftThreshold(σ, t)) · Vᵀ.

Algorithm 3.1, line 5 dominates the total cost of each iteration.

5.2 The time–data tradeoff

When we adapt the error bound in Proposition 3.1 to this specific problem, the result is nearly the same as in the ℓ1 case (Proposition 4.2). For completeness, we include the full statement of the result in the supplementary material, along with its proof. Our experience with the sparse vector recovery problem suggests that a tradeoff should exist for the low-rank matrix recovery problem as well.

5.3 Numerical experiment

Figure 4 shows the results of a numerical experiment substantially similar to the one performed for sparse vectors. Again, current practice dictates using a smoothing parameter that has no dependence on the sample size m [29]. In our tests, we choose the constant parameter μ = 0.1 recommended by [15]. As before, we compare this with our aggressive smoothing method that selects μ = μ(m)/4.

In this case, we use the ambient dimension d = 200 × 200 and set the normalized rank ρ = 5%. We test each method with 10 random trials of the low-rank matrix recovery problem for each value of the sample size m = 11 250, 13 750, 16 250, . . . , 38 750. The measurement matrices are again Gaussian, and the nonzero singular values of the random low-rank matrices X♮ are 1. We solve each problem with Algorithm 3.1, stopping when the relative error in the Frobenius norm is smaller than 10⁻³.

In Figure 4, we see that both methods require fewer iterations for convergence as sample size increases. Our aggressive smoothing method additionally achieves a reduction in total computational cost, while the constant method does not.
The observed speedup from exploiting the additional samples is 5.4×.

The numerical results show that we have indeed identified a time–data tradeoff via smoothing. While this paper considers only the regularized linear inverse problem, our technique extends to other settings. Our forthcoming work [13] addresses the case of noisy measurements, provides a connection to statistical learning problems, and presents additional examples.

References

[1] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proc. 25th Annu. Int. Conf. Machine Learning (ICML 2008), pages 928–935. ACM, 2008.
[2] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20 (NIPS 2007), pages 161–168, 2008.
[3] A. A. Amini and M. J. Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann. Statist., 37(5B):2877–2921, 2009.
[4] A. Agarwal, P. L. Bartlett, and J. C. Duchi. Oracle inequalities for computationally adaptive model selection. arXiv, 2012, 1208.0129v1.
[5] Q. Berthet and P. Rigollet. Computational lower bounds for sparse PCA. arXiv, 2013, 1304.0828v2.
[6] V. Chandrasekaran and M. I. Jordan. Computational and statistical tradeoffs via convex relaxation. Proc. Natl. Acad. Sci. USA, 110(13):E1181–E1190, 2013.
[7] A. Daniely, N. Linial, and S. Shalev-Shwartz. More data speeds up training time in learning halfspaces over sparse vectors. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 145–153, 2013.
[8] M. I. Jordan. On statistics, computation and scalability.
Bernoulli, 19(4):1378\u20131390, 2013.\n[9] D. Shender and J. La\ufb00erty. Computation-Risk Tradeo\ufb00s for Covariance-Thresholded Regression. In Proc.\n\n30th Int. Conf. Machine Learning (ICML 2013), pages 756\u2013764, 2013.\n\n[10] A. S. Nemirovsky and D. B. Yudin. Problem complexity and method e\ufb03ciency in optimization. A\n\nWiley-Interscience Publication. John Wiley & Sons Inc., New York, 1983.\n\n[11] Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127\u2013152, 2005.\n[12] S. R. Becker, E. J. Cand\u00e8s, and M. C. Grant. Templates for convex cone problems with applications to\n\nsparse signal recovery. Math. Program. Comput., 3(3):165\u2013218, 2011.\n\n[13] J. J. Bruer, J. A. Tropp, V. Cevher, and S. R. Becker. Designing Statistical Estimators That Balance Sample\n\nSize, Risk, and Computational Cost. IEEE J. Sel. Topics Signal Process., to appear, 2015.\n\n[14] A. Agarwal, S. Negahban, and M. J. Wainwright. Fast Global Convergence of Gradient Methods for\n\nHigh-Dimensional Statistical Recovery. Ann. Statist., 40(5):2452\u20132482, 2012.\n\n[15] M.-J. Lai and W. Yin. Augmented l(1) and Nuclear-Norm Models with a Globally Linearly Convergent\n\nAlgorithm. SIAM J. Imaging Sci., 6(2):1059\u20131091, 2013.\n\n[16] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: A geometric theory of phase\n\ntransitions in convex optimization. Information and Inference, to appear, 2014.\n\n[17] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The Convex Geometry of Linear Inverse\n\nProblems. Found. Comput. Math., 12(6):805\u2013849, 2012.\n\n[18] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, Cambridge, 2004.\n[19] A. Auslender and M. Teboulle. Interior gradient and proximal methods for convex and conic optimization.\n\nSIAM J. Optim., 16(3):697\u2013725, 2006.\n\n[20] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. 
Sci.\n\nComput., 20(1):33\u201361, 1998.\n\n[21] J. F. Claerbout and F. Muir. Robust modeling with erratic data. Geophysics, 38(5):826\u2013844, 1973.\n[22] F. Santosa and W. W. Symes. Linear Inversion of Band-Limited Re\ufb02ection Seismograms. SIAM J. Sci.\n\nStat. Comput., 7(4):1307\u20131330, 1986.\n\n[23] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat.\n\nMethodol., 67:301\u2013320, 2005.\n\n[24] J.-F. Cai, S. Osher, and Z. Shen. Linearized Bregman Iterations for Compressed Sensing. Math. Comp.,\n\n78(267):1515\u20131536, 2009.\n\n[25] S. Osher, Y. Mao, B. Dong, and W. Yin. Fast linearized Bregman iteration for compressive sensing and\n\nsparse denoising. Commun. Math. Sci., 8(1):93\u2013111, 2010.\n\n[26] M. Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.\n[27] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed Minimum-Rank Solutions of Linear Matrix Equations\n\nvia Nuclear Norm Minimization. SIAM Rev., 52(3):471\u2013501, 2010.\n\n[28] M. Mesbahi and G. P. Papavassilopoulos. On the rank minimization problem over a positive semide\ufb01nite\n\nlinear matrix inequality. IEEE Trans. Automat. Control, 42(2):239\u2013243, 1997.\n\n[29] J.-F. Cai, E. J. Cand\u00e8s, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM\n\nJ. Optim., 20(4):1956\u20131982, 2010.\n\n9\n\n\f", "award": [], "sourceid": 873, "authors": [{"given_name": "John", "family_name": "Bruer", "institution": "Caltech"}, {"given_name": "Joel", "family_name": "Tropp", "institution": "Caltech"}, {"given_name": "Volkan", "family_name": "Cevher", "institution": "EPFL"}, {"given_name": "Stephen", "family_name": "Becker", "institution": "University of Colorado"}]}