{"title": "Accelerated Training for Matrix-norm Regularization: A Boosting Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 2906, "page_last": 2914, "abstract": "Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees $\epsilon$ accuracy within $O(1/\epsilon)$ iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization---exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle.", "full_text": "Accelerated Training for Matrix-norm Regularization: A Boosting Approach

Xinhua Zhang*, Yaoliang Yu and Dale Schuurmans
Department of Computing Science, University of Alberta, Edmonton AB T6G 2E8, Canada
{xinhua2,yaoliang,dale}@cs.ualberta.ca

Abstract

Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees ε accuracy within O(1/ε) iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization—exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems.
We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle.

1 Introduction

Our focus in this paper is on unsupervised learning problems such as matrix factorization or latent subspace identification. Automatically uncovering latent factors that reveal important structure in data is a longstanding goal of machine learning research. Such an analysis not only provides understanding, it can also facilitate subsequent data storage, retrieval and processing. We focus in particular on coding or dictionary learning problems, where one seeks to decompose a data matrix X into an approximate factorization X̂ = UV that minimizes reconstruction error while satisfying other properties like low rank or sparsity in the factors. Since imposing a bound on the rank or number of non-zero elements generally makes the problem intractable, such constraints are usually replaced by carefully designed regularizers that promote low-rank or sparse solutions [1-3].

Interestingly, for a variety of dictionary constraints and regularizers, the problem is equivalent to a matrix-norm regularized problem on the reconstruction matrix X̂ [1, 4]. One intensively studied example is the trace norm, which corresponds to bounding the Euclidean norm of the code vectors in U while penalizing V via its ℓ21 norm. To solve trace norm regularized problems, variational methods that optimize over U and V only guarantee local optimality, while proximal gradient algorithms that operate on X̂ [5, 6] can achieve an ε accurate (global) solution in O(1/√ε) iterations; but these require singular value thresholding [7] at each iteration, preventing application to large problems.

Recently, remarkable promise has been demonstrated for sparse approximation methods.
[8] converts the trace norm problem into an optimization over positive semidefinite (PSD) matrices, then solves the problem via greedy sparse approximation [9, 10]. [11] further generalizes the algorithm from the trace norm to gauge functions [12], dispensing with the PSD conversion. However, these schemes turn the regularization into a constraint. Despite their theoretical equivalence, many practical applications require the solution to the regularized problem, e.g. when it is nested in another problem.

In this paper, we optimize the regularized objective directly by reformulating the problem in the framework of ℓ1 penalized boosting [13, 14], allowing it to be solved with a general procedure developed in Section 2. Each iteration of this procedure calls an oracle to find a weak hypothesis (typically a rank-one matrix) yielding the steepest local reduction of the (unregularized) loss. The associated weight is then determined by accounting for the ℓ1 regularization. Our first key contribution is to establish that, when the loss is convex and smooth, the procedure finds an ε accurate solution within O(1/ε) iterations. To the best of our knowledge, this is the first O(1/ε) objective-value rate that has been rigorously established for ℓ1 regularized boosting. [15] considered a similar boosting approach, but required totally corrective updates; in addition, their rate characterizes the diminishment of the gradient, and is O(1/ε²) as opposed to the O(1/ε) established here. [9-11, 16-18] establish similar rates, but only for the constrained version of the problem.

We also show in Section 3 how the empirical performance of ℓ1 penalized boosting can be greatly improved by introducing an auxiliary rank-constrained local optimization within each iteration.

* Xinhua Zhang is now at the National ICT Australia (NICTA), Machine Learning Group.
Interlacing rank-constrained optimization with sparse updates has been shown effective in semidefinite programming [19-21]. [22] applied the idea to trace norm optimization by factoring the reconstruction matrix into two orthonormal matrices and a positive semidefinite matrix. Unfortunately, this strategy creates a very difficult constrained optimization problem, compelling [22] to resort to manifold techniques. Instead, we use a simpler variational representation of matrix norms that leads to a new local objective that is both unconstrained and smooth. This allows the application of much simpler and much more efficient solvers to greatly accelerate the overall optimization.

Underlying standard sparse approximation methods is an oracle that efficiently selects a weak hypothesis (using boosting terminology). Unfortunately these oracle problems are extremely challenging except in limited cases [3, 11]. Our next major contribution, in Section 4, is to formulate an efficient oracle for latent multiview factorization models [2, 4], based on a positive semidefinite relaxation that we prove incurs no gap.

Finally, we point out that our focus in this paper is on the optimization of convex problems that relax the "hard" rank constraint. We do not explicitly minimize the rank, which is different from [23].

Notation We use γK to denote the gauge induced by the set K; ‖·‖* to denote the dual norm of ‖·‖; and ‖·‖F, ‖·‖tr and ‖·‖sp to denote the Frobenius norm, trace norm and spectral norm respectively. ‖X‖R,1 denotes the row-wise norm Σi ‖Xi:‖R, while ⟨X, Y⟩ := tr(X′Y) denotes the inner product.
The notation X ⪰ 0 will denote positive semidefiniteness; X:i and Xi: stand for the i-th column and i-th row of matrix X; and diag{ci} denotes a diagonal matrix with the (i, i)-th entry ci.

2 The Boosting Framework with ℓ1 Regularization

Consider a coding problem where one is presented an n×m matrix Z, whose columns correspond to m training examples. Our goal is to learn an n×k dictionary matrix U, consisting of k basis vectors, and a k×m coefficient matrix V, such that UV approximates Z under some loss L(UV). We suppress the dependence on the data Z throughout the paper. To remove the scaling invariance between U and V, it is customary to restrict the bases, i.e. the columns of U, to the unit ball of some norm ‖·‖C. Unfortunately, for a fixed k, this coding problem is known to be computationally tractable only for the squared loss. To retain tractability for a variety of convex losses, a popular and successful recent approach has been to avoid any "hard" constraint on the number of bases, i.e. k, and instead impose regularizers on the matrix V that encourage a low-rank or sparse solution.

To be more specific, the following optimization problem lies at the heart of many sparse learning models [e.g. 1, 3, 4, 24, 25]:

    min_{U: ‖U:i‖C ≤ 1}  min_{Ṽ}  L(UṼ) + λ‖Ṽ‖R,1,    (1)

where λ ≥ 0 specifies the tradeoff between loss and regularization. The ‖·‖R norm in the block R-1 norm provides the flexibility of promoting useful structures in the solution, e.g. the ℓ1 norm for sparse solutions, the ℓ2 norm for low-rank solutions, and block-structured norms for group sparsity. To solve (1), we first reparameterize the rows of Ṽ by Ṽi: = σi Vi:, where σi ≥ 0 and ‖Vi:‖R ≤ 1.
Now (1) can be reformulated by introducing the reconstruction matrix X := UṼ:

    (1) = min_X  min_{U,Ṽ: ‖U:i‖C ≤ 1, UṼ = X}  L(X) + λ‖Ṽ‖R,1 = min_X  L(X) + λ min_{σ,U,V: σ ≥ 0, UΣV = X}  Σi σi,    (2)

where Σ = diag{σi}, and U and V in the last minimization also carry norm constraints. (2) is illuminating in two respects. First it reveals that the regularizer essentially seeks a rank-one decomposition of the reconstruction matrix X, and penalizes the ℓ1 norm of the combination coefficients as a proxy of the "rank". Second, the regularizer in (2) is now expressed precisely in the form of the

Algorithm 1: The vanilla boosting algorithm.
Require: The weak hypothesis set A in (3).
1: Set X0 = 0, s0 = 0.
2: for k = 1, 2, . . . do
3:   Hk ← argmin_{H ∈ A} ⟨∇L(Xk−1), H⟩.
4:   (ak, bk) ← argmin_{a≥0, b≥0} L(a Xk−1 + b Hk) + λ(a sk−1 + b).
5:   σ(k)_i ← ak σ(k−1)_i, A(k)_i ← A(k−1)_i for all i < k; σ(k)_k ← bk, A(k)_k ← Hk.
6:   Xk ← Σ_{i=1}^k σ(k)_i A(k)_i = ak Xk−1 + bk Hk;  sk ← Σ_{i=1}^k σ(k)_i = ak sk−1 + bk.
7: end for

Algorithm 2: Boosting with local search.
Require: A set of weak hypotheses A.
1: Set X0 = 0, U0 = V0 = Λ0 = [ ], s0 = 0.
2: for k = 1, 2, . . .
do
3:   (uk, vk) ← argmin_{uv′ ∈ A} ⟨∇L(Xk−1), uv′⟩.
4:   (ak, bk) ← argmin_{a≥0, b≥0} L(a Xk−1 + b uk vk′) + λ(a sk−1 + b).
5:   Uinit ← (Ûk−1 (ak Λk−1)^{1/2}, √bk uk),  Vinit ← ((ak Λk−1)^{1/2} V̂k−1 ; √bk vk′).
6:   Locally optimize g(U, V) with initial value (Uinit, Vinit); get a solution (Uk, Vk).
7:   Xk ← Uk Vk,  Λk ← diag{‖U:i‖C ‖Vi:‖R},  sk ← ½ Σ_{i=1}^k (‖U:i‖²C + ‖Vi:‖²R).
8: end for

gauge function γK induced by the convex hull K of the set¹

    A = {uv′ : ‖u‖C ≤ 1, ‖v‖R ≤ 1}.    (3)

Since K is convex and symmetric (−K = K), the gauge function γK is in fact a norm, hence the support function of A defines the dual norm |||·||| (see e.g. [26, Proposition V.3.2.1]):

    |||Λ||| := max_{X ∈ A} tr(X′Λ) = max_{u,v: ‖u‖C ≤ 1, ‖v‖R ≤ 1} u′Λv = max_{u: ‖u‖C ≤ 1} ‖Λ′u‖*_R = max_{v: ‖v‖R ≤ 1} ‖Λv‖*_C,    (4)

and the gauge function γK is simply its dual norm |||·|||*. For example, when ‖·‖R = ‖·‖C = ‖·‖2, we have |||·||| = ‖·‖sp, so the regularizer (as the dual norm) becomes ‖·‖tr. Another special case of this result was found in [4, Theorem 1], where again ‖·‖R = ‖·‖2 but ‖·‖C is more complicated than ‖·‖2. Note that the original proofs in [1, 4] are somewhat involved.
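As a concrete illustration of (4) in the Euclidean case ‖·‖R = ‖·‖C = ‖·‖2 (where |||·||| = ‖·‖sp), the identity can be checked numerically; this standalone sketch is our own illustration, not part of the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
Lam = rng.standard_normal((5, 4))  # an arbitrary matrix Lambda

# With Euclidean column/row norms, (4) says |||Lam||| = max u' Lam v over the
# unit balls, which is the spectral norm, attained at the top singular pair.
U, s, Vt = np.linalg.svd(Lam)
u, v = U[:, 0], Vt[0]
assert np.isclose(u @ Lam @ v, np.linalg.norm(Lam, 2))

# No feasible pair can exceed the spectral norm.
for _ in range(200):
    a = rng.standard_normal(5); a /= max(1.0, np.linalg.norm(a))
    b = rng.standard_normal(4); b /= max(1.0, np.linalg.norm(b))
    assert a @ Lam @ b <= s[0] + 1e-9
```

The same top-singular-pair computation is exactly the oracle of Step 3 in the algorithms above when both norms are Euclidean.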
Moreover, this gauge function framework is flexible enough to subsume a number of structurally regularized problems [11, 12], and it is certainly possible to devise other ‖·‖R and ‖·‖C norms that would induce interesting matrix norms.

The gauge function framework also allows us to develop an efficient boosting algorithm for (2), by resorting to the following equivalent problem:

    {σ*_i, A*_i} := argmin_{σi ≥ 0, Ai ∈ A} f({σi, Ai}),  where  f({σi, Ai}) := L(Σi σi Ai) + λ Σi σi.    (5)

The optimal solution X* of (2) can be easily recovered as Σi σ*_i A*_i. Note that in the boosting terminology, A corresponds to the set of weak hypotheses.

¹ Recall that the gauge function γK is defined as γK(X) := inf{Σi σi : Σi σi Ai = X, Ai ∈ K, σi ≥ 0}.

2.1 The boosting algorithm

To solve (5) we propose the boosting strategy presented in Algorithm 1. At each iteration, a weak hypothesis Hk that yields the most rapid local decrease of the loss L is selected. Then Hk is combined with the previous ensemble by tuning its weights to optimize the regularized objective. Note that in Step 5 all the weak hypotheses selected in the previous steps are scaled by the same value. As the ℓ1 regularizer requires the sum of all the weights, we introduce a variable sk that recursively updates this sum in Step 6. In addition, Xk is used only in Steps 3 and 4, which do not require its explicit expansion in terms of the elements of A. Therefore this expansion of Xk does not need to be explicitly maintained, and Step 5 is included only for conceptual clarity.

2.2 Rate of convergence

We prove the convergence rate of Algorithm 1 under the standard assumption:

Assumption 1 L is bounded from below and has bounded sub-level sets. The problem (5) admits at least one minimizer X*. L is differentiable and satisfies the following inequality for all η ∈
L is differentiable and satis\ufb01es the following inequality for all \u03b7 \u2208\ni \u03c3iAi = X, Ai\u2208K, \u03c3i \u2265 0}.\n\n1Recall that the gauge function \u03b3K is de\ufb01ned as \u03b3K(X) := inf{(cid:80)\n\ni \u03c3i :(cid:80)\n\n3\n\n\f[0, 1] and A, B in the (smallest) convex set that contains both X\u2217 and the sub-level set of f (0):\nL((1 \u2212 \u03b7)A + \u03b7B) \u2264 L(A) + \u03b7 (cid:104)B \u2212 A,\u2207L(A)(cid:105) + CL\u03b72\n. Here CL > 0 is a \ufb01nite constant that\ndepends only on L and X\u2217.\n\n2\n\nTheorem 1 (Rate of convergence) Under Assumption 1, Algorithm 1 \ufb01nds an \u0001 accurate solution\nto (5) in O(1/\u0001) steps. More precisely, denoting f\u2217 as the minimum of (5), then\n\nf ({\u03c3(k)\n\ni\n\n, A(k)\n\ni }) \u2212 f\u2217 \u2264 4CL\nk + 2\n\n.\n\n(6)\n\nThe proof is given in Appendix A. Note that the rate is independent of the regularization constant \u03bb.\nk+2; it should be clear that\nIn the proof we \ufb01x the variable a in Step 4 of Algorithm 1 to be simply 2\nsetting a by line search will only accelerate the convergence. An even more aggressive scheme is\nthe totally corrective update [15], which in Step 4 \ufb01nds the weights for all A(k)\n\n\u2019s selected so far:\n\ni\n\n(cid:32) k(cid:88)\n\nmin\n\u03c3i\u22650\n\nL\n\n(cid:33)\n\nk(cid:88)\n\n\u03c3iA(k)\n\ni\n\n+ \u03bb\n\n\u03c3i.\n\n(7)\n\ni=1\n\ni=1\n\ni\n\ni \u03c3i with h((cid:80)\n\nwhere h is a convex non-decreasing function over [0,\u221e). In (5), this replaces(cid:80)\n\nBut in this case we will have to explicitly maintain the expansion of Xt in terms of the A(k)\n\u2019s.\nFor boosting without regularization, the 1/\u0001 rate of convergence is known to be optimal [27]. 
We conjecture that 1/ε is also a lower bound for regularized boosting.

Extensions Our proof technique allows the regularizer to be generalized to the form h(γK(X)), where h is a convex non-decreasing function over [0, ∞). In (5), this replaces Σi σi with h(Σi σi). By taking h(x) as the indicator (h(x) = 0 if x ≤ 1; ∞ otherwise), our rate can be straightforwardly translated into the constrained setting.

3 Local Optimization with Fixed Rank

In Algorithm 1, Xk is determined by searching in the conic hull of Xk−1 and Hk.² Suppose there exists some auxiliary procedure that allows Xk to be further improved to some Yk (e.g. by local greedy search); then the overall optimization can benefit from it. The only challenge, nevertheless, is how to restore the "context" from Yk, especially the bases Ai and their weights σi. In particular, suppose we have an auxiliary function g and the following procedure is feasible:

1. Initialization: given an ensemble {σi, Ai}, there exists an S such that g(S) ≤ f({σi, Ai}).
2. Local optimization: some (local) optimizer can find a T such that g(T) ≤ g(S).
3. Recovery: one can recover an ensemble {βi, Bi : βi ≥ 0, Bi ∈ A} such that f({βi, Bi}) ≤ g(T).

Then obviously the new ensemble {βi, Bi} improves upon {σi, Ai}. This local search scheme can be easily embedded into Algorithm 1 as follows. After Step 5, initialize S by {σ(k)_i, A(k)_i}. Perform local optimization and recover {βi, Bi}. Then replace Step 6 by Xk = Σi βi Bi and sk = Σi βi. The rate of convergence will directly carry over. However, the major challenge here is the potentially expensive step of recovery, because little assumption or constraint is made on T.

Fortunately, a careful examination of Algorithm 1 reveals that a complete recovery of {βi, Bi} is not required. Indeed, only two "sufficient statistics" are needed, Xk and sk, and therefore it suffices to recover them only. Next we will show how this can be accomplished efficiently in (2). Two simple propositions will play a key role.
Both proofs can be found in Appendix C.

Proposition 1 For the gauge γK induced by K, the convex hull of A in (3), we have

    γK(X) = min_{U,V: UV = X}  ½ Σi (‖U:i‖²C + ‖Vi:‖²R).    (8)

² This does not mean Xk is a minimizer of L(X) + λγK(X) in that cone, because the bases are not optimized simultaneously. Incidentally, this also shows why working with (5) turns out to be more convenient.

If ‖·‖R = ‖·‖C = ‖·‖2, then γK becomes the trace norm (as we saw before), and Σi (‖U:i‖²C + ‖Vi:‖²R) is simply ‖U‖²F + ‖V‖²F. Then Proposition 1 is a well-known variational form of the trace norm [28]. This motivates us to choose the auxiliary function as

    g(U, V) = L(UV) + (λ/2) Σ_{i=1}^k (‖U:i‖²C + ‖Vi:‖²R).    (9)

Proposition 2 For any U ∈ R^{m×k} and V ∈ R^{k×n}, there exist σi ≥ 0, ui ∈ R^m, and vi ∈ R^n such that

    UV = Σ_{i=1}^k σi ui vi′,  ‖ui‖C ≤ 1,  ‖vi‖R ≤ 1,  and  Σ_{i=1}^k σi = ½ Σ_{i=1}^k (‖U:i‖²C + ‖Vi:‖²R).    (10)

Now we can specify concrete details for local optimization in the context of matrix norms:

1. Initialize: given {σi ≥ 0, ui vi′ ∈ A}_{i=1}^k, set (Uinit, Vinit) to satisfy g(Uinit, Vinit) = f({σi, ui vi′}):

    Uinit = (√σ1 u1, . . . , √σk uk),  Vinit = (√σ1 v1, . . . , √σk vk)′.    (11)

2. Locally optimize g(U, V) with initialization (Uinit, Vinit), to obtain a solution (U*, V*).
3.
Recovery: use Proposition 2 to (conceptually) recover {βi, ûi, v̂i} from (U*, V*).

The key advantage of this procedure is that Proposition 2 allows Xk and sk to be computed directly from (U*, V*), keeping the recovery completely implicit:

    Xk = Σ_{i=1}^k βi ûi v̂i′ = U*V*,  and  sk = ½ Σ_{i=1}^k (‖U*:i‖²C + ‖V*i:‖²R).    (12)

In addition, Proposition 2 ensures that locally improving the solution does not incur an increment in the number of weak hypotheses. Using the same trick, the (Uinit, Vinit) in (11) for the (k+1)-th iteration can also be formulated in terms of (U*, V*). Different from the local optimization for trace norm in [21], which naturally works on the original objective, our scheme requires a nontrivial (variational) reformulation of the objective based on Propositions 1 and 2.

The final algorithm is summarized in Algorithm 2, where Û and V̂ in Step 5 denote the column-wise and row-wise normalized versions of U and V, respectively. Compared to the local optimization in [22], which is hampered by orthogonality and PSD constraints, our (local) objective in (9) is unconstrained and smooth for many instances of ‖·‖C and ‖·‖R. This is plausible because no other constraints (besides the norm constraint), such as orthogonality, are imposed on U and V in Proposition 2. Thus the local optimization we face, albeit non-convex in general, is more amenable to efficient solvers such as L-BFGS.

Remark Consider if one performs the totally corrective update as in (7).
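In the trace-norm instance, Propositions 1 and 2 can be sanity-checked numerically: a balanced factorization built from the SVD attains the minimum in (8), while any other factorization can only increase the objective. The square-root-scaled factors and the random invertible reparameterization G below are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))

# Balanced factorization X = U V from the SVD: U = Us*sqrt(S), V = sqrt(S)*Vt.
Us, s, Vt = np.linalg.svd(X, full_matrices=False)
U = Us * np.sqrt(s)            # scale columns by sqrt of singular values
V = np.sqrt(s)[:, None] * Vt   # scale rows by sqrt of singular values
assert np.allclose(U @ V, X)

trace_norm = s.sum()
variational = 0.5 * (np.sum(np.linalg.norm(U, axis=0)**2) +
                     np.sum(np.linalg.norm(V, axis=1)**2))
# Proposition 1, trace-norm case: the minimum of (8) equals ||X||_tr,
# and the balanced factorization attains it.
assert np.isclose(variational, trace_norm)

# Any other factorization of the same X can only give a larger value,
# since (8) is a minimum over all factorizations U V = X.
G = rng.standard_normal((4, 4))
U2, V2 = U @ G, np.linalg.solve(G, V)
cost2 = 0.5 * (np.linalg.norm(U2, 'fro')**2 + np.linalg.norm(V2, 'fro')**2)
assert cost2 >= trace_norm - 1e-9
```

The quantity `variational` is exactly the sk recovered in (12), computed without ever enumerating the rank-one ensemble.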
Then all of the coefficients and weak hypotheses from (U*, V*) have to be considered, which can be computationally expensive. For example, in the case of the trace norm, this leads to a full SVD on U*V*. Although U* and V* usually have low rank, which can be exploited to ameliorate the complexity, it is clearly preferable to completely eliminate the recovery step, as in Algorithm 2.

4 Latent Generative Model with Multiple Views

Underlying most boosting algorithms is an oracle that identifies the steepest-descent weak hypothesis (Step 3 of Algorithm 1). Approximate solutions often suffice [8, 9]. When ‖·‖R and ‖·‖C are both Euclidean norms, this oracle can be efficiently computed via the leading left and right singular vector pair. However, for most other interesting cases, like low-rank tensors, such an oracle is intractable [29]. In this section we discover that for an important problem of multiview learning, the oracle can, surprisingly, be solved in polynomial time, yielding an efficient computational strategy.

Multiview learning analyzes multi-modal data, such as heterogeneous descriptions of text, image and video, by exploiting the implicit conditional independence structure. In this case, beyond a single dictionary U and coefficient matrix V that model a single view Z(1), multiple dictionaries U(t) are needed to reconstruct multiple views Z(t), while keeping the latent representation V shared across all views. Formally the problem in multiview factorization is to optimize [2, 4]:

    min_{U(1): ‖U(1):i‖C ≤ 1} · ·
· min_{U(k): ‖U(k):i‖C ≤ 1}  min_V  Σ_{t=1}^k Lt(U(t)V) + λ‖V‖R,1.    (13)

We can easily re-express the problem as an equivalent "single"-view formulation (1) by stacking all {U(t)} into the rows of a big matrix U, with a new column norm ‖U:i‖C := max_{t=1...k} ‖U(t):i‖C. Then the constraints on U(t) in (13) can be equivalently written as ‖U:i‖C ≤ 1, and Algorithm 2 can be directly applied with two specializations. First the auxiliary function g(U, V) in (9) becomes

    g(U, V) = L(UV) + (λ/2) Σi ((max_{t=1...k} ‖U(t):i‖C)² + ‖Vi:‖²R) = L(UV) + (λ/2) Σi (max_{t=1...k} ‖U(t):i‖²C + ‖Vi:‖²R),

which can be locally optimized. The only challenge left is the oracle problem in (4), which takes the following form when all norms are Euclidean:

    max_{u,v: ‖u‖C ≤ 1, ‖v‖ ≤ 1} u′Λv = max_{‖u‖C ≤ 1} ‖Λ′u‖2 = max_{u: ∀t, ‖ut‖ ≤ 1} ‖Σt Λt′ut‖2.    (14)

[4, 24] considered the case where k = 2 and showed that exact solutions to (14) can be found efficiently. But their derivation does not seem to extend to k > 2. Fortunately there is still an interesting and tractable scenario. Consider multilabel classification with a small number of classes, where U(1) and U(2) are two views of features (e.g. image and text). Then each class label corresponds to a view and the corresponding ut is univariate.
Since there must be an optimal solution on the extreme points of the feasible region, we can enumerate ut ∈ {−1, 1} for t ≥ 3, and for each assignment solve a subproblem of the following form that instantiates (14) (c is a constant vector):

    (QP)  max_{u1,u2}  ‖Λ1′u1 + Λ2′u2 + c‖2,  s.t.  ‖u1‖ ≤ 1, ‖u2‖ ≤ 1.    (15)

Due to inhomogeneity, the technique in [4] is not applicable. Rewrite (15) in matrix form:

    (QP)  min_z  ⟨M0, zz′⟩  s.t.  ⟨M1, zz′⟩ ≤ 0,  ⟨M2, zz′⟩ ≤ 0,  ⟨I00, zz′⟩ = 1,    (16)

where z = (r; u1; u2) with a homogenization scalar r,

    M0 = − [ 0, c′Λ1′, c′Λ2′ ; Λ1c, Λ1Λ1′, Λ1Λ2′ ; Λ2c, Λ2Λ1′, Λ2Λ2′ ],
    M1 = [ −1, 0, 0 ; 0, I, 0 ; 0, 0, 0 ],  M2 = [ −1, 0, 0 ; 0, 0, 0 ; 0, 0, I ],

and I00 is a zero matrix with only the (1, 1)-th entry being 1. Let X = zz′; a semidefinite programming relaxation for (QP) can be obtained by dropping the rank-one constraint:

    (SP)  min_X  ⟨M0, X⟩,  s.t.  ⟨M1, X⟩ ≤ 0,  ⟨M2, X⟩ ≤ 0,  ⟨I00, X⟩ = 1,  X ⪰ 0.    (17)

Its dual problem, which is also the Lagrange dual of (QP), can be written as

    (SD)  max_{y0,y1,y2}  y0,  s.t.  Z := M0 − y0 I00 + y1 M1 + y2 M2 ⪰ 0,  y1 ≥ 0,  y2 ≥ 0.    (18)

(SD) is a convex problem that can be solved efficiently by, e.g., cutting plane methods. (SP) is also a convex semidefinite program (SDP) amenable to standard SDP solvers.
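The correspondence between (15) and (16) is easy to verify numerically: with z = (r; u1; u2) and r = 1, ⟨M0, zz′⟩ equals the negated (QP) objective up to the additive constant c′c (which does not affect the maximizer), while ⟨M1, zz′⟩ ≤ 0 and ⟨M2, zz′⟩ ≤ 0 encode the two ball constraints. The dimensions below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
p, d1, d2 = 4, 3, 2   # ambient dimension and per-view dimensions (arbitrary)
L1, L2 = rng.standard_normal((d1, p)), rng.standard_normal((d2, p))
c = rng.standard_normal(p)

# Assemble M0, M1, M2 of (16); z = (r; u1; u2) with homogenization scalar r.
M0 = -np.block([
    [np.zeros((1, 1)),  (L1 @ c)[None, :], (L2 @ c)[None, :]],
    [(L1 @ c)[:, None], L1 @ L1.T,         L1 @ L2.T],
    [(L2 @ c)[:, None], L2 @ L1.T,         L2 @ L2.T],
])
M1 = np.zeros_like(M0); M1[0, 0] = -1.0
M1[1:1 + d1, 1:1 + d1] = np.eye(d1)      # encodes ||u1||^2 <= r^2
M2 = np.zeros_like(M0); M2[0, 0] = -1.0
M2[1 + d1:, 1 + d1:] = np.eye(d2)        # encodes ||u2||^2 <= r^2

u1 = rng.standard_normal(d1); u1 /= np.linalg.norm(u1)        # on the ball
u2 = rng.standard_normal(d2); u2 /= 2 * np.linalg.norm(u2)    # inside the ball
z = np.concatenate(([1.0], u1, u2))
Z = np.outer(z, z)

qp_obj = np.linalg.norm(L1.T @ u1 + L2.T @ u2 + c)**2
# <M0, zz'> = -(QP objective) + c'c, so minimizing <M0, X> maximizes (15).
assert np.isclose(np.sum(M0 * Z), -qp_obj + c @ c)
assert np.sum(M1 * Z) <= 1e-9 and np.sum(M2 * Z) <= 1e-9
assert np.isclose(Z[0, 0], 1.0)          # <I00, zz'> = 1
```

Solving (SP) or (SD) themselves requires an SDP solver and is outside the scope of this sketch.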
However, further recovering the solution to (QP) is not straightforward, because there may be a gap between the optimal values of (SP) and (QP). The gap is zero (i.e. strong duality holds between (QP) and (SD)) only if the rank-one constraint that (SP) dropped from (QP) is automatically satisfied, i.e. if (SP) has a rank-one optimal solution.

Fortunately, as one of our main results, we prove that strong duality always holds for the particular problem originating from (15). Our proof utilizes some recent developments in optimization [30], and is relegated to Appendix D.

5 Experimental Results

We compared our Algorithm 2 with three state-of-the-art solvers for trace norm regularized objectives: MMBS³ [22], DHM [15], and JS [8]. JS was proposed for solving the constrained problem minX L(X) s.t. ‖X‖tr ≤ ζ, which makes it hard to compare with solvers for the penalized problem minX L(X) + λ‖X‖tr. As a workaround, we first chose a λ and found the optimal solution X* for the penalized problem. Then we set ζ = ‖X*‖tr and finally solved the constrained problem by JS. In this case, it is only fair to compare how fast L(X) (the loss) is decreased by the various solvers, rather than L(X) + λ‖X‖tr (the objective).
DHM is sensitive to the estimate of the Lipschitz constant of the gradient of L, which we manually tuned to a small value such that DHM still converges. Since the code for MMBS is specialized to matrix completion, it was used only in this comparison. Traditional solvers such as proximal methods [6] were not included because they are much slower.

³ http://www.montefiore.ulg.ac.be/~mishra/softwares/traceNorm.html

[Figure 1: MovieLens100k. Figure 2: MovieLens1M. Figure 3: MovieLens10M. Panels: (a) objective and loss (training) vs running time (log-log); (b) test NMAE vs running time (semilog-x).]

Comparison 1: Matrix completion We first compared all methods on a matrix completion problem, using the standard datasets MovieLens100k, MovieLens1M, and MovieLens10M [6, 8, 21], which are sized 943×1682, 6040×3706, and 69878×10677 respectively (#users × #movies). They contain 10^5, 10^6 and 10^7 movie ratings valued from 1 to 5, and the task is to predict the rating for a user on a movie. The training set was constructed by randomly selecting 50% of the ratings for each user, and the prediction is made on the remaining 50%. In Figures 1 to 3, we show how fast the various algorithms drive down the training objective, the training loss L (squared Euclidean distance), and the normalized mean absolute error (NMAE) on the test data [see, e.g., 6, 8]. We tuned λ to optimize the test NMAE.

From Figures 1(a), 2(a) and 3(a), it is clear that our method takes much less CPU time to reduce the objective value (solid lines) and the loss L (dashed lines). This implies that the local search and partially corrective updates in our method are very effective. Not surprisingly, MMBS is the closest to ours in terms of performance because it also adopts local optimization.
However, it is still slower because its local search is conducted on a constrained manifold. In contrast, our local search objective is entirely unconstrained and smooth, which we manage to solve efficiently by L-BFGS.⁴ JS, though applied indirectly, is faster than DHM in reducing the loss. We observed that DHM kept running coordinate descent with a constant step size, while the totally corrective update was rarely taken. We tried accelerating it by using a smaller estimate of the Lipschitz constant of the gradient of L, but this leads to divergence after a rapid decrease of the objective in the first few iterations. A hybrid approach might be useful.

We also studied the evolution of the NMAE performance on the test data. For this we compared the matrix reconstruction after each iteration against the ground truth. As plotted in Figures 1(b), 2(b) and 3(b), our approach achieves comparable (or better) NMAE in much less time than all other methods.

Comparison 2: Multitask and multiclass learning Secondly, we tested on a multiclass classification problem with a synthetic dataset. Following [15], we generated a dataset of D = 250 features and C = 100 classes. Each class c has 10 training examples and 10 test examples drawn independently and identically from a class-specific multivariate Gaussian N(μc, Σc). μc ∈ R^250 has the last 200 coordinates equal to 0, while the top 50 coordinates were chosen uniformly at random from {−1, 1}. The (i, j)-th element of Σc is 2²(0.5)^|i−j|. The task is to predict the class membership of a given example. We used the logistic loss for a model matrix W ∈ R^{D×C}.
In particular, for each training example x_i with label y_i ∈ {1, ..., C}, we defined an individual loss L_i(W) = −log p(y_i | x_i; W), where for any class c,

p(c | x_i; W) = Z_i^{−1} exp(W_{:c}' x_i),   Z_i = Σ_c exp(W_{:c}' x_i).

Then L(W) is defined as the average of L_i(W) over the whole training set. We found that λ = 0.01 yielded the lowest test classification error; the corresponding results are given in Figure 4.

4 http://www.cs.ubc.ca/~pcarbo/lbfgsb-for-matlab.html

Figure 4: Multiclass classification with synthetic dataset. Figure 5: Multitask learning for school dataset. For each: (a) objective & loss vs time (log-log); (b) test error vs time (semilog-x). [plots omitted]
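The loss just defined is standard multiclass logistic regression. A minimal sketch, assuming zero-based labels and the usual log-sum-exp stabilization (both our conventions, not the paper's):

```python
import numpy as np

def multiclass_logistic_loss(W, X, y):
    """Average of L_i(W) = -log p(y_i | x_i; W) over the training set.

    W : (D, C) model matrix, one column W[:, c] per class.
    X : (n, D) examples; y : (n,) integer labels in {0, ..., C-1}.
    """
    scores = X @ W                                # (n, C): entries W_{:c}' x_i
    scores -= scores.max(axis=1, keepdims=True)   # stabilize log-sum-exp
    log_Z = np.log(np.exp(scores).sum(axis=1))    # log Z_i
    log_p = scores[np.arange(len(y)), y] - log_Z  # log p(y_i | x_i; W)
    return -log_p.mean()

# Sanity check: with W = 0 every class has probability 1/C,
# so the loss equals log(C).
X = np.random.randn(5, 3)
y = np.array([0, 1, 2, 3, 0])
W = np.zeros((3, 4))
loss = multiclass_logistic_loss(W, X, y)          # equals log(4)
```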
Clearly, the intermediate models output by our scheme achieve comparable (or better) training objective and test error in orders of magnitude less time than those generated by DHM and JS.

We also applied the solvers to a multitask learning problem with the school dataset [25]. The task is to predict the scores of 15362 students from 139 secondary schools based on a number of school-specific and student-specific attributes. Each school is treated as a task for which a predictor is learned. We used the first random split of training and testing data provided by [25],5 and set λ so as to achieve the lowest test squared error. Again, as shown in Figure 5, our approach is much faster than DHM and JS in finding the optimal solution in terms of both training objective and test error. As the problem requires a large λ, the trace norm penalty is small, making the loss close to the objective.

Comparison 3: Multiview learning. Finally, we performed an initial test of our global optimization technique for learning latent models with multiple views. We used the Flickr dataset from NUS-WIDE [31]. Its first view is a 634-dimensional low-level feature vector, and the second view consists of 1000-dimensional tags. The class labels correspond to types of animals, and we randomly chose 5 types with 20 examples of each type. The task is to train the model in (13) with λ = 10^{−3}. We used the squared loss for the first view, and the logistic loss for the other views.

Figure 6: Multiview training. [plot omitted]

We compared our method with a local optimization approach to solving (13). The local method first fixes all U(t) and minimizes over V, which is a convex problem that can be solved by FISTA [32]. Then it fixes V and optimizes U(t), which is again convex. We let Alt refer to the scheme that alternates these updates to convergence.
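The structure of such an alternating scheme can be sketched in a few lines. The toy below drops the regularizers and substitutes exact least-squares solves for the FISTA inner solves (both are simplifications we make to keep the sketch short), but the block pattern is the same: fix one factor, minimize the convex subproblem in the other, repeat.

```python
import numpy as np

def alt_min(X, k=1, outer=10, seed=0):
    """Alternating minimization of ||X - U V||_F^2 over U and V.

    Each half-step is a convex least-squares problem solved exactly;
    Alt in the text instead runs FISTA on its regularized subproblems,
    but the alternation structure is identical.
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((X.shape[0], k))
    for _ in range(outer):
        # V-step: with U fixed, argmin_V is an ordinary least-squares solve.
        V = np.linalg.lstsq(U, X, rcond=None)[0]
        # U-step: symmetric role with V fixed.
        U = np.linalg.lstsq(V.T, X.T, rcond=None)[0].T
    return U, V

X = np.outer([1.0, 2.0, 3.0], [1.0, -1.0])   # an exactly rank-1 target
U, V = alt_min(X)
err = np.linalg.norm(U @ V - X)
```

On an exactly low-rank target this toy alternation reaches a global optimum; the point of Figure 6 is that on the nonsmooth multiview objective (13) the analogous alternation can stall at a local optimum.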
From Figure 6 it is clear that Alt is trapped at a locally optimal solution, which is inferior to the globally optimal solution that our method is guaranteed to find. Our method also reduces both the objective and the loss slightly faster than Alt.

6 Conclusion and Outlook

We have proposed a new boosting algorithm for a wide range of matrix-norm regularized problems. It is closely related to the generalized conditional gradient method [33]. We established the O(1/ε) convergence rate, and showed its empirical advantage over state-of-the-art solvers on large-scale problems. We also applied the method to a novel problem, latent multiview learning, for which we designed a new efficient oracle. We plan to study randomized boosting with ℓ1 regularization [34], and to extend the framework to more general nonlinear regularization [3].

5 http://ttic.uchicago.edu/~argyriou/code/mtl_feat/school_splits.tar

References

[1] F. Bach, J. Mairal, and J. Ponce. Convex sparse matrix factorizations. arXiv:0812.1869v1, 2008.
[2] H. Lee, R. Raina, A. Teichman, and A. Ng. Exponential family sparse coding with application to self-taught learning. In IJCAI, 2009.
[3] D. Bradley and J. Bagnell. Convex coding. In UAI, 2009.
[4] X. Zhang, Y.-L. Yu, M. White, R.
Huang, and D. Schuurmans. Convex sparse coding, subspace learning, and semi-supervised extensions. In AAAI, 2011.
[5] T. K. Pong, P. Tseng, S. Ji, and J. Ye. Trace norm regularization: Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20(6):3465–3489, 2010.
[6] K.-C. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific Journal of Optimization, 6:615–640, 2010.
[7] J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
[8] M. Jaggi and M. Sulovsky. A simple algorithm for nuclear norm regularized problems. In ICML, 2010.
[9] E. Hazan. Sparse approximate solutions to semidefinite programs. In LATIN, 2008.
[10] K. L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. In SODA, 2008.
[11] A. Tewari, P. Ravikumar, and I. S. Dhillon. Greedy algorithms for structurally constrained high dimensional problems. In NIPS, 2011.
[12] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
[13] Y. Bengio, N. Le Roux, P. Vincent, O. Delalleau, and P. Marcotte. Convex neural networks. In NIPS, 2005.
[14] L. Mason, J. Baxter, P. L. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers, pages 221–246. MIT Press, Cambridge, MA, 2000.
[15] M. Dudik, Z. Harchaoui, and J. Malick. Lifted coordinate descent for learning with trace-norm regularizations. In AISTATS, 2012.
[16] S. Shalev-Shwartz, N. Srebro, and T. Zhang. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20:2807–2832, 2010.
[17] X. Yuan and S. Yan.
Forward basis selection for sparse approximation over dictionary. In AISTATS, 2012.
[18] T. Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49(3):682–691, 2003.
[19] S. Burer and R. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427–444, 2005.
[20] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20:2327–2351, 2010.
[21] S. Laue. A hybrid algorithm for convex semidefinite optimization. In ICML, 2012.
[22] B. Mishra, G. Meyer, F. Bach, and R. Sepulchre. Low-rank optimization with trace norm penalty. Technical report, 2011. http://arxiv.org/abs/1112.2318.
[23] S. Shalev-Shwartz, A. Gonen, and O. Shamir. Large-scale convex minimization with a low-rank constraint. In ICML, 2011.
[24] M. White, Y. Yu, X. Zhang, and D. Schuurmans. Convex multi-view subspace learning. In NIPS, 2012.
[25] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
[26] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms, I and II, volumes 305 and 306. Springer-Verlag, 1993.
[27] I. Mukherjee, C. Rudin, and R. Schapire. The rate of convergence of AdaBoost. In COLT, 2011.
[28] N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In NIPS, 2005.
[29] C. Hillar and L.-H. Lim. Most tensor problems are NP-hard. arXiv:0911.1393v3, 2012.
[30] W. Ai and S. Zhang. Strong duality for the CDT subproblem: A necessary and sufficient condition. SIAM Journal on Optimization, 19:1735–1756, 2009.
[31] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zhang. NUS-WIDE: A real-world web image database from National University of Singapore.
In International Conference on Image and Video Retrieval, 2009.
[32] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[33] K. Bredies, D. Lorenz, and P. Maass. A generalized conditional gradient method and its connection to an iterative shrinkage method. Computational Optimization and Applications, 42:173–193, 2009.
[34] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.