{"title": "Deep Coding Network", "book": "Advances in Neural Information Processing Systems", "page_first": 1405, "page_last": 1413, "abstract": "This paper proposes a principled extension of the traditional single-layer flat sparse coding scheme, where a two-layer coding scheme is derived based on theoretical analysis of nonlinear functional approximation that extends recent results for local coordinate coding. The two-layer approach can be easily generalized to deeper structures in a hierarchical multiple-layer manner. Empirically, it is shown that the deep coding approach yields improved performance in benchmark datasets.", "full_text": "Deep Coding Network\n\nYuanqing Lin\u2020 Tong Zhang\u2021\n\nShenghuo Zhu\u2020 Kai Yu\u2020\n\n\u2020NEC Laboratories America, Cupertino, CA 95129\n\n\u2021Rutgers University, Piscataway, NJ 08854\n\nAbstract\n\nThis paper proposes a principled extension of the traditional single-layer \ufb02at\nsparse coding scheme, where a two-layer coding scheme is derived based on the-\noretical analysis of nonlinear functional approximation that extends recent results\nfor local coordinate coding. The two-layer approach can be easily generalized\nto deeper structures in a hierarchical multiple-layer manner. Empirically, it is\nshown that the deep coding approach yields improved performance in benchmark\ndatasets.\n\n1 Introduction\n\nSparse coding has attracted signi\ufb01cant attention in recent years because it has been shown to be\neffective for some classi\ufb01cation problems [12, 10, 9, 13, 11, 14, 2, 5]. 
In particular, it has been em-\npirically observed that high-dimensional sparse coding plus linear classi\ufb01er is successful for image\nclassi\ufb01cation tasks such as PASCAL 2009 [7, 15].\n\nThe empirical success of sparse coding can be justi\ufb01ed by theoretical analysis [17], which showed\nthat a modi\ufb01cation of sparse coding with added locality constraint, called local coordinate cod-\ning (LCC), represents a new class of effective high dimensional non-linear function approximation\nmethods with sound theoretical guarantees. Speci\ufb01cally, LCC learns a nonlinear function in high\ndimension by forming an adaptive set of basis functions on the data manifold, and it has nonlinear\napproximation power. A recent extension of LCC with added local tangent directions [16] demon-\nstrated the possibility to achieve locally quadratic approximation power when the underlying data\nmanifold is relatively \ufb02at. This also indicates that the nonlinear function approximation view of\nsparse coding not only yields deeper theoretical understanding of its success, but also leads to im-\nproved algorithms based on re\ufb01ned analysis. This paper follows the same idea, where we propose a\nprincipled extension of single-layer sparse coding based on theoretical analysis of a two level coding\nscheme.\n\nThe algorithm derived from this approach has some advantages over the single-layer approach, and\ncan also be extended into multi-layer hierarchical systems. Such extension draws connection to\ndeep belief networks (DBN) [8], and hence we call this approach deep coding network. Hierarchi-\ncal sparse coding has two main advantages over its single-layer counter-part. 
First, at the intuitive level, the first layer (the traditional single-layer basis) yields a crude description of the data at each basis function, and multi-layer basis functions provide a natural way to zoom into each single basis for finer local details; this intuition is reflected more rigorously in our nonlinear function approximation result. Due to the more localized zoom-in effect, it also alleviates the problem of overfitting when many basis functions are needed. Second, it is computationally more efficient than flat coding because we only need to look at locations in the second (or higher) layer corresponding to basis functions with nonzero coefficients in the first (or previous) layer. Since sparse coding produces many zero coefficients, the hierarchical structure eliminates much of the coding computation. Moreover, instead of fitting a single model with many variables as in a flat single-layer approach, our multi-layer coding proposal requires fitting many small models separately, each with a small number of parameters. In particular, fitting the small models can be done in parallel, e.g. using Hadoop, so that learning a fairly large number of codebooks can still be fast.

2 Sparse Coding and Nonlinear Function Approximation

This section reviews the nonlinear function approximation results of the single-layer coding scheme in [17], and then presents our multi-layer extension. Since the result of [17] requires a modification of the traditional sparse coding scheme, called local coordinate coding (LCC), our analysis will rely on a similar modification.

Consider the problem of learning a nonlinear function f(x) in high dimension: x ∈ R^d with large d.
While there are many algorithms in traditional statistics that can learn such a function in low\ndimension, when the dimensionality d is large compared to n, the traditional statistical methods will\nsuffer the so called \u201ccurse of dimensionality\u201d. The recently popularized coding approach addresses\nthis issue. Speci\ufb01cally, it was theoretically shown in [17] that a speci\ufb01c coding scheme called Local\nCoordinate Coding can take advantage of the underlying data manifold geometric structure in order\nto learn a nonlinear function in high dimension and alleviate the curse of dimensionality problem.\n\nThe main idea of LCC, described in [17], is to locally embed points on the underlying data manifold\ninto a lower dimensional space, expressed as coordinates with respect to a set of anchor points. The\nmain theoretical observation was relatively simple: it was shown in [17] that on the data manifold, a\nnonlinear function can be effectively approximated by a globally linear function with respect to the\nlocal coordinate coding. Therefore the LCC approach turns a very dif\ufb01cult high dimensional nonlin-\near learning problem into a much simpler linear learning problem, which can be effectively solved\nusing standard machine learning techniques such as regularized linear classi\ufb01ers. This linearization\nis effective because the method naturally takes advantage of the geometric information.\n\nIn order to describe the results more formally, we introduce a number of notations. 
First we denote by ‖·‖ the Euclidean norm (2-norm) on R^d:

‖x‖ = ‖x‖_2 = (x_1^2 + ··· + x_d^2)^{1/2}.

Definition 2.1 (Smoothness Conditions) A function f(x) on R^d is (α, β, ν) Lipschitz smooth with respect to a norm ‖·‖ if

‖∇f(x)‖ ≤ α,

and

|f(x′) − f(x) − ∇f(x)^T (x′ − x)| ≤ β ‖x′ − x‖^2,

and

|f(x′) − f(x) − 0.5 (∇f(x′) + ∇f(x))^T (x′ − x)| ≤ ν ‖x − x′‖^3,

where we assume α, β, ν ≥ 0.

These conditions have been used in [16], and they characterize the smoothness of f under zeroth-, first-, and second-order approximations. The parameter α is the Lipschitz constant of f(x), which is finite if f(x) is Lipschitz; in particular, if f(x) is constant, then α = 0. The parameter β is the Lipschitz derivative constant of f(x), which is finite if the derivative ∇f(x) is Lipschitz; in particular, if ∇f(x) is constant (that is, f(x) is a linear function of x), then β = 0. The parameter ν is the Lipschitz Hessian constant of f(x), which is finite if the Hessian of f(x) is Lipschitz; in particular, if the Hessian ∇²f(x) is constant (that is, f(x) is a quadratic function of x), then ν = 0. In other words, these parameters measure different levels of smoothness of f(x): locally when ‖x − x′‖ is small, α measures how well f(x) can be approximated by a constant function, β measures how well f(x) can be approximated by a linear function in x, and ν measures how well f(x) can be approximated by a quadratic function in x.
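As a quick numerical illustration (our own sketch, not part of the paper), the three conditions above predict that for a smooth f the constant, linear, and symmetrized-gradient approximation errors shrink like h, h^2, and h^3 respectively as h = ‖x − x′‖ → 0. The test function f and all variable names below are ours:

```python
import numpy as np

# Sketch: numerically check the three error orders in Definition 2.1
# for a smooth test function f(x) = sin(x1) * exp(x2).
def f(x):
    return np.sin(x[0]) * np.exp(x[1])

def grad_f(x):
    return np.array([np.cos(x[0]) * np.exp(x[1]),
                     np.sin(x[0]) * np.exp(x[1])])

x = np.array([0.3, -0.2])
d = np.array([1.0, 2.0]) / np.sqrt(5.0)   # unit direction

for h in [1e-1, 1e-2, 1e-3]:
    xp = x + h * d
    e0 = abs(f(xp) - f(x))                                   # O(h): constant approx.
    e1 = abs(f(xp) - f(x) - grad_f(x) @ (xp - x))            # O(h^2): linear approx.
    e2 = abs(f(xp) - f(x)
             - 0.5 * (grad_f(xp) + grad_f(x)) @ (xp - x))    # O(h^3): symmetrized approx.
    print(f"h={h:.0e}  e0={e0:.1e}  e1={e1:.1e}  e2={e2:.1e}")
```

Each tenfold decrease in h should shrink e0 roughly 10-fold, e1 roughly 100-fold, and e2 roughly 1000-fold, matching the first-, second-, and third-order error terms in the definition.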
For local constant approximation, the error term α‖x − x′‖ is first order in ‖x − x′‖; for local linear approximation, the error term β‖x − x′‖^2 is second order in ‖x − x′‖; for local quadratic approximation, the error term ν‖x − x′‖^3 is third order in ‖x − x′‖. That is, if f(x) is smooth with relatively small α, β, ν, the error becomes smaller (locally when ‖x − x′‖ is small) as we use a higher-order approximation.

Similar to the single-layer coordinate coding in [17], here we define a two-layer coordinate coding as follows.

Definition 2.2 (Coordinate Coding) A single-layer coordinate coding is a pair (γ^1, C^1), where C^1 ⊂ R^d is a set of anchor points (aka basis functions), and γ^1 is a map of x ∈ R^d to [γ^1_v(x)]_{v∈C^1} ∈ R^{|C^1|} such that Σ_{v∈C^1} γ^1_v(x) = 1. It induces the following physical approximation of x in R^d:

h_{γ^1,C^1}(x) = Σ_{v∈C^1} γ^1_v(x) v.

A two-layer coordinate coding (γ, C) consists of coordinate coding systems {(γ^1, C^1)} ∪ {(γ^{2,v}, C^{2,v}) : v ∈ C^1}. The pair (γ^1, C^1) is the first-layer coordinate coding, and the pairs (γ^{2,v}, C^{2,v}) are second-layer coordinate codings that refine the first-layer coding for every first-layer anchor point v ∈ C^1.

The performance of LCC is characterized in [17] using the following nonlinear function approximation result.

Lemma 2.1 (Single-layer LCC Nonlinear Function Approximation) Let (γ^1, C^1) be an arbitrary single-layer coordinate coding scheme on R^d.
Let f be an (α, β, ν)-Lipschitz smooth function. We have for all x ∈ R^d:

|f(x) − Σ_{v∈C^1} w_v γ^1_v(x)| ≤ α ‖x − h_{γ^1,C^1}(x)‖ + β Σ_{v∈C^1} |γ^1_v(x)| ‖v − x‖^2,   (1)

where w_v = f(v) for v ∈ C^1.

This result shows that a high-dimensional nonlinear function can be globally approximated by a linear function with respect to the single-layer coding [γ^1_v(x)], with unknown linear coefficients [w_v]_{v∈C^1} = [f(v)]_{v∈C^1}, where the approximation error on the right-hand side is second order. This bound directly suggests the following learning method: for each x, we use its coding [γ^1_v(x)] ∈ R^{|C^1|} as features. We then learn a linear function of the form Σ_v w_v γ^1_v(x) using a standard linear learning method such as SVM, where [w_v] is the unknown coefficient vector to be learned. The optimal coding can be learned from unlabeled data by optimizing the right-hand side of (1) over unlabeled data.

In the same spirit, we can extend the above result on LCC by including additional layers. This leads to the following bound.

Lemma 2.2 (Two-layer LCC Nonlinear Function Approximation) Let (γ, C) = {(γ^1, C^1)} ∪ {(γ^{2,v}, C^{2,v}) : v ∈ C^1} be an arbitrary two-layer coordinate coding on R^d. Let f be an (α, β, ν)-Lipschitz smooth function.
We have for all x ∈ R^d:

|f(x) − Σ_{v∈C^1} w_v γ^1_v(x) − Σ_{v∈C^1} γ^1_v(x) Σ_{u∈C^{2,v}} w_{v,u} γ^{2,v}_u(x)|
≤ 0.5α ‖x − h_{γ^1,C^1}(x)‖ + 0.5α Σ_{v∈C^1} |γ^1_v(x)| ‖x − h_{γ^{2,v},C^{2,v}}(x)‖ + ν Σ_{v∈C^1} |γ^1_v(x)| ‖x − v‖^3,   (2)

where w_v = f(v) for v ∈ C^1 and w_{v,u} = 0.5 ∇f(v)^T (u − v) for u ∈ C^{2,v}, and

|f(x) − Σ_{v∈C^1} γ^1_v(x) Σ_{u∈C^{2,v}} w_{v,u} γ^{2,v}_u(x)|
≤ α Σ_{v∈C^1} |γ^1_v(x)| ‖x − h_{γ^{2,v},C^{2,v}}(x)‖ + β Σ_{v∈C^1} |γ^1_v(x)| ‖x − h_{γ^{2,v},C^{2,v}}(x)‖^2
+ β Σ_{v∈C^1} |γ^1_v(x)| Σ_{u∈C^{2,v}} |γ^{2,v}_u(x)| ‖u − h_{γ^{2,v},C^{2,v}}(x)‖^2,   (3)

where w_{v,u} = f(u) for u ∈ C^{2,v}.

Similar to the interpretation of Lemma 2.1, the bounds in Lemma 2.2 imply that we can approximate a nonlinear function f(x) with a linear function of the form

Σ_{v∈C^1} w_v γ^1_v(x) + Σ_{v∈C^1} Σ_{u∈C^{2,v}} w_{v,u} γ^1_v(x) γ^{2,v}_u(x),

where [w_v] and [w_{v,u}] are the unknown linear coefficients to be learned, and [γ^1_v(x)]_{v∈C^1} and [γ^{2,v}_u(x)]_{v∈C^1, u∈C^{2,v}} form the feature vector. The coding can be learned from unlabeled data by minimizing the right-hand side of (2) or (3).

Compared with the single-layer coding, we note that the second term on the right-hand side of (1) is replaced by the third term on the right-hand side of (2). That is, the linear approximation power of the single-layer coding scheme (with a quadratic error term) becomes quadratic approximation power in the two-layer coding scheme (with a cubic error term). The first term on the right-hand side of (1) is replaced by the first two terms on the right-hand side of (2).
If the manifold is relatively \ufb02at,\nthen the error terms kx \u2212 h\u03b3 1,C 1(x)k and kx \u2212 h\u03b3 2,v,C 2,v (x)k will be relatively small in comparison\nto the second term on the right hand side of (1). In such case the two-layer coding scheme can\npotentially improve the single-layer system signi\ufb01cantly. This result is similar to that of [16], where\nthe second layer uses local PCA instead of another layer of nonlinear coding. However, the bound in\nLemma 2.2 is more re\ufb01ned and speci\ufb01cally applicable to nonlinear coding. The bound in (2) shows\nthe potential of the two-layer coding scheme in achieving higher order approximation power than\nsingle layer coding. Higher order approximation gives meaningful improvement when each |C 2,v|\nis relatively small compared to |C 1|. On the other hand, if |C 1| is small but each |C 2,v| is relatively\nlarge, then achieving higher order approximation does not lead to meaningful improvement. In such\ncase, the bound in (3) shows that the performance of the two-level coding is still comparable to that\nof one-level coding scheme in (1). This is the situation where the 1st layer is mainly used to par-\ntition the space (while its approximation accuracy is not important), while the main approximation\npower is achieved with the second layer. The main advantage of two-layer coding in this case is\nto save computation. This is because instead of solving a single layer coding system with many\nparameters, we can solve many smaller coding systems, each with a small number of parameters.\nThis is the situation when including nonlinearity in the second layer becomes useful, which means\nthat the deep-coding network approach in this paper has some advantage over [16] which can only\napproximate linear function with local PCA in the second layer.\n\n3 Deep Coding Network\n\nWe shall discuss the computational algorithm motivated by Lemma 2.2. 
While the two bounds (2) and (3) consider different scenarios depending on the relative sizes of the first layer and the second layer, in practice it is difficult to differentiate between them, and usually both bounds play a role at the same time. Therefore we have to consider a mixed effect. Instead of minimizing one bound versus the other, we use them to motivate our algorithm, and design a method that accommodates the underlying intuition reflected by the two bounds.

3.1 Two Layer Formulation

In the following, we let C^1 = {v_1, ..., v_{L1}}, γ^1_{v_j}(X_i) = γ^i_j, C^{2,v_j} = {v_{j,1}, ..., v_{j,L2}}, and γ^{2,v_j}_{v_{j,k}}(X_i) = γ^i_{j,k}, where L1 is the size of the first-layer codebook and L2 is the size of each individual codebook at the second layer. We take a layer-by-layer approach to training, where the second layer is regarded as a refinement of the first layer, which is consistent with Lemma 2.2. In the first layer, we learn a simple sparse coding model with all data:

[γ^1, C^1] = arg min_{γ,v} Σ_{i=1}^n (1/2) ‖X_i − Σ_{j=1}^{L1} γ^i_j v_j‖_2^2

subject to γ^i_j ≥ 0, Σ_j γ^i_j = 1, ‖v_j‖ ≤ κ,   (4)

where κ is some constant; e.g., if all X_i are normalized to have unit length, κ can be set to 1. For convenience, we not only enforce the sum-to-one constraint on the sparse coefficients, but also impose nonnegativity constraints, so that Σ_j |γ^i_j| = Σ_j γ^i_j = 1 for all i.
This presents a probability interpretation of the data, and allows us to bound the following term appearing on the right-hand sides of (2) and (3):

Σ_j γ^i_j ‖X_i − Σ_{k=1}^{L2} γ^i_{j,k} v_{j,k}‖ ≤ ( Σ_j γ^i_j ‖X_i − Σ_{k=1}^{L2} γ^i_{j,k} v_{j,k}‖^2 )^{1/2}.

Note that neither the sum-to-one constraint nor 1-norm regularization of the coefficients is needed in the derivation of (2), while such constraints are needed in (3). This means additional constraints may hurt performance in the case of (2), although they may help in the case of (3). Since we do not know which case is the dominant effect, as a compromise we remove the sum-to-one constraint but add a tunable 1-norm regularization. We still keep the nonnegativity constraint for interpretability. This leads to the following formulation for the second layer:

[γ^{2,v_j}, C^{2,v_j}] = arg min_{γ,v} Σ_{i=1}^n γ^i_j ( (1/2) ‖X_i − Σ_{k=1}^{L2} γ^i_{j,k} v_{j,k}‖_2^2 + λ_2 Σ_{k=1}^{L2} γ^i_{j,k} )

subject to γ^i_{j,k} ≥ 0, ‖v_{j,k}‖ ≤ 1,   (5)

where λ_2 is an ℓ1-norm sparsity regularization parameter controlling the sparseness of the solutions.
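To make the coding step concrete, here is a minimal sketch (ours, with made-up names; the paper itself uses a pathwise projected Newton solver, described in Section 3.3) of solving (5) for one sample and one first-layer basis, with the second-layer codebook held fixed. Note that for a fixed codebook the weight γ^i_j multiplies the whole per-sample objective, so it does not change that sample's optimal code; it matters only when the codebook is updated:

```python
import numpy as np

# Sketch: the coding step of Eq. (5) for one sample x and one first-layer
# basis j, with the second-layer codebook V (d x L2) held fixed. Solves
#     min_{g >= 0}  0.5 * ||x - V g||^2 + lam2 * sum(g)
# by plain projected gradient descent (a stand-in for the paper's
# pathwise projected Newton method).
def second_layer_code(x, V, lam2=0.1, n_iter=500):
    L2 = V.shape[1]
    # step size from the spectral norm of V^T V (gradient Lipschitz constant)
    lr = 1.0 / np.linalg.norm(V.T @ V, 2)
    g = np.zeros(L2)
    for _ in range(n_iter):
        grad = V.T @ (V @ g - x) + lam2
        g = np.maximum(0.0, g - lr * grad)   # projection onto g >= 0
    return g

rng = np.random.default_rng(0)
V = rng.normal(size=(16, 8))
V /= np.linalg.norm(V, axis=0)              # enforce ||v_{j,k}|| <= 1
x = 0.9 * V[:, 2] + 0.05 * rng.normal(size=16)
g = second_layer_code(x, V, lam2=0.05)      # nonnegative, mostly-sparse code
```

The ℓ1 penalty λ_2·Σ_k g_k combined with the nonnegativity projection is what drives most coefficients exactly to zero, matching the sparseness discussed above.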
With the codings on both layers, the sparse representation of X_i is [s γ^i_j, γ^i_j [γ^i_{j,1}, γ^i_{j,2}, ..., γ^i_{j,L2}]]_{j=1,...,L1}, where s is a scaling factor that balances the codings from the two different layers.

3.2 Multi-layer Extension

The two-level coding scheme can be easily extended to the third and higher layers. For example, at the third layer, for each basis v_{j,k}, the third-layer coding solves the following weighted optimization:

[γ^{3,v_{j,k}}, C^{3,v_{j,k}}] = arg min_{γ,v} Σ_{i=1}^n γ^i_{j,k} ( (1/2) ‖X_i − Σ_{l=1}^{L3} γ^i_{j,k,l} v_{j,k,l}‖_2^2 + λ_3 Σ_l γ^i_{j,k,l} )

subject to γ^i_{j,k,l} ≥ 0, ‖v_{j,k,l}‖ ≤ 1.   (6)

3.3 Optimization

The optimization problems in Equations (4) to (6) can generally be solved by alternating the following two steps: 1) given the current codebook estimate v, compute the optimal sparse coefficients γ; 2) given the new estimates of the sparse coefficients, optimize the codebooks.

Step 1 requires solving an independent optimization problem for each data sample, and it can be computationally very expensive when there are many training examples. In such cases, computational efficiency becomes an important issue. We developed efficient algorithms for solving the optimization problems in Step 1 by exploiting the fact that their solutions are sparse. The optimization problem in Step 1 of (4) can be posed as a nonnegative quadratic program with a single sum-to-one equality constraint. We employ an active set method for this problem that easily handles the constraints [4].
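For illustration, the Step-1 subproblem of (4) for a single sample is a quadratic program over the probability simplex. The following sketch (ours; it uses plain projected gradient with Euclidean simplex projection rather than the active set method of [4]) shows the shape of the computation:

```python
import numpy as np

def project_simplex(y):
    # Euclidean projection of y onto {g : g >= 0, sum(g) = 1}
    # (standard sort-and-threshold algorithm).
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(y)) + 1.0) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(y + theta, 0.0)

def first_layer_code(x, V, n_iter=1000):
    # Sketch of the Step-1 coding subproblem of Eq. (4) for one sample:
    #     min_g 0.5 * ||x - V g||^2   s.t.  g >= 0, sum(g) = 1,
    # solved by projected gradient descent (the paper uses an active
    # set method for this nonnegative QP instead).
    L1 = V.shape[1]
    lr = 1.0 / np.linalg.norm(V.T @ V, 2)
    g = np.full(L1, 1.0 / L1)                 # start at the simplex center
    for _ in range(n_iter):
        g = project_simplex(g - lr * (V.T @ (V @ g - x)))
    return g

rng = np.random.default_rng(1)
V = rng.normal(size=(16, 10))
V /= np.linalg.norm(V, axis=0)               # unit-norm bases (kappa = 1)
x = V[:, 4]                                  # a sample lying on basis 4
g = first_layer_code(x, V)                   # code concentrates on entry 4
```

Because the optimum of such simplex-constrained problems typically lies on a low-dimensional face, most entries of g are exactly zero, which is precisely the sparsity the active set method exploits.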
Most importantly, since the optimal solutions are very sparse, the active set method often gives the exact solution after a few dozen iterations. The optimization problem in (5) contains only nonnegativity constraints (but not the sum-to-one constraint), for which we employ a pathwise projected Newton (PPN) method [3] that optimizes a block of coordinates per iteration instead of one coordinate at a time as in the active set method. As a result, in typical sparse coding settings (for example, in the experiments that we will present shortly in Section 4), the PPN method is able to give the exact solution of a medium-size (e.g., 2048-dimensional) nonnegative quadratic program in milliseconds.

Step 2 can be solved in its dual form, which is a convex optimization with nonnegativity constraints [9]. Since the dual problem contains only nonnegativity constraints, we can still employ the projected Newton method. It is known that the projected Newton method has a superlinear convergence rate under fairly mild conditions [3]. The computational cost of Step 2 is often negligible compared to that of Step 1 when the codebook size is no more than a few thousand.

A significant advantage of the second-layer optimization in our proposal is parallelization. As shown in (5), the second-layer sparse coding decomposes into L1 independent coding problems, and thus can be naturally parallelized. In our implementation, this is done through Hadoop.

4 Experiments

4.1 MNIST dataset

We first demonstrate the effectiveness of the proposed deep coding scheme on the popular MNIST benchmark data [1]. The MNIST dataset consists of 60,000 training digits and 10,000 testing digits. In our experiments with the deep coding network, the entire training set is used to learn the first-layer coding, with a codebook of size 64.
For each of the 64 bases in the first layer, a second-layer codebook was learned; the deep coding scheme presented in this paper ensures that the codebook learning can be done independently. We implemented a Hadoop parallel program that solved the 64 codebook-learning tasks in about an hour, which would have taken 64 hours on a single machine. This shows that easy parallelization is a very attractive aspect of the proposed deep coding scheme, especially for large-scale problems.

Table 1 shows the performance of the deep coding network on MNIST compared to some previous coding schemes. There are a number of interesting observations in these results. First, adding an extra layer yields significant improvement in classification; e.g., for L1 = 512, the classification error rate for single-layer LCC is 2.60% [17] while extended LCC achieves 1.98% [16] (the extended LCC method in [16] may also be regarded as a two-layer method, but its second layer is linear); the two-layer coding scheme here significantly improves the performance, with a classification error rate of 1.51%. Second, the two-layer coding is less prone to overfitting than its single-layer counterpart. In fact, for the single-layer coding, our experiments show that further increasing the codebook size causes overfitting (e.g., with L1 = 8192, the classification error deteriorates to 1.78%). In contrast, the performance of two-layer coding still improves when the second-layer codebook is as large as 512 (and the total codebook size is 64 × 512 = 32768, which is very high-dimensional considering that the total number of training examples is only 60,000). This property is desirable especially when a high-dimensional representation is preferred, as in the case of using sparse coding plus a linear classifier.

Figure 1 shows some first-layer bases and their associated second-layer bases.
We can see that the second-layer bases provide deeper details that help to further explain their first-layer parent basis; on the other hand, the parent first-layer basis provides an informative context for its child second-layer bases. For example, in the seventh row of Fig. 1, where the first-layer basis looks like Digit 7, this basis can come from Digit 7, Digit 9 or even Digit 4. Its second-layer bases then help to further explain the meaning of the first-layer basis: among its associated second-layer bases, the first two bases in that row are parts of Digit 9 while the last basis in that row is a part of Digit 4. Meanwhile, the first-layer 7-like basis provides important context for its second-layer part-like bases: without the first-layer basis, the fragmented parts (like the first two second-layer bases in that row) may not be very informative. The zoomed-in details contained in deeper bases significantly help a classifier to resolve difficult examples, and interestingly, coarser details provide useful context for finer details.

Single-layer sparse coding
Number of bases (L1):       512    1024   2048   4096
Local coordinate coding:    2.60   2.17   1.79   1.75
Extended LCC:               1.95   1.82   1.78   1.64

Two-layer sparse coding
Number of bases (L2):       64     128    256    512
L1 = 64:                    1.85   1.69   1.53   1.51

Table 1: The classification error rate (in %) on the MNIST dataset with different sparse coding schemes.

[Figure 1 here: each row shows a first-layer basis followed by its second-layer bases.]

Figure 1: Example of bases from a two-layer coding network on MNIST data.
For each row, the first image is a first-layer basis, and the remaining images are its associated second-layer bases. The colorbar is the same for all images, but the range it represents differs from image to image: generally, the background color of an image represents zero, and the colors above and below it represent positive and negative values, respectively.

4.2 PASCAL 2007

The PASCAL 2007 dataset [6] consists of 20 categories of images such as airplanes, persons, cats, tables, and so on. It contains 2501 training images and 2510 validation images, and the task is to classify an image into one or more of the 20 categories. Therefore, this task can be cast as training 20 binary classifiers. The critical issue is how to extract effective visual features from the images. Among different methods, one particularly effective approach is to use sparse coding to derive a codebook of low-level features (such as SIFT) and represent an image as a bag of visual words [15]. Here, we intend to learn two-layer hierarchical codebooks instead of a single flat codebook for the bag-of-words image representation.

In our experiments, we first sampled dense SIFT descriptors (each represented by a 128×1 vector) on each image at four scales, 7×7, 16×16, 25×25 and 31×31, with a step size of 4. Then, the SIFT descriptors from all images (both training and validation images) were used to learn first-layer codebooks of different sizes, L1 = 512, 1024 and 2048. Given a first-layer codebook, for each basis in the codebook we learned its second-layer codebook of size 64 by solving the weighted optimization in (5). Again, the second-layer codebook learning was done in parallel using Hadoop.
With the first-layer and second-layer codebooks, each SIFT feature was coded into a very high-dimensional space: using L1 = 1024 as an example, the total coding dimension is 1024 + 1024 × 64 = 66,560. For each image, we employed 1×1, 2×2 and 1×3 spatial pyramid matching with max-pooling. Therefore, each image is finally represented by a 532,480 (= 66,560 × 8) dimensional vector for L1 = 1024. Table 2 shows the classification results. It is clear that the two-layer sparse coding performs significantly better than its single-layer counterpart.

Dimension of the first layer (L1):    512    1024   2048
Single-layer sparse coding:           42.7   45.3   48.4
Two-layer sparse coding (L2 = 64):    51.1   52.8   53.3

Table 2: Average precision (in %) of classification on the PASCAL07 dataset using different sparse coding schemes.

We would like to point out that, although we simply employed max-pooling in the experiments, it may not be the best pooling strategy for the hierarchical coding scheme presented in this paper. We believe a better pooling scheme needs to take the hierarchical structure into account, but this remains an open problem and is left for future work.

5 Conclusion

This paper proposes a principled extension of the traditional single-layer flat sparse coding scheme, where a two-layer coding scheme is derived based on theoretical analysis of nonlinear functional approximation that extends recent results for local coordinate coding. The two-layer approach can be easily generalized to deeper structures in a hierarchical multiple-layer manner. There are two main advantages of multi-layer coding: it can potentially achieve better performance because the deeper layers provide more details and structures, and it is computationally more efficient because the coding is decomposed into smaller problems.
Experiments showed that two-layer coding significantly improves on the performance of single-layer coding.

For future directions, it will be interesting to explore the deep coding network with more than two layers. The formulation proposed in this paper grants a straightforward extension from two layers to multiple layers. For small datasets like MNIST, the two-layer scheme already seems very powerful. However, for more complicated data, deeper coding with multiple layers may be an effective way of obtaining finer and finer features. For example, the first-layer coding may pick up some large categories such as humans, bikes, cups, and so on; then for the human category, the second-layer coding may find differences among adults, teenagers, and seniors; and the third layer may find even finer features, such as race features at different ages.

References

[1] http://yann.lecun.com/exdb/mnist/.
[2] Samy Bengio, Fernando Pereira, Yoram Singer, and Dennis Strelow. Group sparse coding. In NIPS'09, 2009.
[3] D. P. Bertsekas. Projected Newton methods for optimization problems with simple constraints. SIAM J. Control Optim., 20(2):221–246, 1982.
[4] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, 2003.
[5] David Bradley and J. Andrew Bagnell. Differentiable sparse coding. In Proceedings of Neural Information Processing Systems 22, December 2008.
[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[7] Mark Everingham. Overview and results of the classification challenge. The PASCAL Visual Object Classes Challenge Workshop at ICCV, 2009.
[8] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks.
Science, 313(5786):504–507, July 2006.
[9] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In Proceedings of Neural Information Processing Systems (NIPS) 19, 2007.
[10] Michael S. Lewicki and Terrence J. Sejnowski. Learning overcomplete representations. Neural Computation, 12:337–365, 2000.
[11] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In NIPS'08, 2008.
[12] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.
[13] Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. Self-taught learning: Transfer learning from unlabeled data. International Conference on Machine Learning, 2007.
[14] Marc'Aurelio Ranzato, Y-Lan Boureau, and Yann LeCun. Sparse feature learning for deep belief networks. In NIPS'07, 2007.
[15] Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[16] Kai Yu and Tong Zhang. Improved local coordinate coding using local tangents. In ICML, 2010.
[17] Kai Yu, Tong Zhang, and Yihong Gong. Nonlinear learning using local coordinate coding. In NIPS'09, 2009.
", "award": [], "sourceid": 1077, "authors": [{"given_name": "Yuanqing", "family_name": "Lin", "institution": null}, {"given_name": "Tong", "family_name": "Zhang", "institution": ""}, {"given_name": "Shenghuo", "family_name": "Zhu", "institution": null}, {"given_name": "Kai", "family_name": "Yu", "institution": null}]}