{"title": "A General Greedy Approximation Algorithm with Applications", "book": "Advances in Neural Information Processing Systems", "page_first": 1065, "page_last": 1072, "abstract": null, "full_text": "A General Greedy Approximation Algorithm\n\nwith Applications\n\nIBM T.J. Watson Research Center\n\nYorktown Heights, NY 10598\n\nTong Zhang\n\ntzhang@watson.ibm.com\n\nAbstract\n\nGreedy approximation algorithms have been frequently used to obtain\nsparse solutions to learning problems. In this paper, we present a general\ngreedy algorithm for solving a class of convex optimization problems.\nWe derive a bound on the rate of approximation for this algorithm, and\nshow that our algorithm includes a number of earlier studies as special\ncases.\n\n1 Introduction\n\nThe goal of machine learning is to obtain a certain input/output functional relationship from\na set of training examples. In order to do so, we need to start with a model of the functional\nrelationship. In practice, it is often desirable to \ufb01nd the simplest model that can explain the\ndata. This is because simple models are often easier to understand and can have signi\ufb01cant\ncomputational advantages over more complicated models. In addition, the philosophy of\nOccam\u2019s Razor implies that the simplest solution is likely to be the best solution among all\npossible solutions,\n\nIn this paper, we are interested in composite models that can be expressed as linear com-\nbinations of basic models. In this framework, it is natural to measure the simplicity of a\ncomposite model by the number of its basic model components. Since a composite model\nin our framework corresponds to a linear weight over the basic model space, therefore our\nmeasurement of model simplicity corresponds to the sparsity of the linear weight represen-\ntation.\n\nIn this paper, we are interested in achieving sparsity through a greedy optimization algo-\nrithm which we propose in the next section. This algorithm is closely related to a number of\nprevious works. The basic idea was originated in [5], where Jones observed that if a target\nvector in a Hilbert space is a convex combination of a library of basic vectors, then using\ngreedy approximation, one can achieve an error rate of \u0002\u0001\u0004\u0003\u0006\u0005\b\u0007\n\t with \u0007 basic library vec-\ntors. The idea has been re\ufb01ned in [1] to analyze the approximation property of sigmoidal\nfunctions including neural networks.\n\nThe above methods can be regarded as greedy sparse algorithms for functional approxi-\nmation, which is the noise-free case of regression problems. A similar greedy algorithm\ncan also be used to solve general regression problems under noisy conditions [6]. In ad-\ndition to regression, greedy approximation can also be applied to classi\ufb01cation problems.\n\n\fThe resulting algorithm is closely related to boosting [2] under the additive model point of\nview [3]. This paper shows how to generalize the method in [5, 1] for analyzing greedy\nalgorithms (in their case, for functional approximation problems) and apply it to boosting.\nDetailed analysis will be given in Section 4. Our method can also be used to obtain sparse\nkernel representations for regression problems. Such a sparse representation is what sup-\nport vector regression machines try to achieve. 
In this regard, the method given in this paper complements some recently proposed greedy kernel methods for Gaussian processes such as [9, 10].

The proposed greedy approximation method can also be applied to other prediction problems with different loss functions. For example, in density estimation, the goal is to find a model with the smallest negative log-likelihood. A greedy algorithm for this problem was analyzed in [7]; similar approximation bounds can be directly obtained under the general framework proposed in this paper.

We proceed as follows. Section 2 formalizes the general class of problems considered in this paper and proposes a greedy algorithm to solve the formulation. The convergence rate of the algorithm is investigated in Section 3. Section 4 includes a few examples that can be obtained from our algorithm. Some final concluding remarks are given in Section 5.

2 General Algorithm

In machine learning, our goal is often to predict an unobserved output value y based on an observed input vector x. This requires us to estimate a functional relationship y ≈ p(x) from a set of example pairs (x, y). Usually the quality of a predictor p(x) is measured by a loss function L(p(x), y) that is problem dependent.

In this paper, we are interested in the following scenario: given a family of basic predictors {f_λ(x) : λ ∈ Λ}, we want to obtain a good predictor p(x) that lies in the convex hull of this family, p(x) = Σ_j w_j f_{λ_j}(x), with the fewest possible terms, where the weights w_j are non-negative and sum to one. This family of models can be regarded as additive models in statistics [4]. Formally, each basic model f_λ(x) can be regarded as a vector in a linear functional space.
Our problem in its most general form can thus be described as finding a vector in the convex hull of the basic models that minimizes a functional A. This functional A plays the role of the loss function for learning problems, and A(p) measures the quality of p.

More formally, we consider a linear vector space V and a subset S of V. Denote by co(S) the convex hull of S:

    co(S) = { \sum_{j=1}^{m} w_j g^j :  m \in N,  g^j \in S,  w_j \ge 0,  \sum_{j=1}^{m} w_j = 1 },

where we use N to denote the set of positive integers. We consider the following optimization problem on co(S):

    \min_{f \in co(S)} A(f).                                            (1)

In this paper, we assume that A is a differentiable convex function on co(S).

We propose the following algorithm to approximately solve (1).

Algorithm 2.1 (Sparse greedy approximation)

    pick an initial vector f_0 in co(S)
    for k = 1, 2, ...
        given f_{k-1}, find g^k in S and alpha_k in [0, 1] that minimize
            A((1 - alpha_k) f_{k-1} + alpha_k g^k)
        let f_k = (1 - alpha_k) f_{k-1} + alpha_k g^k
    end

For simplicity, we assume that the minimization of A((1 - α) f_{k-1} + α g) in Algorithm 2.1 can be achieved exactly at each step. This assumption is not essential and can be easily removed using a slightly more refined analysis; however, due to the space limitation, we shall not consider this generalization.

For convenience, we introduce the following quantity:

    \Delta A(f) = A(f) - \inf_{g \in co(S)} A(g).

In the next section, we show that under appropriate regularity conditions, ΔA(f_k) can be bounded as O(1/k), where f_k is the vector computed by Algorithm 2.1; in particular, ΔA(f_k) converges to zero as k increases.
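To make the procedure concrete, the following sketch (not from the paper) runs Algorithm 2.1 over a finite dictionary standing in for the set S, with a simple grid line search over the mixing coefficient; the dictionary, the quadratic objective in the toy example, and the grid search are illustrative assumptions rather than part of the paper, which assumes the step minimization is solved exactly.

import numpy as np

def sparse_greedy(A, S, K, n_alpha=200):
    # Sparse greedy approximation (a sketch of Algorithm 2.1).
    # A : callable mapping a vector to its loss A(f) (assumed differentiable and convex).
    # S : list of basis vectors (a finite dictionary standing in for the set S).
    # K : number of greedy steps; the result uses at most K+1 basis vectors.
    f = S[0].copy()                      # start from an arbitrary element of S
    alphas = np.linspace(0.0, 1.0, n_alpha)
    for k in range(1, K + 1):
        best = (np.inf, None, None)
        for g in S:                      # search over basis vectors g in S
            for a in alphas:             # grid line search over alpha in [0, 1]
                val = A((1.0 - a) * f + a * g)
                if val < best[0]:
                    best = (val, g, a)
        _, g_k, a_k = best
        f = (1.0 - a_k) * f + a_k * g_k  # f_k = (1 - alpha_k) f_{k-1} + alpha_k g_k
    return f

if __name__ == "__main__":
    # Toy example: approximate a target that lies in the convex hull of random vectors.
    rng = np.random.default_rng(0)
    S = [rng.normal(size=20) for _ in range(50)]
    w = rng.dirichlet(np.ones(len(S)))
    target = sum(wi * g for wi, g in zip(w, S))
    A = lambda f: 0.5 * np.sum((f - target) ** 2)   # a differentiable convex loss
    for K in (1, 5, 25):
        print(K, A(sparse_greedy(A, S, K)))

As K grows, the printed loss values decrease roughly at the O(1/k) rate established in the next section.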
3 Approximation bound

Given any convex function A, we have the following proposition, which is a direct consequence of the definition of convexity. In convex analysis, the gradient ∇A can be replaced by the concept of a subgradient, which we do not consider in this paper for simplicity.

Proposition 3.1 Consider a differentiable convex function A and two vectors f and g. Then

    A(f) - A(g) \le \nabla A(f)^T (f - g),

where ∇A is the gradient of A.

The following lemma is the main theoretical result of the paper; it bounds the performance of each greedy step in Algorithm 2.1. We assume that A is second order differentiable.

Lemma 3.1 Let

    M = \sup_{f \in co(S)} \sup_{g, h \in S} (g - h)^T \nabla^2 A(f) (g - h),

where we assume that the Hessian ∇²A of A exists everywhere on co(S). Then for every f in co(S) and every α in [0, 1],

    \min_{g \in S} A((1 - \alpha) f + \alpha g) \le A(f) - \alpha \Delta A(f) + \frac{\alpha^2}{2} M.

In particular, if ΔA(f) ≥ M, then choosing α = 1 gives \min_{g \in S} \Delta A(g) \le M/2; if ΔA(f) ≤ M, then choosing α = ΔA(f)/M gives

    \min_{g \in S} \Delta A((1 - \alpha) f + \alpha g) \le \Delta A(f) - \frac{\Delta A(f)^2}{2M}.

Proof. Using a Taylor expansion and the definition of M, for all f in co(S), g in S and α in [0, 1] we have

    A((1 - \alpha) f + \alpha g) \le A(f) + \alpha \nabla A(f)^T (g - f) + \frac{\alpha^2}{2} M.

Here we used the fact that, writing f = Σ_j w_j h^j with h^j in S, the convexity of the quadratic form gives (g - f)^T ∇²A(ξ) (g - f) ≤ Σ_j w_j (g - h^j)^T ∇²A(ξ) (g - h^j) ≤ M for any ξ on the segment between f and g. Now consider weights w_j ≥ 0 with Σ_j w_j = 1 and vectors ḡ^j in S, and let f̄ = Σ_j w_j ḡ^j. Multiplying the above inequality (with g replaced by ḡ^j) by w_j and summing over j, and noting that the minimum over g in S is no larger than this weighted average, we obtain

    \min_{g \in S} A((1 - \alpha) f + \alpha g) \le A(f) + \alpha \nabla A(f)^T (\bar{f} - f) + \frac{\alpha^2}{2} M.

Using Proposition 3.1, the middle term is at most α (A(f̄) - A(f)). Since f̄ in co(S) is arbitrary, taking the infimum over f̄ yields the first claim. Setting α = 1 when ΔA(f) ≥ M, and α = ΔA(f)/M when ΔA(f) ≤ M, gives the two special cases and proves the lemma.

Using the above lemma, and noting that ΔA(f_k) is non-increasing in Algorithm 2.1 (the choice α = 0 is always allowed), it is easy to obtain the following theorem by induction. Due to the space limitation, we skip the proof.

Theorem 3.1 Under the assumptions of Lemma 3.1, Algorithm 2.1 approximately solves (1), and the rate of convergence for k ≥ 1 is given by

    \Delta A(f_k) \le \frac{2M}{k + 2}.

If ΔA(f_0) ≤ M, then the same bound also holds for k = 0.
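As a quick numerical sanity check (not from the paper), the snippet below iterates the worst-case one-step recursion allowed by Lemma 3.1 when ΔA stays below M, namely eps_{k+1} = eps_k - eps_k^2 / (2M), and confirms that it stays below the 2M/(k+2) bound of Theorem 3.1; the starting value eps_0 = M is an assumption for illustration.

# Illustrative check of the induction behind Theorem 3.1 (a sketch, not from the paper).
M = 1.0
eps = M                               # assume Delta A(f_0) <= M
for k in range(1, 21):
    eps = eps - eps ** 2 / (2 * M)    # worst case permitted by Lemma 3.1
    bound = 2 * M / (k + 2)
    assert eps <= bound + 1e-12
    print(f"k={k:2d}  eps={eps:.4f}  2M/(k+2)={bound:.4f}")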
It may also represent the true distribution for some\nempirical distribution of \u0001\n\nis the expectation over \u0001 and \n\u0001\u0007\u0006\b\n\n,\n\u0013\n+\n\n\u0003\n\u0006\nR\n(\n,\n+\n\n\u0013\n,\n(\n*\n,\n+\n\u001c\n\u0001\n\u0001\nQ\n\n+\nQ\n\u0013\n+\n\u001c\n\t\n\u0004\n+\n\u0001\n(\n*\n,\n+\nQ\n\u0013\n+\n\u0006\nQ\n\t\n\u001b\n\t\n\n\u0002\n+\n$\n\u001d\nR\n+\n\u001c\n\u0001\n\n+\nQ\n\u0013\n+\n\u001c\n\t\n\u0004\n+\n*\n,\n+\nQ\n\u0013\n+\n\u0006\nQ\n\t\n\u001b\n\u001a\n\u001c\n\t\n\n\u0002\n+\n$\n\u001d\nR\n+\n\u001c\n\u0001\n\n+\n+\n\t\n\u0006\n\u001c\n\t\n\u0004\n+\n\u0001\n\u001c\n*\n,\n+\n+\n\t\n\u0006\n\u001c\n\t\n\t\n\n\u0002\n+\n$\n\u001d\nR\n,\n\u0013\n(\n,\n+\nQ\n\u0013\n\u0001\n\u001e\nG\n\u0011\nH\nO\n\u001c\n\u0001\n\u0001\n\n+\nQ\n\u0013\n\t\n\u0004\n\u001c\n\t\n\n+\n\u0001\n\u0001\nG\n\u001c\n\u0001\n\u0003\nQ\n\u001c\n\t\n\t\n\n\u0002\n+\n$\n\u001d\nR\n\u0006\n\u0001\n\u001c\n\t\n\u0006\n\u0001\nG\n\u001c\n\u0001\n\u0003\nQ\n\t\n\t\n\u0005\n\u0001\n%\n\u001d\n\t\n\t\n\u001c\n\u0016\n\t\n\u0004\n\u0002\n\u001d\n3\n\u0003\n\u000f\n\u001c\n\u0011\n\t\n\u0004\n/\n\u001d\n\u0007\nR\n\u001c\n\u001d\n\u000f\n\u001c\n\u0011\n\t\n\u0004\n/\n\u001d\n\u0007\n\nR\n\u0001\n\u0001\n\u001c\n\u0001\n\u0004\n\t\n#\n\u0015\n\u0001\n\n\u0006\n\u0004\n\u0001\n\u0001\n\t\n\t\n$\n\u0011\n\u0013\n#\n\u0015\n\fother engineering applications. Given a set of basis functions \u0004\n\nconsider the following regression formulation that is slightly different from (1):\n\n\u0006\b\u0001\n\n\u0001\u000b\n\n\t with \n\n\u001f\u0001\n\n, we may\n\n(2)\n\nC\u0015DFE\n\ns.t.\n\n\u0013\u0015\u0014\n\n\u0001\u000b\n\n\u0006\b\u0001\n\n\u0013\u001b\u0014\u0007\u0016\n\u0004\u0005\u0004\n\nis a positive regularization parameter which is used to control the size of the\n. The above formulation can be readily converted into (1) by considering\n\nwhere \u0004\nweight vector \u0017\nthe following set \u001e of basic vectors:\n&\u0007\u0006\n\u0001\f\n\n4 ) in Algorithm 2.1. Since the quantity \u001d\n\nWe may start with \u0017\n\ncan be bounded as\n\nin Lemma 3.1\n\n\u001f\t\n\n\u0006\u0019\u0001\n\n(Q\n\n\u0004\b\u0004\n\u0001\u000b\n\n\u0011\u0014\u0013\n\n\u001f\"!\n\nin Algorithm 2.1, represented as weight\n\nThis implies that the sparse solution Q\nand \n\nR/R\n\n\u0013 (\n\n\u0006\b\u0001\n\u0007 ), satis\ufb01es the following inequality:\n\u0001\f\n\n\u0006\b\u0001\n\u0013\u001b\u0014\u0007\u0016\n\u0003 . This leads to the original functional approximation results in [1, 5] and its\n\n\u0006\u0005\n\nC\u001bDFE\n\r\u000f\u000e\u0011\u0010\u0013\u0012\n\n+\u0019\u0014\u0007\u0016\n\n\u0006\b\u0001\n\n\u0006\b\u0001\n\n\u0001\f\n\n\u0003\u0016\u0015\n\n\u0004\b\u0004\n\u0001\f\n\n\u001f3!\n\nfor all \u0007\ngeneralization in [6].\n\nThe sparse regression algorithm studied in this section can also be applied to kernel meth-\nods. In this case,\n\ncorresponds to the input training data space &\n\npredictors are of the form \u0004\n\n\t . Clearly, this corresponds to a special case of\n(2). A sparse kernel representation can be obtained easily from Algorithm 2.1 which leads\nto provably good approximation rate. Our sparse kernel regression formulation is related\nto Gaussian processes, where greedy style algorithms have also been proposed [9, 10]. 
4.2 Binary classification and boosting

In binary classification, the output value y takes values in {-1, +1}. Given a continuous model p(x), we consider the following prediction rule: predict y = 1 if p(x) ≥ 0, and y = -1 if p(x) < 0. The classification error (we shall ignore the point p(x) = 0, which is assumed to occur rarely) can be written as

    err(p) = E_{x,y} I( y\, p(x) \le 0 ),

where I(·) denotes the indicator function. Unfortunately, this classification error is not a convex function of p, and hence cannot be handled directly in our formulation. In fact, even in many other popular methods, such as logistic regression and support vector machines, some kind of convex surrogate formulation has to be employed.
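The convex surrogate idea can be checked numerically: for any ρ > 0, exp(-ρ y p(x)) dominates the 0-1 indicator I(y p(x) ≤ 0) pointwise, so the expected exponential loss introduced below upper-bounds the classification error. A minimal check (an illustration, with randomly generated outputs as an assumption):

import numpy as np

# The 0-1 error I(y * p <= 0) is upper-bounded pointwise by exp(-rho * y * p)
# for any rho > 0; this is why the convex exponential loss of (3) can stand in
# for the non-convex classification error.
rng = np.random.default_rng(2)
p = rng.uniform(-1, 1, size=10_000)     # model outputs in [-1, 1]
y = rng.choice([-1.0, 1.0], size=10_000)
rho = 3.0
zero_one = (y * p <= 0).astype(float)
exp_loss = np.exp(-rho * y * p)
assert np.all(zero_one <= exp_loss)
print("error rate:", zero_one.mean(), "  exp-loss bound:", exp_loss.mean())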
Although it is possible for us to analyze those formulations, in this section we only consider the following loss, which is closely related to AdaBoost [2]:

    A(p) = E_{x,y} \exp( -\rho\, y\, p(x) ),                                       (3)

where ρ > 0 is a scaling factor.

Again, we consider a set of basis predictors f_λ(x) taking values in {-1, +1}, which are often called weak learners in the boosting literature. We would like to find a strong learner p(x) as a convex combination of weak learners that approximately minimizes the above loss:

    \min_{w} E_{x,y} \exp\Big( -\rho\, y \sum_j w_j f_{\lambda_j}(x) \Big)    s.t.  w_j \ge 0,  \sum_j w_j = 1.          (4)

This can be written as formulation (1) with

    S = \{ f_\lambda : \lambda \in \Lambda \},     A(p) = E_{x,y} \exp( -\rho\, y\, p(x) ).          (5)

Using simple algebra, it is easy to verify that the quantity M of Lemma 3.1 satisfies

    M \le 4 \rho^2 e^{\rho},

since |f_λ(x)| ≤ 1 and |p(x)| ≤ 1 for every p in co(S). Theorem 3.1 therefore implies that the sparse solution f_k of Algorithm 2.1, represented as a weight vector w^k with at most k non-zero components, satisfies the following inequality for all k ≥ 1:

    E_{x,y} \exp\Big( -\rho\, y \sum_j w^k_j f_{\lambda_j}(x) \Big) \le \min_{w} E_{x,y} \exp\Big( -\rho\, y \sum_j w_j f_{\lambda_j}(x) \Big) + O\Big( \frac{\rho^2 e^{\rho}}{k} \Big),          (6)

where the minimum on the right-hand side is over w_j ≥ 0 with Σ_j w_j = 1 and is always non-negative. Now we consider the special situation in which there exist a weight vector w̄ and a margin γ > 0 such that for all data (x, y),

    y \sum_j \bar{w}_j f_{\lambda_j}(x) \ge \gamma.                                       (7)

This condition is satisfied in the large margin linearly separable case. Under (7), the minimum on the right-hand side of (6) is at most exp(-ργ), and since the 0-1 error is dominated by the exponential loss, we obtain from (6) that

    E_{x,y} I\Big( y \sum_j w^k_j f_{\lambda_j}(x) \le 0 \Big) \le E_{x,y} \exp\Big( -\rho\, y \sum_j w^k_j f_{\lambda_j}(x) \Big) \le e^{-\rho \gamma} + O\Big( \frac{\rho^2 e^{\rho}}{k} \Big).          (8)

Fixing any k ≥ 1, we can choose ρ (growing with k, for example of order ln k) so that both terms in (8) are small and the margin term e^{-ργ} dominates the behavior. This implies that the misclassification error rate decays exponentially in ργ; this exponential decay of the misclassification error was the original motivation of AdaBoost [2].
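The following sketch (not from the paper) shows how Algorithm 2.1 specializes to the exponential loss (3) with a finite pool of ±1-valued weak learners; the decision-stump pool, the grid line search over α, and the synthetic data are assumptions made for illustration only.

import numpy as np

def greedy_exp_loss_boosting(H, y, rho, K, n_alpha=100):
    # Algorithm 2.1 specialized to the exponential loss (3): a sketch.
    # H   : (n, m) matrix of weak-learner outputs h_j(x_i) in {-1, +1}.
    # y   : (n,) labels in {-1, +1}.
    # rho : scaling factor of the exponential loss.
    # K   : number of greedy steps.
    # Returns convex weights w (w >= 0, sum(w) = 1) over the weak learners.
    n, m = H.shape
    w = np.zeros(m); w[0] = 1.0
    p = H[:, 0].copy()                       # start from a weak learner, so p stays in co(S)
    loss = lambda q: np.mean(np.exp(-rho * y * q))
    alphas = np.linspace(0.0, 1.0, n_alpha)
    for _ in range(K):
        best = (np.inf, None, None)
        for j in range(m):                   # search over weak learners
            for a in alphas:                 # grid line search over alpha
                val = loss((1 - a) * p + a * H[:, j])
                if val < best[0]:
                    best = (val, j, a)
        _, j, a = best
        w = (1 - a) * w
        w[j] += a                            # convex combination update
        p = H @ w
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 5))
    y = np.sign(X[:, 0] + 0.5 * X[:, 1])
    # Weak learners: decision stumps sign(x_d - t) over a few thresholds per feature.
    stumps = [(d, t) for d in range(5) for t in np.linspace(-1, 1, 9)]
    H = np.column_stack([np.sign(X[:, d] - t) + (X[:, d] == t) for d, t in stumps])
    w = greedy_exp_loss_boosting(H, y, rho=2.0, K=20)
    print("training error:", np.mean(np.sign(H @ w) * y <= 0))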
As pointed out in [3], other loss functions can also be used.\nUsing our analysis, we may also obtain sparse approximation bounds for these different\nloss functions. However, it is also easy to observe that they will not lead to the exponential\ndecay of classi\ufb01cation error in the separable case. Although the exponential loss in (3) is\nattractive for separable problems due to the exponential decay of margin error, it is very\nsensitive to outliers in the non-separable case.\n\nWe shall mention that an interesting aspect of boosting is the concept of adaptive resam-\npling or sample reweighting. Although this idea has dominated the interpretation of boost-\ning algorithms, it has been argued in [3] that adaptive resampling is only a computational\nby-product. The idea corresponds to a Newton step approximation in the sparse greedy\nin Algorithm 2.1 under the additive model framework which we consider\nhere. Our analysis further con\ufb01rmed that the greedy sparse solution of an additive model\nin (1), rather than reweighting itself is the key component in boosting. In our framework,\nit is also much easier to related the idea of boosting to the greedy function approximation\nmethod outlined in [1, 5].\n\nsolution of \u0001\r\f\b\t\n\n4.3 Mixture density estimation\n\n. The following negative log-likelihood is commonly used as loss function:\n\nis the probability density function of the input\n\nIn mixture density estimation, the output \nvector at \u0001\n\t\u000e\r\nwhere\u0004\nAgain, we consider a set of basis predictors \u0004\nponents. We would like to \ufb01nd a mixture probability density model \u0004\n\nis a probability density function.\n\n\u0006\u0019\u0001\n\n\u0001\u0010\u000f\n\n\u0001\f\n\nbination of mixture components to approximately minimize the negative log-likelihood:\n\n\t , which are often called mixture com-\n\t as a convex com-\n\nC\u001bDFE\n\n\u0013\u0015\u0014\u0007\u0016\nThis problem was studied in [7]. The quantity \u001d\n\n\u0013\u001b\u0014\u0007\u0016\n\ns.t.\n\nas:\n\n\u0001\f\n\n\u0006\b\u0001\n\u0013%3\n\n(9)\n\n(10)\n\nde\ufb01ned in Lemma 3.1 can be computed\n\n\u001e \u001f\"!\nN\u0002\u0001\n\n\u0004\u0003\n\nN\u0002\u0001\n\nHKJML/N\u001bOBP\n\n\u0013\u0006\u0005\n\n\u001e \u001f\"!\n\n\u0011\u0014\u0013\n\n\u0001\u000b\n\n\u0001\u000b\n\n\u0006\u0019\u0001\n\u0006\u0019\u0001\n\nAn approximation bound can now be directly obtained from Theorem 3.1. It has a form\nsimilar to the bound given in [7].\n\n5 Conclusion\n\nThis paper studies a formalization of a general class of prediction problems in machine\nlearning, where the goal is to approximate the best model as a convex combination of\n\n\u001c\n\u0001\n\u0004\n\t\n\u0006\n\u0011\n\u0013\n\nD\n\u0004\n\u0001\n\u0001\n\t\n\u0006\n\u0001\n\u0001\n\t\n3\n4\n\u0001\n\u0001\n\u0002\n\u0006\n\u0011\n\u0013\n\nD\n\u0001\n\u0011\n*\n\u0017\n\u0013\n\u0004\n\u0013\n\t\n\n\t\n\t\n\u0011\n*\n\u0017\n\u0013\n\n\u0003\n\u0006\n\u0017\n4\nR\n\u001d\n\n\n\u000e\nP\n#\nP\n\u0011\n\u0016\n\u0001\n\u0001\n\t\n$\n\u0005\n$\n\u0001\n\u0001\n\t\n$\n\n\u000e\n#\n\n\u0003\n\u0004\n\u0016\n\t\n$\n\u0004\n$\n\t\n$\nR\n\fa family of basic models. The quality of the approximation can be measured by a loss\nfunction which we want to minimize.\n\nWe proposed a greedy algorithm to solve the problem, and we have shown that for a variety\nof loss functions, a convergence rate of \u0002\u0001\n\u0007\n\t can be achieved using a convex combina-\ntion of \u0007 basic models. 
5 Conclusion

This paper studies a formalization of a general class of prediction problems in machine learning, where the goal is to approximate the best model as a convex combination of a family of basic models. The quality of the approximation can be measured by a loss function which we want to minimize.

We proposed a greedy algorithm to solve this problem, and we have shown that for a variety of loss functions, a convergence rate of O(1/k) can be achieved using a convex combination of k basic models. We have illustrated the consequences of this general algorithm in regression, classification and density estimation, and related the resulting algorithms to previous methods.

References

[1] A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, 1993.

[2] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

[3] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337-407, 2000. With discussion.

[4] T.J. Hastie and R.J. Tibshirani. Generalized Additive Models. Chapman and Hall, London, 1990.

[5] L.K. Jones. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. The Annals of Statistics, 20(1):608-613, 1992.

[6] W.S. Lee, P.L. Bartlett, and R.C. Williamson. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 42(6):2118-2132, 1996.

[7] J.Q. Li and A.R. Barron. Mixture density estimation. In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 279-285. MIT Press, 2000.

[8] R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.

[9] A.J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems 13, pages 619-625, 2001.

[10] T. Zhang. Some sparse approximation bounds for regression problems. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 624-631, 2001.
", "award": [], "sourceid": 2051, "authors": [{"given_name": "T.", "family_name": "Zhang", "institution": null}]}