Part of Advances in Neural Information Processing Systems 19 (NIPS 2006)

*Shai Shalev-Shwartz, Yoram Singer*

We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can all be derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization.

1 Introduction and Problem Setting

Several problems arising in machine learning can be modeled as a convex repeated game. Convex repeated games are closely related to online convex programming (see [19, 9] and the discussion in the last section). A convex repeated game is a two-player game that is performed in a sequence of consecutive rounds. On round $t$ of the repeated game, the first player chooses a vector $\mathbf{w}_t$ from a convex set $S$. Next, the second player responds with a convex function $g_t : S \to \mathbb{R}$. Finally, the first player suffers an instantaneous loss $g_t(\mathbf{w}_t)$. We study the game from the viewpoint of the first player. The goal of the first player is to minimize its cumulative loss, $\sum_t g_t(\mathbf{w}_t)$.

To motivate this rather abstract setting, let us first cast the more familiar setting of online learning as a convex repeated game. Online learning is performed in a sequence of consecutive rounds. On round $t$, the learner first receives a question, cast as a vector $\mathbf{x}_t$, and is required to provide an answer to this question. For example, $\mathbf{x}_t$ can be an encoding of an email message and the question is whether the email is spam or not. The prediction of the learner is performed based on a hypothesis, $h_t : X \to Y$, where $X$ is the set of questions and $Y$ is the set of possible answers. In the aforementioned example, $Y$ would be $\{+1, -1\}$, where $+1$ stands for a spam email and $-1$ stands for a benign one. After predicting an answer, the learner receives the correct answer to the question, denoted $y_t$, and suffers a loss according to a loss function $\ell(h_t, (\mathbf{x}_t, y_t))$. In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, $H = \{h_{\mathbf{w}} : \mathbf{w} \in S\}$. For example, the set of linear classifiers, which is used for answering yes/no questions, is defined as $H = \{h_{\mathbf{w}}(\mathbf{x}) = \mathrm{sign}(\langle \mathbf{w}, \mathbf{x} \rangle) : \mathbf{w} \in \mathbb{R}^n\}$. Thus, rather than saying that on round $t$ the learner chooses a hypothesis, we can say that the learner chooses a vector $\mathbf{w}_t$ and its hypothesis is $h_{\mathbf{w}_t}$.
Next, we note that once the environment chooses a question-answer pair $(\mathbf{x}_t, y_t)$, the loss function becomes a function over the hypothesis space, or equivalently over the set of parameter vectors $S$. We can therefore redefine the online learning process as follows. On round $t$, the learner chooses a vector $\mathbf{w}_t \in S$, which defines a hypothesis $h_{\mathbf{w}_t}$ to be used for prediction. Then, the environment chooses a question-answer pair $(\mathbf{x}_t, y_t)$, which induces the following loss function over the set of parameter vectors: $g_t(\mathbf{w}) = \ell(h_{\mathbf{w}}, (\mathbf{x}_t, y_t))$. Finally, the learner suffers the loss $g_t(\mathbf{w}_t) = \ell(h_{\mathbf{w}_t}, (\mathbf{x}_t, y_t))$. We have therefore described the process of online learning as a convex repeated game.

In this paper we assess the performance of the first player using the notion of regret. Given a number of rounds $T$ and a fixed vector $\mathbf{u} \in S$, we define the regret of the first player as the excess loss for
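The reduction just described can be sketched in code. The toy loop below uses a hinge loss for a linear classifier and a simple subgradient step; both are illustrative choices not prescribed by the paper, and all names are hypothetical. Each round shows how a question-answer pair induces a convex loss $g_t$ suffered at the played vector:

```python
import numpy as np

# Toy instantiation of online learning as a convex repeated game:
# linear classifiers with the hinge loss g_t(w) = max(0, 1 - y_t <w, x_t>).
def g(w, x, y):
    """Instantaneous convex loss induced by the question-answer pair (x, y)."""
    return max(0.0, 1.0 - y * np.dot(w, x))

rng = np.random.default_rng(0)
T, n = 50, 5
w = np.zeros(n)                             # the learner's parameter vector w_t
cumulative_loss = 0.0
for t in range(T):
    x = rng.normal(size=n)                  # the round's question x_t
    y = 1.0 if x[0] > 0 else -1.0           # the environment's answer y_t
    loss = g(w, x, y)                       # learner suffers g_t(w_t)
    cumulative_loss += loss
    if loss > 0:                            # a simple subgradient step; the
        w += 0.1 * y * x                    # paper's framework generalizes this
```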

not consistently playing the vector $\mathbf{u}$,
$$\frac{1}{T} \sum_{t=1}^{T} g_t(\mathbf{w}_t) \;-\; \frac{1}{T} \sum_{t=1}^{T} g_t(\mathbf{u}) \; .$$
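As a small concrete illustration of this quantity, consider hypothetical one-dimensional squared losses $g_t(w) = (w - z_t)^2$; the best fixed comparator in hindsight is then the mean of the targets, and a player who happens to play it on every round incurs zero regret:

```python
import numpy as np

# Regret of a fixed sequence of plays against a fixed comparator u,
# for the illustrative losses g_t(w) = (w - z_t)^2.
z = np.array([0.0, 1.0, 0.0, 1.0])          # targets chosen by the environment
w_played = np.array([0.5, 0.5, 0.5, 0.5])   # the first player's choices
u = z.mean()                                # best fixed point in hindsight

avg_loss_player = np.mean((w_played - z) ** 2)
avg_loss_u = np.mean((u - z) ** 2)
regret = avg_loss_player - avg_loss_u       # zero: the player matched u exactly
```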

Our main result is an algorithmic framework for the first player which guarantees low regret with respect to any vector $\mathbf{u} \in S$. Specifically, we derive regret bounds that take the following form: for all $\mathbf{u} \in S$,
$$\frac{1}{T} \sum_{t=1}^{T} g_t(\mathbf{w}_t) \;-\; \frac{1}{T} \sum_{t=1}^{T} g_t(\mathbf{u}) \;\le\; \frac{f(\mathbf{u}) + L}{\sqrt{T}} \; , \qquad (1)$$

where $f : S \to \mathbb{R}$ and $L \in \mathbb{R}_+$. Informally, the function $f$ measures the "complexity" of vectors in $S$ and the scalar $L$ is related to some generalized Lipschitz property of the functions $g_1, \ldots, g_T$. We defer the exact requirements we impose on $f$ and $L$ to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Specifically, we rewrite Eq. (1) as follows:
$$\frac{1}{T} \sum_{t=1}^{T} g_t(\mathbf{w}_t) \;\le\; \inf_{\mathbf{u} \in S} \; \frac{1}{T} \sum_{t=1}^{T} g_t(\mathbf{u}) \;+\; \frac{f(\mathbf{u}) + L}{\sqrt{T}} \; . \qquad (2)$$

That is, the average loss of the first player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of $\mathbf{u}$ and the "complexity" of $\mathbf{u}$ as measured by the function $f$. Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight, after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the first player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem which can be optimized incrementally as the game progresses. In order to derive explicit quantitative regret bounds, we make immediate use of the fact that the dual objective lower-bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie together the primal objective value, the average loss of the first player, and the increase in the dual.

The rest of this paper is organized as follows. In Sec. 2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Specifically, in Sec. 6 we underscore the applicability of our framework to online learning and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16].

2 Mathematical Background

We denote scalars with lower case letters (e.g. $x$ and $w$), and vectors with bold face letters (e.g. $\mathbf{x}$ and $\mathbf{w}$). The inner product between vectors $\mathbf{x}$ and $\mathbf{w}$ is denoted by $\langle \mathbf{x}, \mathbf{w} \rangle$. Sets are designated by upper case letters (e.g. $S$). The set of non-negative real numbers is denoted by $\mathbb{R}_+$. For any $k \ge 1$, the set of integers $\{1, \ldots, k\}$ is denoted by $[k]$. A norm of a vector $\mathbf{x}$ is denoted by $\|\mathbf{x}\|$. The dual norm is defined as $\|\boldsymbol{\lambda}\|_\star = \sup\{\langle \mathbf{x}, \boldsymbol{\lambda} \rangle : \|\mathbf{x}\| \le 1\}$. For example, the Euclidean norm, $\|\mathbf{x}\|_2 = (\langle \mathbf{x}, \mathbf{x} \rangle)^{1/2}$, is dual to itself, and the $\ell_1$ norm, $\|\mathbf{x}\|_1 = \sum_i |x_i|$, is dual to the $\ell_\infty$ norm, $\|\mathbf{x}\|_\infty = \max_i |x_i|$.

We next recall a few definitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1; for a more thorough introduction see for example [1]. A set $S$ is
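These dual-norm pairs can be checked numerically. The sketch below is a crude sampling estimate of $\sup\{\langle \mathbf{x}, \boldsymbol{\lambda} \rangle : \|\mathbf{x}\| \le 1\}$, not an exact computation; the helper name `dual_norm` is ours, introduced only for this illustration:

```python
import numpy as np

# Estimate the dual norm sup{<x, lam> : ||x|| <= 1} by sampling random
# directions and rescaling them onto the unit sphere of the given norm.
def dual_norm(norm, lam, n_samples=20000, seed=0):
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_samples):
        x = rng.normal(size=lam.shape)
        x /= norm(x)                        # now norm(x) == 1
        best = max(best, float(np.dot(x, lam)))
    return best

lam = np.array([3.0, -4.0])
l2 = lambda v: np.linalg.norm(v, 2)
l1 = lambda v: np.linalg.norm(v, 1)

est_l2_dual = dual_norm(l2, lam)            # approaches ||lam||_2 = 5
est_l1_dual = dual_norm(l1, lam)            # approaches ||lam||_inf = 4
```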

convex if for any two vectors $\mathbf{w}_1, \mathbf{w}_2$ in $S$, the line segment between $\mathbf{w}_1$ and $\mathbf{w}_2$ also lies within $S$. That is, for any $\alpha \in [0, 1]$ we have that $\alpha \mathbf{w}_1 + (1 - \alpha) \mathbf{w}_2 \in S$. A set $S$ is open if every point in $S$ has a neighborhood lying in $S$. A set $S$ is closed if its complement is an open set. A function $f : S \to \mathbb{R}$ is closed and convex if for any scalar $\alpha \in \mathbb{R}$, the level set $\{\mathbf{w} : f(\mathbf{w}) \le \alpha\}$ is closed and convex. The Fenchel conjugate of a function $f : S \to \mathbb{R}$ is defined as $f^\star(\boldsymbol{\theta}) = \sup_{\mathbf{w} \in S} \langle \mathbf{w}, \boldsymbol{\theta} \rangle - f(\mathbf{w})$. If $f$ is closed and convex, then the Fenchel conjugate of $f^\star$ is $f$ itself. The Fenchel-Young inequality states that for any $\mathbf{w}$ and $\boldsymbol{\theta}$ we have that $f(\mathbf{w}) + f^\star(\boldsymbol{\theta}) \ge \langle \mathbf{w}, \boldsymbol{\theta} \rangle$.

A vector $\boldsymbol{\lambda}$ is a sub-gradient of a function $f$ at $\mathbf{w}$ if for all $\mathbf{w}' \in S$ we have that $f(\mathbf{w}') - f(\mathbf{w}) \ge \langle \mathbf{w}' - \mathbf{w}, \boldsymbol{\lambda} \rangle$. The differential set of $f$ at $\mathbf{w}$, denoted $\partial f(\mathbf{w})$, is the set of all sub-gradients of $f$ at $\mathbf{w}$. If $f$ is differentiable at $\mathbf{w}$, then $\partial f(\mathbf{w})$ consists of a single vector which amounts to the gradient of $f$ at $\mathbf{w}$ and is denoted by $\nabla f(\mathbf{w})$. Sub-gradients play an important role in the definition of the Fenchel conjugate. In particular, the following lemma states that if $\boldsymbol{\lambda} \in \partial f(\mathbf{w})$ then the Fenchel-Young inequality holds with equality.

Lemma 1 Let $f$ be a closed and convex function and let $\partial f(\mathbf{w}')$ be its differential set at $\mathbf{w}'$. Then, for all $\boldsymbol{\lambda}' \in \partial f(\mathbf{w}')$ we have, $f(\mathbf{w}') + f^\star(\boldsymbol{\lambda}') = \langle \boldsymbol{\lambda}', \mathbf{w}' \rangle$.

A continuous function $f$ is $\sigma$-strongly convex over a convex set $S$ with respect to a norm $\|\cdot\|$ if $S$ is contained in the domain of $f$ and for all $\mathbf{v}, \mathbf{u} \in S$ and $\alpha \in [0, 1]$ we have
$$f(\alpha \mathbf{v} + (1 - \alpha) \mathbf{u}) \;\le\; \alpha f(\mathbf{v}) + (1 - \alpha) f(\mathbf{u}) - \frac{\sigma}{2} \alpha (1 - \alpha) \|\mathbf{v} - \mathbf{u}\|^2 \; . \qquad (3)$$
Strongly convex functions play an important role in our analysis, primarily due to the following lemma.

Lemma 2 Let $\|\cdot\|$ be a norm over $\mathbb{R}^n$ and let $\|\cdot\|_\star$ be its dual norm. Let $f$ be a $\sigma$-strongly convex function on $S$ and let $f^\star$ be its Fenchel conjugate. Then, $f^\star$ is differentiable with $\nabla f^\star(\boldsymbol{\theta}) = \arg\max_{\mathbf{x} \in S} \langle \boldsymbol{\theta}, \mathbf{x} \rangle - f(\mathbf{x})$. Furthermore, for any $\boldsymbol{\theta}, \boldsymbol{\lambda} \in \mathbb{R}^n$ we have
$$f^\star(\boldsymbol{\theta} + \boldsymbol{\lambda}) - f^\star(\boldsymbol{\theta}) \;\le\; \langle \nabla f^\star(\boldsymbol{\theta}), \boldsymbol{\lambda} \rangle + \frac{1}{2\sigma} \|\boldsymbol{\lambda}\|_\star^2 \; .$$

Two notable examples of strongly convex functions which we use are as follows.
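The Fenchel-Young inequality and its equality case in Lemma 1 are easy to verify numerically for the differentiable function $f(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$, whose conjugate is itself (Example 1); the following is only a sanity-check sketch:

```python
import numpy as np

# Fenchel-Young check for f(w) = 0.5 ||w||_2^2, with f*(theta) = 0.5 ||theta||_2^2.
# The inequality f(w) + f*(theta) >= <w, theta> holds for every pair, and by
# Lemma 1 it is tight when theta is the gradient of f at w, i.e. theta = w here.
f = lambda w: 0.5 * np.dot(w, w)
f_star = lambda th: 0.5 * np.dot(th, th)

rng = np.random.default_rng(1)
w = rng.normal(size=4)
theta = rng.normal(size=4)

gap = f(w) + f_star(theta) - np.dot(w, theta)   # equals 0.5||w - theta||^2 >= 0
tight_gap = f(w) + f_star(w) - np.dot(w, w)     # theta = grad f(w) = w, so 0
```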
Example 1 The function $f(\mathbf{w}) = \frac{1}{2} \|\mathbf{w}\|_2^2$ is 1-strongly convex over $S = \mathbb{R}^n$ with respect to the $\ell_2$ norm. Its conjugate function is $f^\star(\boldsymbol{\theta}) = \frac{1}{2} \|\boldsymbol{\theta}\|_2^2$.

Example 2 The function $f(\mathbf{w}) = \sum_{i=1}^{n} w_i \log\!\left(\frac{w_i}{1/n}\right)$ is 1-strongly convex over the probability simplex, $S = \{\mathbf{w} \in \mathbb{R}_+^n : \|\mathbf{w}\|_1 = 1\}$, with respect to the $\ell_1$ norm. Its conjugate function is $f^\star(\boldsymbol{\theta}) = \log\!\left(\frac{1}{n} \sum_{i=1}^{n} \exp(\theta_i)\right)$.
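The conjugate pair in Example 2 can also be checked numerically: the supremum defining $f^\star$ is attained at the softmax of $\boldsymbol{\theta}$ (a standard fact, used here only as an illustrative check):

```python
import numpy as np

# Check Example 2: for f(w) = sum_i w_i log(w_i / (1/n)) on the simplex,
# the conjugate is f*(theta) = log((1/n) sum_i exp(theta_i)). We compare the
# closed form against sup_w <w, theta> - f(w), attained at w_i ∝ exp(theta_i).
n = 4
theta = np.array([0.2, -1.0, 0.5, 0.0])

f = lambda w: np.sum(w * np.log(w * n))
f_star_closed = np.log(np.mean(np.exp(theta)))

w_opt = np.exp(theta) / np.sum(np.exp(theta))   # softmax attains the supremum
f_star_via_sup = np.dot(w_opt, theta) - f(w_opt)
```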

3 Generalized Fenchel Duality

In this section we derive our main analysis tool. We start by considering the following optimization problem,
$$\inf_{\mathbf{w} \in S} \; c f(\mathbf{w}) + \sum_{t=1}^{T} g_t(\mathbf{w}) \; ,$$

where $c$ is a non-negative scalar. An equivalent problem is
$$\inf_{\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_T} \; c f(\mathbf{w}_0) + \sum_{t=1}^{T} g_t(\mathbf{w}_t) \quad \text{s.t.} \quad \mathbf{w}_0 \in S \;\text{ and }\; \forall t \in [T], \; \mathbf{w}_t = \mathbf{w}_0 \; .$$

Introducing $T$ vectors $\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T$, where each $\boldsymbol{\lambda}_t \in \mathbb{R}^n$ is a vector of Lagrange multipliers for the equality constraint $\mathbf{w}_t = \mathbf{w}_0$, we obtain the following Lagrangian:
$$\mathcal{L}(\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_T, \boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T) \;=\; c f(\mathbf{w}_0) + \sum_{t=1}^{T} g_t(\mathbf{w}_t) + \sum_{t=1}^{T} \langle \boldsymbol{\lambda}_t, \mathbf{w}_0 - \mathbf{w}_t \rangle \; .$$
The dual problem is the task of maximizing the following dual objective value,
$$\begin{aligned}
\mathcal{D}(\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T) \;&=\; \inf_{\mathbf{w}_0 \in S, \mathbf{w}_1, \ldots, \mathbf{w}_T} \mathcal{L}(\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_T, \boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T) \\
&=\; - c \sup_{\mathbf{w}_0 \in S} \left( \left\langle \mathbf{w}_0, -\tfrac{1}{c} \textstyle\sum_{t=1}^{T} \boldsymbol{\lambda}_t \right\rangle - f(\mathbf{w}_0) \right) \;-\; \sum_{t=1}^{T} \sup_{\mathbf{w}_t} \left( \langle \mathbf{w}_t, \boldsymbol{\lambda}_t \rangle - g_t(\mathbf{w}_t) \right) \\
&=\; - c\, f^\star\!\left( -\tfrac{1}{c} \textstyle\sum_{t=1}^{T} \boldsymbol{\lambda}_t \right) \;-\; \sum_{t=1}^{T} g_t^\star(\boldsymbol{\lambda}_t) \; ,
\end{aligned}$$

where, following the exposition of Sec. 2, $f^\star, g_1^\star, \ldots, g_T^\star$ are the Fenchel conjugate functions of $f, g_1, \ldots, g_T$. Therefore, the generalized Fenchel dual problem is
$$\sup_{\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T} \; - c\, f^\star\!\left( -\tfrac{1}{c} \sum_{t=1}^{T} \boldsymbol{\lambda}_t \right) \;-\; \sum_{t=1}^{T} g_t^\star(\boldsymbol{\lambda}_t) \; . \qquad (4)$$
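As a sanity check of Eq. (4) and of weak duality, one can instantiate the problem with $f(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2$ and quadratic losses $g_t(\mathbf{w}) = \frac{1}{2}\|\mathbf{w} - \mathbf{z}_t\|^2$; these are hypothetical choices for illustration, with standard conjugates $f^\star(\boldsymbol{\theta}) = \frac{1}{2}\|\boldsymbol{\theta}\|^2$ and $g_t^\star(\boldsymbol{\lambda}) = \langle \boldsymbol{\lambda}, \mathbf{z}_t \rangle + \frac{1}{2}\|\boldsymbol{\lambda}\|^2$:

```python
import numpy as np

# Primal: inf_w c*0.5||w||^2 + sum_t 0.5||w - z_t||^2, whose minimizer is
# w* = (sum_t z_t) / (c + T). Dual: Eq. (4) with the conjugates above.
rng = np.random.default_rng(2)
T, n, c = 5, 3, 2.0
Z = rng.normal(size=(T, n))                 # the targets z_1, ..., z_T

def primal(w):
    return c * 0.5 * np.dot(w, w) + sum(0.5 * np.sum((w - z) ** 2) for z in Z)

def dual(lams):
    s = np.sum(lams, axis=0)
    return -c * 0.5 * np.dot(s / c, s / c) - sum(
        np.dot(l, z) + 0.5 * np.dot(l, l) for l, z in zip(lams, Z))

w_star = Z.sum(axis=0) / (c + T)            # closed-form primal minimizer
lams_arbitrary = rng.normal(size=(T, n))    # any dual assignment: weak duality
lams_optimal = w_star - Z                   # lambda_t = grad g_t(w*): tight
```

Any dual assignment lower-bounds the primal optimum, and the subgradient-based assignment closes the gap, which is exactly the property the algorithmic framework exploits.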

Note that when $T = 1$ and $c = 1$, the above duality is the so-called Fenchel duality.

4 A Template Learning Algorithm for Convex Repeated Games

In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the first player, which we denote for short as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows:
$$\sum_{t=1}^{T} g_t(\mathbf{w}_t) - c L \;\le\; \inf_{\mathbf{u} \in S} \; c f(\mathbf{u}) + \sum_{t=1}^{T} g_t(\mathbf{u}) \; , \qquad (5)$$

where $c = \sqrt{T}$. Thus, up to the sublinear term $c L$, the cumulative loss of P1 lower-bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5). Our construction is based on the weak duality theorem, stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value, and therefore our performance becomes similar to the performance of the best fixed weight vector which minimizes the right-hand side of Eq. (5).

Initially, we use the elementary dual solution $\boldsymbol{\lambda}_t^1 = \mathbf{0}$ for all $t$. We assume that $\inf_{\mathbf{w}} f(\mathbf{w}) = 0$ and that for all $t$, $\inf_{\mathbf{w}} g_t(\mathbf{w}) = 0$, which imply that $\mathcal{D}(\boldsymbol{\lambda}_1^1, \ldots, \boldsymbol{\lambda}_T^1) = 0$. We assume in addition that $f$ is $\sigma$-strongly convex. Therefore, based on Lemma 2, the function $f^\star$ is differentiable. At trial $t$, P1 uses for prediction the vector
$$\mathbf{w}_t \;=\; \nabla f^\star\!\left( -\tfrac{1}{c} \sum_{i=1}^{T} \boldsymbol{\lambda}_i^t \right) \; . \qquad (6)$$
After predicting $\mathbf{w}_t$, P1 receives the function $g_t$ and suffers the loss $g_t(\mathbf{w}_t)$. Then, P1 updates the dual variables as follows. Denote by $\partial_t$ the differential set of $g_t$ at $\mathbf{w}_t$, that is,
$$\partial_t \;=\; \{ \boldsymbol{\lambda} : \forall \mathbf{w} \in S, \; g_t(\mathbf{w}) - g_t(\mathbf{w}_t) \ge \langle \boldsymbol{\lambda}, \mathbf{w} - \mathbf{w}_t \rangle \} \; . \qquad (7)$$

The new dual variables $(\boldsymbol{\lambda}_1^{t+1}, \ldots, \boldsymbol{\lambda}_T^{t+1})$ are set to be any set of vectors which satisfy the following two conditions:

(i) $\exists \boldsymbol{\lambda}' \in \partial_t$ s.t. $\mathcal{D}(\boldsymbol{\lambda}_1^{t+1}, \ldots, \boldsymbol{\lambda}_T^{t+1}) \ge \mathcal{D}(\boldsymbol{\lambda}_1^t, \ldots, \boldsymbol{\lambda}_{t-1}^t, \boldsymbol{\lambda}', \boldsymbol{\lambda}_{t+1}^t, \ldots, \boldsymbol{\lambda}_T^t)$

(ii) $\forall i > t$, $\boldsymbol{\lambda}_i^{t+1} = \mathbf{0}$.
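To make the template concrete, the sketch below uses the illustrative instantiation $f(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$, for which $\nabla f^\star$ is the identity, and the dual update that simply sets $\boldsymbol{\lambda}_t^{t+1}$ to a subgradient $\boldsymbol{\lambda}' \in \partial_t$, one admissible choice under condition (i). The prediction rule Eq. (6) then reduces to online (sub)gradient descent with step size $1/c = 1/\sqrt{T}$; the one-dimensional squared losses are our own toy example, not from the paper:

```python
import numpy as np

# Template algorithm with f(w) = 0.5 w^2 (so grad f* is the identity) and
# losses g_t(w) = 0.5 (w - z_t)^2. Setting lambda_t to a subgradient of g_t
# at w_t makes Eq. (6) an online gradient descent step of size 1/c.
rng = np.random.default_rng(3)
T = 100
c = np.sqrt(T)
z = rng.normal(loc=1.0, size=T)             # targets defining the losses g_t

lam_sum = 0.0                               # running sum of dual variables
player_loss, comparator_loss = 0.0, 0.0
u = z.mean()                                # fixed comparator in hindsight
for t in range(T):
    w_t = -lam_sum / c                      # Eq. (6): w_t = grad f*(-sum/c)
    player_loss += 0.5 * (w_t - z[t]) ** 2
    comparator_loss += 0.5 * (u - z[t]) ** 2
    lam_sum += w_t - z[t]                   # lambda_t := subgradient of g_t at w_t

avg_regret = (player_loss - comparator_loss) / T   # should vanish as O(1/sqrt(T))
```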
