*This post is part of the lecture notes of my class “Introduction to Online Learning” at Boston University, Fall 2019.*

* You can find all the lectures I published here.*

Last time, we saw that we can use Follow-The-Regularized-Leader (FTRL) on linearized losses:

Today, we will show a number of applications of FTRL with linearized losses, some easy ones and some more advanced ones.

As a remainder, the regret upper bound for FTRL for linearized losses that we proved last time for the case that the is -strongly convex w.r.t. for is

We also said that we are free to choose , so we will often set it to .

Remark 1As we said last time, the algorithm is invariant to any positive constant added to the regularizer, hence we can always state the regret guarantee with instead of . However, for clarity in the following we will instead explicitly choose the regularizer such that their minimum is 0.

**1. FTRL with Linearized Losses Can Be Equivalent to OMD **

First, we see that even if FTRL and OMD seem very different, in certain cases they are equivalent. For example, consider that case that . The output of OMD is

Assume that for all . This implies that , that is . Assuming , we have

On the other hand, consider FTRL with linearized losses with regularizers , then

Assuming that , this implies that . Further, assuming that is invertible, implies that the predictions of FTRL and OMD are the same.

This equivalence immediately gives us some intuition on the role of in both algorithm: The same function is inducing the Bregman divergence, that is our similarity measure, and is the regularizer in FTRL. Moreover, the inverse of the growth rate of the regularizers in FTRL takes the role of the learning rate in OMD.

Example 1Consider and , then it satisfies the conditions above to have the predictions of OMD equal to the ones of FTRL.

**2. Exponentiated Gradient with FTRL: No Need to know **

Let’s see an example of an instantiation of FTRL with linearized losses to have the FTRL version of Exponentiated Gradient (EG).

Let and the sequence of loss functions be convex and -Lipschitz w.r.t. the L-infinity norm. Let defined as , where and we define . Set , that is -strongly convex w.r.t. the L1 norm, where is a parameter of the algorithm.

Given that the regularizers are strongly convex, we know that

We already saw that , that implies that . So, we have that

Note that this is exactly the same update of EG based on OMD, but here we are effectively using time-varying learning rates.

We also get that the regret guarantee is

where we used the fact that using and are equivalent. Choosing . This regret guarantee is similar to the one we proved for OMD, but with an important difference: We don’t have to know in advance the number of rounds . In OMD a similar bound would be vacuous because it would depend on the that is infinite.

**3. Composite Losses **

Let’s now see a variant of the linearization of the losses: *partial linearization of composite losses*.

Suppose that the losses we receive are composed by two terms: one convex function changing over time and another part is fixed and known. These losses are called *composite*. For example, we might have . Using the linearization, we might just take the subgradient of . However, in this particular case, we might lose the ability of the L1 norm to produce sparse solutions.

There is a better way to deal with these kind of losses: Move the constant part of the loss inside the regularization term. In this way, that part will not be linearized but used exactly in the argmin of the update. Assuming that the argmin is still easily computable, you can always expect better performance from this approach. In particular, in the case of adding an L1 norm to the losses, you will be predicting in each step with the solution of an L1 regularized optimization problem.

Practically speaking, in the example above, we will define , where we assume to be 1-strongly convex and the losses be -Lipschitz. Note that we use at time a term because we anticipate the next term in the next round. Given that is -strongly convex, using (1), we have

where . Reordering the terms, we have

Example 2Let’s also take a look at the update rule in that case that and we get composite losses with the L1 norm. We have

We can solve this problem observing that the minimization decomposes over each coordinate of . Denote by . Hence, we know from first-order optimality condition that is the solution for the coordinate iff there exists such that

Consider the 3 different cases:

- , then and .
- , then and .
- , then and .
So, overall we have

Observe as this update will produce sparse solutions, while just taking the subgradient of the L1 norm would have never produced sparse predictions.

Remark 2 (Proximal operators)In the example above, we calculated something like

This operation is known in the optimization literature as

Proximal Operatorof the L1 norm. In general, a proximal operator of a convex, proper, and closed function is defined asProximal operators are used in optimization in the same way as we used it: They allow to minimize the entire function rather a linear approximation of it. Also, proximal operators generalizes the concept of Euclidean projection. Indeed, .

**4. FTRL with Strongly Convex Functions **

Let’s now go back to the FTRL regret bound and let’s see if you can strengthen it in the case that the regularizer is *proximal*, that is it satisfies that .

Lemma 1Denote by . Assume that is not empty and set . Also, assume that is -strongly convex w.r.t. and convex, and the regularizer is such that . Also, assume that is non-empty. Then, we have

*Proof:* We have

where in the second inequality we used Corollary 1 from last lecture, the fact that , and . Observing that from the proximal property, we have that , . Hence, using the theorem of the subdifferential of sum of functions, and remembering that , we can choose such that we have .

Remark 3Note that a constant regularizer is proximal because any point is the minimizer of the zero function. On the other hand, a constant regularizer makes the two Lemmas the same,unlessthe loss functions contribute to the total strong convexity.

We will now use the above lemma to prove a logarithmic regret bound for strongly convex losses.

Corollary 2Let be strongly convex w.r.t. , for . Set the sequence of regularizers to zero. Then, FTRL guarantees a regret of

The above regret guarantee is the same of OMD over strongly convex losses, but here we don’t need to know the strong convexity of the losses. In fact, we just need to output the minimizer over the past losses. However, as we noticed last time, this might be undesirable because now each update is an optimization problem.

Hence, we can again use the idea of replacing the losses with an easy *surrogate*. In the Lipschitz case, it made sense to use linear losses. However, here you can do better and use *quadratic* losses, because the losses are strongly convex. So, we can run FTRL on the quadratic losses , where . The algorithm would be the following one:

To see why this is a good idea, consider the case that the losses are strongly convex w.r.t. the L2 norm. The update now becomes:

Moreover, we will get exactly the same regret bound as in Corollary 2, with the only difference that here the guarantee holds for a specific choice of the rather than for any subgradient in .

Example 3Going back to the example in the first lecture, where and are strongly convex, we now see immediately that FTRL without a regularizer, that is Follow the Leader, gives logarithmic regret. Note that in this case the losses were defined only over , so that the minimization is carried over .

**5. History Bits **

The first analysis of FTRL with composite losses is in (L. Xiao, 2010). The analysis presented here using the negative terms to easily prove regret bounds for FTRL for composite losses is from (F. Orabona and K. Crammer and N. Cesa-Bianchi, 2015).

The first proof of FTRL for strongly convex losses was in (S. Shalev-Shwartz and Y. Singer, 2007) (even if they don’t call it FTRL).

There is an interesting bit about FTRL-Proximal (McMahan, H. B., 2011): FTRL-Proximal is an instantiation of FTRL that uses a particular proximal regularizer. It became very famous in internet companies when Google disclosed in a very influential paper that they were using FTRL-Proximal to train the classifier for click prediction (McMahan, H. B. and Holt, G. and Sculley, D. and Young, M. and Ebner, D. and Grady, J. and Nie, L. and Phillips, T. and Davydov, E. and Golovin, D. and Chikkerur, S. and Liu, D. and Wattenberg, M. and Hrafnkelsson, A. M. and Boulos, T. and Kubica, J., 2013). This generated even more confusion because many people started conflating the term FTRL-Proximal (a specific algorithm) with FTRL (an entire family of algorithms). Unfortunately, this confusion is still going on in these days.

**6. Exercises **

Exercise 1Prove that the update in (2) is equivalent to the one of OSD with and learning rate .

What do you mean exactly by “proximal regularizer”? I imagined that one that includes euclidean distance from x_t but even if it were true I don’t know how to derive x_t \in argmin_x \psi_{t+1}(x) – \psi_t(x).

LikeLike

It is just a technical condition introduced by Brendan McMahan. There are not many algorithms that actually have a proximal regularizers, but constant regularizer is proximal. So, I used it because it allows me to deal with a bunch of different cases with one Lemma. In the note on Online Newton Step, for example, I used this characterization again.

LikeLike