Paper ID: | 254 |
---|---|

Title: | Stochastic Online AUC Maximization |

The paper studies optimizing AUC with a square loss for classification. By formulating the (expected) AUC optimization as a convex-concave saddle point problem, the paper proposes a stochastic online algorithm (SOLAM), which has one-datum space and per-iteration complexities. Theoretical convergence with high probability are proved and numerical examples on standard benchmark datasets are carried out to show SOLAM's efficiency.

The proposed algorithm is very interesting and has advantages on space and computational complexities, in comparing with state-of-the-art algorithms. The convergence rate of the derived result is optimal and the proof is technically sound. The paper is well written and easy to follow. Minor comments: 1. The choices on the parameters $R$ and the step-size are crucial (to achieve best performance) and in practice they are chosen using a cross-validation method, as shown in the numerical examples. It would be useful to make such a remark in the analysis section. 2. Theoretical results are established for decaying step-size setting while numerically, a constant step-size setting is considered in the paper. Thus, it would be better to make a remark after Corollary 1, to clarify that convergence result can also be established when an appropriate constant step-size is considered. 3. According to (14), one should also consider to tune the parameter $\kappa$ when running the algorithm.? 4. Will running with multiple passes over the data helps to improve the performances of the proposed algorithm? 5. Typos:\\ line 12-13, double meaning on `effectiveness'\\ line 33, `and has O(td) space and per-iteration complexity for d-dimensional data' is not clear to me.\\ line 76, $p$ is not introduced here.\\ line 88, $F(\mu,\alpha) $ should be $F(\mu,\alpha,\xi)$. \\ line 92, $R^d \times R^3$ should be $R^d \times R^3 \times \mathcal{Z}$. \\ line 91, $a$ stochastic SPP...\\ line 140, $\epsilon_f(\bar{u}_T)$ should be $\epsilon_f(\bar{v}_T,\bar{\alpha}_t)$.\\ Eq. (3) $I_{[y'=-1]}$.\\ line 44, $\Omega$ $is$ a convex set.\\ line 155, double `of'

2-Confident (read it all; understood it all reasonably well)

Area under ROC (AUC) is a metric which is widely use for measuring the classification performance for imbalanced data. This paper develops a online learning algorithm that Maximizes AUC for large-scale data. Specifically, this paper transfer AUC optimization problem into a saddle point problem, which makes AUC optimization consist of instance based loss function. This change makes it is possible to propose highly efficient (both on space and computation) online learning algorithms.

Firstly, AUC is a very important metric for many problems. So it is also very important to propose efficient AUC optimization algorithms. This algorithm is the first online AUC optimization algorithm, which enjoys one-datum spae and per-iteration complexities. So, it can affect many practical applications. Minor: Suggest that it is better to provide

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

The key contribution of this paper is to convert the AUC maximization via the least square surrogate loss to a stochastic saddle point problem (Theorem 1), where the loss defined on pairwise comparisons across different classes can be converted to a single sample average. With this, a novel online learning algorithm is proposed based on stochastic gradient method. The main advantage of this algorithm lies in its linear O(d) storage and per-iteration complexities, whereas other approaches typically need to store all previous examples or its second-order O(d^2) covariance matrix.

The conversion method of Stochastic Saddle Point seems only applied to the least square loss. How about other loss functions?

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

This paper proposes an online algorithm for AUC maximization with low per-iteration space and time complexity. The numerical experiments also suggest the performance of the proposed algorithm is comparable with existing algorithms with higher complexity.

1. I doubt whether the first equality of (8) holds: In the first equality, the authors were assuming E[(wx)(wx') | y=1, y'=-1] = E[wx | y=1] E[wx' | y'=-1]. I guess the authors believe this is true because (x,y) and (x',y') are independent samples. However, two independent random variables are not necessarily conditional independent. For instance, let X1 and X2 be two i.i.d. sample from distribution X= 1 with prob 1/2 and X=-1 with prob 1/2. Then, conditional on X1 = X2, they are no longer independent. Note E[X1 X2 | X1=X2] = 1 while E[X1 | X1 = X2] = E[X2| X1=X2] = 0. Of course, my example above does not imply this paper could not have the independence for wx and wx' conditional on (y=1, y=-1). Maybe the authors have a different reason for the expectation identity, e.g., conditional uncorrelated? The authors should justify this step. 2. Related to the above, it is very surprising that the independence between (x,y) and (x', y') disappear in the saddle point function. Remember that the argument in (8) requires the independence of (x,y) and (x, y'). But in the end when everything is written as integration, this information is gone. The authors just treat two integration variables as the same thing! This is how they got equation (6). However, by doing so, the physical meaning of indicator functions inside the integration are changed. Some of the indicator functions should be defined in one label y and some should be defined in another label y'. (So when you run an online gradient type alg, you need two samples to evaluate the point gradient.) But this paper just define all indicator functions in terms of the same y at the end of the proof for Thm 1. That's why their alg only need one sample to have online update. So I think even if (8) is correct, the authors can not simply write two expectations conditional on two different y as integrations with respect to the same variable. The original utility (4) by nature involves two samples and so should the reformulation in Thm 1. 3 Another comment on the proof of Thm 1 is: The saddle point argument from equation (8) in the proof for Thm 1 can be better explained. If we define the objective function in (3) as h(w), then the proof starting from (8) is to show "by introducing other (dummy) variables, h(w) is equal to min_{a,b} max_{alpha} F(w,a,b, alpha) for any w." (Although, given w, the minimizing and maximizing points (a,b) and (alpha) are actually fixed, F is preferred because expectations conditional on two labels are now decoupled as the sum of two expectations conditional on a single label.) Now if we minimize w on both sides, min_w h(w) is equivalent to min_{w,a,b} max_{alpha} F(w,a,b, alpha) and the solutions with respect to w on both sides are the same. The solution w to min_{w,a,b} max_{alpha} F(w,a,b, alpha) is given by a saddle point because F is convex-concave function.

2-Confident (read it all; understood it all reasonably well)

The authors provide a novel stochastic algorithm for Online AUC Maximization (non-adversarial setting). The challenge this paper addresses is that AUC is defined over a pair of examples and past approaches have required use of buffers in the online setting. The authors' main contribution is a new formulation of AUC maximization as a convex-concave saddle point problem. By using stochastic gradient descent/ascent, the authors achieve a per-iteration complexity to one example. Further, they go on to show that the error falls as O (1 / \sqrt T) for T iterations (up to log factors). Experiments are performed to compare the proposed method against existing methods.

The idea of formulating AUC Maximization as SPP is novel but as stated above, the experiments section could be improved. - For the experiments: how many epochs were used? What was the mini-batch size? Has the optimization converged for the results presented? - Can you rationalize the differences in performance for different methods? In particular, how does SOLAM do better than the offline methods for acoustic dataset? More feedback, for better reading: - p is used in line 76, much before it is defined in line 81. - The range of f is restricted to {-1, +1} (in line 70) but for the rest of the work, f(x) = w'x. Optional changes: - Simplify line 96 as Var(w'x | y = 1) for easier reading. - line 135: dual gap -> duality gap - Proof of theorem 2 is hard to parse and read. It would be great if you could fix it.

2-Confident (read it all; understood it all reasonably well)

The authors take the problem of Area Under Curve (AUC) optimization and pose it as a Stochastic Saddle Point Problem. Their main contribution is to provide a provable online algorithm that gives speed-ups in terms of storage and therefore the time it takes to run the algorithm. They compare it to some competing algorithms on 12 datasets and show the time speedup that they expected.

The authors seem to have done a good job highlighting their novelty and contributions. I will only note some typos and some minor issues. 1) On page 3 line 74, it should be z'=(x',y') and not z = (x,y) 2) Figure 1 is very small compared to the table a little hard to read. It would be nice to get a more zoomed in view to get a better picture of the state of the other algorithms in contention.

2-Confident (read it all; understood it all reasonably well)