{"title": "Dual Space Gradient Descent for Online Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4583, "page_last": 4591, "abstract": "One crucial goal in kernel online learning is to bound the model size. Common approaches employ budget maintenance procedures to restrict the model size using removal, projection, or merging strategies. Although projection and merging are known in the literature to be the most effective strategies, they demand extensive computation, whilst the removal strategy fails to retain the information of the removed vectors. An alternative way to address the model-size problem is to apply random features to approximate the kernel function. This allows the model to be maintained directly in the random-feature space, hence effectively resolving the curse of kernelization. However, this approach still suffers from a serious shortcoming: it needs a high-dimensional random-feature space to achieve a sufficiently accurate kernel approximation, which leads to a significant increase in computational cost. To address these challenges, we present in this paper the Dual Space Gradient Descent (DualSGD), a novel framework that utilizes random features as an auxiliary space to maintain information from data points removed during budget maintenance. Consequently, our approach permits the budget to be maintained in a simple, direct and elegant way while simultaneously mitigating the impact of the dimensionality issue on learning performance. 
We further provide convergence analysis and extensively conduct experiments on five real-world datasets to demonstrate the predictive performance and scalability of our proposed method in comparison with the state-of-the-art baselines.", "full_text": "Dual Space Gradient Descent for Online Learning\n\nTrung Le, Tu Dinh Nguyen, Vu Nguyen, Dinh Phung\n\nCentre for Pattern Recognition and Data Analytics\n\n{trung.l, tu.nguyen, v.nguyen, dinh.phung}@deakin.edu.au\n\nDeakin University, Australia\n\nAbstract\n\nOne crucial goal in kernel online learning is to bound the model size. Common\napproaches employ budget maintenance procedures to restrict the model sizes using\nremoval, projection, or merging strategies. Although projection and merging, in the\nliterature, are known to be the most effective strategies, they demand extensive com-\nputation whilst removal strategy fails to retain information of the removed vectors.\nAn alternative way to address the model size problem is to apply random features\nto approximate the kernel function. This allows the model to be maintained directly\nin the random feature space, hence effectively resolve the curse of kernelization.\nHowever, this approach still suffers from a serious shortcoming as it needs to use a\nhigh dimensional random feature space to achieve a suf\ufb01ciently accurate kernel\napproximation. Consequently, it leads to a signi\ufb01cant increase in the computational\ncost. To address all of these aforementioned challenges, we present in this paper\nthe Dual Space Gradient Descent (DualSGD), a novel framework that utilizes\nrandom features as an auxiliary space to maintain information from data points\nremoved during budget maintenance. Consequently, our approach permits the\nbudget to be maintained in a simple, direct and elegant way while simultaneously\nmitigating the impact of the dimensionality issue on learning performance. 
We\nfurther provide convergence analysis and extensively conduct experiments on \ufb01ve\nreal-world datasets to demonstrate the predictive performance and scalability of\nour proposed method in comparison with the state-of-the-art baselines.\n\nIntroduction\n\n1\nOnline learning represents a family of effective and scalable learning algorithms for incrementally\nbuilding a predictive model from a sequence of data samples [1]. Unlike the conventional learning\nalgorithms, which usually require a costly procedure to retrain the entire dataset when a new instance\narrives [2], the goal of online learning is to utilize new incoming instances to improve the model\ngiven knowledge of the correct answers to previously processed data. The seminal line of work in\nonline learning, referred to as linear online learning [3, 4], aims to learn a linear predictor in the\ninput space. The key limitation of this approach lies in its oversimpli\ufb01ed assumption in using a linear\nhyperplane to represent data that could possibly possess nonlinear dependency as commonly seen\nin many real-world applications. This inspires the work of kernel online learning [5, 6] that uses a\nlinear model in the feature space to capture the nonlinearity of input data.\nHowever, the kernel online learning approach suffers from the so-called curse of kernelization [7],\nthat is, the model size linearly grows with the data size accumulated over time. A notable approach\nto address this issue is to use a budget [8, 9, 7, 10, 11]. The work in [7] leveraged the budgeted\napproach with stochastic gradient descent (SGD) [12, 13] wherein the learning procedure employed\nSGD and a budget maintenance procedure (e.g., removal, projection, or merging) was employed to\nmaintain the model size. Although the projection and merging were shown to be effective [7], their\nassociated computational costs render them impractical for large-scale datasets. 
An alternative way to address the curse of kernelization is to use random features [14] to approximate a kernel function [15, 16].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain
Acknowledgment: This work is partially supported by the Australian Research Council under the Discovery Project DP160109394.

The work in [16] proposed to transform data from the input space to the random-feature space, and then performed SGD in that feature space. However, in order for this approach to achieve a good kernel approximation, an excessive number of random features is required, which can lead to serious computational issues.
In this paper, we propose the Dual Space Gradient Descent (DualSGD) to address the computational problem encountered in the projection and merging strategies of the budgeted approach [8, 9, 17, 7] and the excessive number of random features in the random-feature approach [15, 16]. In particular, the proposed DualSGD utilizes the random-feature space as an auxiliary space to store the information of the vectors that have been discarded during the budget maintenance process. More specifically, DualSGD uses a provision vector in the random-feature space to store the information of all removed vectors. This allows us to propose a novel budget maintenance strategy, named k-merging, which unifies the removal, projection, and merging strategies.

Figure 1: Comparison of DualSGD with BSGD-M and FOGD on the cod-rna dataset. Left: DualSGD vs. BSGD-M when B is varied. Right: DualSGD vs. FOGD when D is varied.

Our proposed DualSGD advances the existing work in the budgeted and random-feature approaches in two ways. Firstly, since the goal of using random features is to approximate the original feature space as closely as possible, the proposed k-merging of DualSGD can preserve the information of the removed vectors more effectively than the existing budget maintenance strategies. 
For example, compared with the budgeted SGD using the merging strategy (BSGD-M) [7], as shown in Fig. 1 (left), DualSGD with a small budget size (B = 5) attains a significantly better mistake rate than BSGD-M with an 80-fold larger budget size (B = 400). Secondly, since the core part of the model (i.e., the vectors in the support set) is stored in the feature space and the auxiliary part (i.e., the removed vectors) is stored in the random-feature space, our DualSGD can significantly reduce the influence of the number of random features on the learning performance. For example, compared with the Fourier Online Gradient Descent (FOGD) [16], as shown in Fig. 1 (right), DualSGD with a small number of random features (D = 20) achieves a mistake rate comparable to that of FOGD with a 40-fold larger number of random features (D = 800), and DualSGD with a medium number of random features (D = 100) achieves a predictive performance that is not reached by FOGD (a detailed comparison of the computational complexities of DualSGD and FOGD can be found in Section 3 of the supplementary material).
To provide a theoretical foundation for DualSGD, we develop an extensive convergence analysis for a wide spectrum of loss functions, including Hinge, Logistic, and smooth Hinge [18] for the classification task, and $\ell_1$ and $\varepsilon$-insensitive for regression. We conduct extensive experiments on five real-world datasets to compare the proposed method with state-of-the-art online learning methods. 
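For reference, the random-feature map z(·) referenced throughout follows the random Fourier feature construction of [14]; a minimal sketch for the RBF kernel is given below. This is our own illustrative code (function names and parameters are our assumptions, not the authors' implementation):

```python
import numpy as np

def make_rff_map(input_dim, num_features, gamma, seed=0):
    """Build a random Fourier feature map z(.) such that z(x) . z(x')
    approximates the RBF kernel exp(-gamma * ||x - x'||^2)."""
    rng = np.random.default_rng(seed)
    # Spectral sampling: frequencies drawn from N(0, 2*gamma*I), the Fourier
    # transform of the RBF kernel; phases drawn uniformly from [0, 2*pi).
    omega = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(num_features, input_dim))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=num_features)

    def z(x):
        return np.sqrt(2.0 / num_features) * np.cos(omega @ x + phase)

    return z

# The approximation tightens as num_features (D) grows -- exactly the
# cost/accuracy trade-off that DualSGD mitigates by storing only the
# removed vectors in this space.
z = make_rff_map(input_dim=8, num_features=5000, gamma=0.5)
x1, x2 = np.zeros(8), 0.1 * np.ones(8)
approx = float(z(x1) @ z(x2))                       # estimated K(x1, x2)
exact = float(np.exp(-0.5 * np.sum((x1 - x2) ** 2)))  # true RBF value
```

With D = 5000 features the inner product typically matches the exact kernel value to within a few hundredths, illustrating why a small D suffices when the random-feature space only has to carry auxiliary information.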
The experimental results show that our proposed DualSGD achieves the best predictive results in almost all cases, whilst its execution time is much faster than that of the baselines.

2 Dual Space Gradient Descent for Online Learning

2.1 Problem Setting
We propose to solve the optimization problem $\min_{w} J(w)$, whose objective function is defined for the online setting as follows:
$$J(w) \equiv \frac{\lambda}{2}\|w\|^2 + \mathbb{E}_{(x,y)\sim p_{\mathcal{X},\mathcal{Y}}}\left[l(w, x, y)\right] \quad (1)$$
where $x \in \mathbb{R}^M$ is the data vector, $y$ the label, $p_{\mathcal{X},\mathcal{Y}}$ denotes the joint distribution over $\mathcal{X} \times \mathcal{Y}$ with the data domain $\mathcal{X}$ and label domain $\mathcal{Y}$, $l(w, x, y)$ is a convex loss function with parameter $w$, and $\lambda \geq 0$ is a regularization parameter. A kernelization of the loss function introduces a nonlinear function $\Phi$ that maps $x$ from the input space to a feature space. A classic example is the Hinge loss: $l(w, x, y) = \max\left(0, 1 - y w^\top \Phi(x)\right)$.

2.2 The Key Ideas of the Proposed DualSGD
Our key motivations come from the shortcomings of the three current budget maintenance strategies: removal, projection and merging. The removal strategy fails to retain the information of the removed vectors. Although the projection strategy overcomes this problem, it requires a costly procedure to compute the inverse of a $B \times B$ matrix, where $B$ is the budget size, typically with cubic complexity in $B$. The merging strategy, on the other hand, needs to estimate the preimage of a vector in the feature space, leading to significant information loss and requiring extensive computation. Our aim is to find an approach that simultaneously retains the information of the removed vectors accurately and performs budget maintenance efficiently.
To this end, we introduce k-merging, a new budget maintenance approach that unifies the three aforementioned budget maintenance strategies under the following interpretation. 
For k = 1, the proposed k-merging can be seen as a hybrid strategy of removal and projection. For k = 2, it can be regarded as the standard merging. Moreover, our proposed k-merging strategy enables an arbitrary number of vectors to be conveniently merged. Technically, we employ a vector in the random-feature space [14], called the provision vector $\tilde{w}$, to retain the information of all removed vectors. When k-merging is invoked, the $k$ most redundant vectors, say $x_{i_1}, \dots, x_{i_k}$, are singled out, and we increment $\tilde{w}$ as $\tilde{w} = \tilde{w} + \sum_{j=1}^{k} \alpha_{i_j} z(x_{i_j})$, where $\alpha_{i_j}$ is the coefficient of the support vector associated with $x_{i_j}$, and $z(\cdot)$ denotes the mapping function from the input space to the random-feature space.
The advantage of using the random-feature space as an auxiliary space is twofold: 1) the information loss is negligible since the random-feature space is designed to approximate the original feature space, and 2) the operations in the budget maintenance strategy are direct and economic.

Algorithm 1 The learning of Dual Space Gradient Descent.
Input: kernel $K$, regularization parameter $\lambda$, budget $B$, random feature dimension $D$.
1: $\hat{w}_1 = 0$; $\tilde{w}_1 = 0$; $b = 0$; $I_0 = \emptyset$
2: for $t = 1, \dots, T$ do
3:   $(x_t, y_t) \sim p_{\mathcal{X},\mathcal{Y}}$
4:   $\hat{w}_{t+1} = \frac{t-1}{t}\hat{w}_t$; $\tilde{w}_{t+1} = \frac{t-1}{t}\tilde{w}_t$
5:   if $\nabla_o l\left(y_t, o^h_t\right) \neq 0$ then
6:     $\hat{w}_{t+1} = \hat{w}_{t+1} - \frac{1}{\lambda t}\nabla_o l\left(y_t, o^h_t\right)\Phi(x_t)$
7:     $I_t = I_{t-1} \cup \{t\}$
8:     if $|I_t| > B$ then
9:       invoke k-merging($I_t$, $\hat{w}_{t+1}$, $\tilde{w}_{t+1}$)
10:    end if
11:  end if
12: end for
Output: $w^h_{T+1} = \hat{w}_{T+1} \oplus \tilde{w}_{T+1}$.

2.3 The Proposed Algorithm
In our proposed DualSGD, the model is distributed into two spaces, the feature and random-feature spaces, with a hybrid vector $w^h_t$ defined as $w^h_t \triangleq \hat{w}_t \oplus \tilde{w}_t$. Here we note that the kernel part $\hat{w}_t$ and the provision part $\tilde{w}_t$ lie in two different spaces; thus, for convenience, we define an abstract operator $\oplus$ to allow the addition between them, which implies that the decision function crucially depends on both the kernel and provision parts:
$$\left\langle w^h_t, x \right\rangle \triangleq \left\langle \hat{w}_t \oplus \tilde{w}_t, x \right\rangle \triangleq \hat{w}_t^\top \Phi(x) + \tilde{w}_t^\top z(x)$$
We employ one vector $\tilde{w}_t$ in the random-feature space to preserve the information of the discarded vectors, i.e., those outside $I_t$, the set of indices of all support vectors in $\hat{w}_t$. When an instance arrives and the model size exceeds the budget $B$, the budget maintenance procedure k-merging($I_t$, $\hat{w}_{t+1}$, $\tilde{w}_{t+1}$) is invoked to adjust $\hat{w}_{t+1}$ and $\tilde{w}_{t+1}$ accordingly. Our proposed DualSGD is summarized in Algorithm 1, where we note that $l(y, o)$ is another representation of the convex loss function w.r.t. the variable $o$ (e.g., the Hinge loss given by $l(y, o) = \max(0, 1 - yo)$), and $o^h_t = \hat{w}_t^\top \Phi(x_t) + \tilde{w}_t^\top z(x_t)$ (i.e., the hybrid output value).

2.4 k-merging Budget Maintenance Strategy
Crucial to our proposed DualSGD in Algorithm 1 is the k-merging routine, which allows efficient merging of $k$ arbitrary vectors. We summarize the key steps of k-merging in Algorithm 2. In particular, we first select the $k$ support vectors whose corresponding coefficients $(\alpha_{i_1}, \alpha_{i_2}, \dots, \alpha_{i_k})$ have the smallest absolute values (cf. line 1). We then approximate them by $z(x_{i_1}), \dots, z(x_{i_k})$ and merge them by updating the provision vector as $\tilde{w}_{t+1} = \tilde{w}_{t+1} + \sum_{j=1}^{k} \alpha_{i_j} z(x_{i_j})$ (cf. line 2). Finally, we remove the chosen vectors from the kernel part $\hat{w}_{t+1}$ (cf. line 2).

2.5 Convergence Analysis
In this section, we present the convergence analysis for our proposed algorithm. We first prove that, with a high probability, $f^h_t(x)$ (i.e., the hybrid decision function, cf. Eq. (3)) is a good approximation of $f_t(x)$ for all $x$ and $t$ (cf. Theorem 1). Let $w^\star$ be the optimal solution of the optimization problem defined in Eq. (1): $w^\star = \operatorname{argmin}_{w} J(w)$. We then prove that if $\{w_t\}_{t=1}^{\infty}$ is constructed as in Eq. (2), this sequence rapidly converges to $w^\star$, or $f_t(x) = w_t^\top \Phi(x)$ rapidly approaches the optimal decision function (cf. Theorems 2, 3). Therefore, the decision function $f^h_t(x)$ also rapidly approaches the optimal decision function. Our analysis can be generalized to the general k-merging strategy, but for comprehensibility we present it for the 1-merging case (i.e., k = 1).
We assume that the loss function used in the analysis satisfies the condition $|\nabla_o l(y, o)| \leq A$ for all $y, o$, where $A$ is a positive constant. A wide spectrum of loss functions, including Hinge, Logistic, smooth Hinge [18], $\ell_1$, and $\varepsilon$-insensitive, satisfy this condition and hence are appropriate for this convergence analysis. We further assume that $\|\Phi(x)\| = K(x, x)^{1/2} = 1$ for all $x$. Let $\beta_t$ be a binary random variable indicating whether the budget maintenance procedure is performed at iteration $t$ (i.e., the event $\nabla_o l(y_t, o^h_t) \neq 0$). We assume that if $\beta_t = 1$, the vector $\Phi(x_{i_t})$ is selected to move to the random-feature space. Without loss of generality, we assume that $i_t = t$, since we can arrange the data instances so as to realize it. 
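As a concrete aside, the k-merging update of Algorithm 2 admits a compact implementation. The sketch below is our own illustrative code and naming (not the authors' release); it moves the k support vectors with the smallest |α| into a provision vector held in the random-feature space:

```python
import numpy as np

def k_merge(support, alpha, w_tilde, z, k):
    """Sketch of k-merging: move the k support vectors whose coefficients
    have the smallest absolute values (cf. Algorithm 2, line 1) into the
    provision vector w_tilde living in the random-feature space."""
    idx = set(np.argsort(np.abs(alpha))[:k])
    for i in idx:
        # cf. line 2: w~ <- w~ + alpha_i * z(x_i), then drop x_i from the budget.
        w_tilde = w_tilde + alpha[i] * z(support[i])
    keep = [i for i in range(len(alpha)) if i not in idx]
    return [support[i] for i in keep], alpha[keep], w_tilde

# Toy usage with an identity "feature map" standing in for z(.):
support = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]
alpha = np.array([0.5, -0.1, 1.0])
support, alpha, w_tilde = k_merge(support, alpha, np.zeros(2), z=lambda x: x, k=1)
```

Because only a sort over |α| and k feature-map evaluations are needed, the cost is far below the matrix inversion of projection or the preimage estimation of merging.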
We define
$$g^h_t = \lambda w_t + \nabla_o l\left(y_t, f^h_t(x_t)\right)\Phi(x_t) \quad \text{and} \quad w_{t+1} = w_t - \eta_t g^h_t \quad (2)$$
$$f_t(x) = w_t^\top \Phi(x) = \sum_{j=1}^{t}\alpha_j K(x_j, x); \qquad f^h_t(x) = \hat{w}_t^\top \Phi(x) + \tilde{w}_t^\top z(x) = \sum_{j=1}^{t}\alpha_j (1-\beta_j) K(x_j, x) + \sum_{j=1}^{t}\alpha_j \beta_j \tilde{K}(x_j, x) \quad (3)$$
where $\tilde{K}(x, x') = z(x)^\top z(x')$ is the approximated kernel induced by the random-feature space, and the learning rate is $\eta_t = \frac{1}{\lambda t}$.
Theorem 1 establishes that $f^h_t(\cdot)$ is a good approximation of $f_t(x)$ with a high probability, followed by Theorem 2, which establishes the bound on the regret.

Algorithm 2 k-merging Budget Maintenance Procedure.
procedure k-merging($I_t$, $\hat{w}_{t+1}$, $\tilde{w}_{t+1}$)
// Assume that $\hat{w}_{t+1} = \sum_{j \in I_t}\alpha_j \Phi(x_j)$
1: $(i_1, \dots, i_k) = \text{k-argmin}_{j \in I_t}\,|\alpha_j|$; $\hat{w}_{t+1} = \hat{w}_{t+1} - \sum_{j=1}^{k}\alpha_{i_j}\Phi(x_{i_j})$
2: $\tilde{w}_{t+1} = \tilde{w}_{t+1} + \sum_{j=1}^{k}\alpha_{i_j} z(x_{i_j})$; $I_t = I_t \setminus \{i_1, \dots, i_k\}$
end procedure

Theorem 1. With a probability at least $1 - \theta = 1 - 2^8\left(\frac{\sigma_\mu A d_{\mathcal{X}}}{\lambda\varepsilon}\right)^2 \exp\left(-\frac{D\lambda^2\varepsilon^2}{4(M+2)A^2}\right)$, where $M$ is the dimension of the input space, $D$ is the dimension of the random-feature space, $d_{\mathcal{X}}$ denotes the diameter of the compact set $\mathcal{X}$, and the constant $\sigma_\mu$ is defined as in [14], we have:
i) $\left|f_t(x) - f^h_t(x)\right| \leq \varepsilon$ for all $t > 0$ and $x \in \mathcal{X}$;
ii) $\mathbb{E}\left[\left|f_t(x) - f^h_t(x)\right|\right] \leq A^{-1}\lambda\varepsilon \sum_{j=1}^{t}\mathbb{E}\left[\alpha_j^2\right]^{1/2}\mu_j$, where $\mu_j = p(\beta_j = 1)$.

Theorem 1 shows that, with a high probability, $f^h_t(x)$ can approximate $f_t(x)$ with an $\varepsilon$-precision. It also indicates that, to decrease the gap $\left|f_t(x) - f^h_t(x)\right|$ when performing budget maintenance, we should choose the vectors whose coefficients have the smallest absolute values to move to the random-feature space.

Theorem 2. The following guarantee holds for all $T$:
$$\mathbb{E}\left[J(\bar{w}_T)\right] - J(w^\star) \leq \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} J(w_t) - J(w^\star)\right] \leq \frac{8A^2(\log T + 1)}{\lambda T} + \frac{W}{T}\sum_{t=1}^{T}\mathbb{E}\left[M_t^2\right]^{1/2}$$
where $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$, $M_t = \nabla_o l(y_t, f_t(x_t)) - \nabla_o l\left(y_t, f^h_t(x_t)\right)$, and $W = 2A\left(1 + \sqrt{5}\right)\lambda^{-1}$.

If a smooth loss function is used, we can quantify the gap in more detail; with a high probability the gap is negligible, as shown in Theorem 3.
Theorem 3. Assume that $l(y, o)$ is a $\gamma$-strongly smooth loss function. With a probability at least $1 - 2^8\left(\frac{\sigma_\mu A d_{\mathcal{X}}}{\lambda\varepsilon}\right)^2 \exp\left(-\frac{D\lambda^2\varepsilon^2}{4(M+2)A^2}\right)$, we have
$$\mathbb{E}\left[J(\bar{w}_T)\right] - J(w^\star) \leq \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} J(w_t) - J(w^\star)\right] \leq \frac{8A^2(\log T + 1)}{\lambda T} + \frac{W\gamma\varepsilon}{T}\sum_{t=1}^{T}\left(\frac{\sum_{j=1}^{t}\mu_j}{t}\right)^{1/2} \leq \frac{8A^2(\log T + 1)}{\lambda T} + W\gamma\varepsilon$$
where $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$.

3 Experiments
In this section, we conduct comprehensive experiments to quantitatively evaluate the performance of our proposed Dual Space Gradient Descent (DualSGD) on binary classification, multiclass classification and regression tasks under online settings. Our main goal is to examine the scalability, classification and regression capabilities of DualSGD by directly comparing it with several recent state-of-the-art online learning approaches on a number of real-world datasets with a wide range of sizes. In what follows, we present the data statistics, experimental setup, results and our observations.

3.1 Data Statistics and Experimental Setup
We use 5 datasets: ijcnn1, cod-rna, poker, year, and airlines. The datasets were purposely selected with various sizes in order to clearly expose the differences in the scalability of the models. Three of them are large-scale datasets with hundreds of thousands to millions of data points (year: 515,345; poker: 1,025,010; and airlines: 5,929,413), whilst the rest are medium-sized datasets (ijcnn1: 141,691 and cod-rna: 331,152). These datasets can be downloaded from the LIBSVM1 and UCI2 websites, except airlines, which was obtained from the American Statistical Association (ASA3). 
For the airlines dataset, our aim is to predict whether a \ufb02ight will be delayed or\nnot under binary classi\ufb01cation setting, and how long (in minutes) the \ufb02ight will be delayed in terms\nof departure time under regression setting. A \ufb02ight is considered delayed if its delay time is above\n15 minutes, and non-delayed otherwise. Following the procedure in [19], we extract 8 features for\n\ufb02ights in the year of 2008, and then normalize them into the range [0,1].\nFor each dataset, we perform 10 runs on each algorithm with different random permutations of the\ntraining data samples. In each run, the model is trained in a single pass through the data. Its prediction\nresult and time spent are then reported by taking the average together with the standard deviation over\nall runs. For comparison, we employ 11 state-of-the-art online kernel learning methods: perceptron\n[5], online gradient descent (OGD) [6], randomized budget perceptron (RBP) [9], forgetron [8]\nprojectron, projectron++ [20], budgeted passive-aggressive simple (BPAS) [17], budgeted SGD using\nmerging strategy (BSGD-M) [7], bounded OGD (BOGD) [21], Fourier OGD (FOGD) and Nystrom\nOGD (NOGD) [16]. Their implementations are published as a part of LIBSVM, BudgetedSVM4 and\nLSOKL5 toolboxes. We use a Windows machine with 3.46GHz Xeon processor and 96GB RAM to\nconduct our experiments.\n\n1https://www.csie.ntu.edu.tw/\u223ccjlin/libsvmtools/datasets/\n2https://archive.ics.uci.edu/ml/datasets.html\n3http://stat-computing.org/dataexpo/2009/.\n4http://www.dabi.temple.edu/budgetedsvm/index.html\n5http://lsokl.stevenhoi.com/\n\n5\n\n\f3.2 Model Evaluation on the Effect of Hyperparameters\nIn the \ufb01rst experiment, we investigate the effect of hyperparameters, i.e., budget size B, merging\nsize k and random feature dimension D (cf. 
Section 2) on the performance behavior of DualSGD.\nParticularly, we conduct an initial analysis to quantitatively evaluate the sensitivity of these hyperpa-\nrameters and their impact on the predictive accuracy and wall-clock time. This analysis provides an\napproach to \ufb01nd the best setting of hyperparameters. Here the DualSGD with Hinge loss is trained\non the cod-rna dataset under the online classi\ufb01cation setting.\n\nFigure 2: The effect of k-merging size on the mistake rate and running time (left). The effect of\nbudget size B and random feature dimension D on the mistake rate (middle) and running time (right).\nFirst we set B = 200, D = 100, and vary k in the range of 1, 2, 10, 20, 50, 100, 150. For each\nsetting, we run our models and record the average mistake rates and running time as shown in Fig. 2\n(left). There is a pattern that the classi\ufb01cation error increases for larger k whilst the wall-clock\ntime decreases. This represents the trade-off between model discriminative performance and model\ncomputational complexity via the number of merging vectors. In this analysis, we can choose k = 20\nto balance the performance and computational cost.\nFixing k = 20, we vary B and D in 4 values doubly increasing from 50 to 400 and from 100 to 800,\nrespectively, to evaluate the prediction performance and execution time. Fig. 2 depicts the average\nmistake rates (middle) and running time in seconds (right) as a heat map of these values. These\nvisualizations indicate that the higher B and D produce better classi\ufb01cation results, but hurt the\ntraining speed of the model. We found that increasing the dimension of random feature space from\n100 to 800 at B = 50 signi\ufb01cantly reduces the mistake rates by 25%, at the same time increases the\nwall-clock time by 76%. The same pattern with less effect is observed when increasing the budget\nsize B from 50 to 400 at D = 100 (mistake rate decreases by 1.5%, time increases by 54%). 
For a\ngood trade-off between classi\ufb01cation performance and computational cost, we select B = 100 and\nD = 200 which achieves fairly comparable classi\ufb01cation result and running time.\n3.3 Online Classi\ufb01cation\nWe now examine the performances of DualSGDs in the online classi\ufb01cation task. We use four\ndatasets: cod-rna, ijcnn1, poker and airlines (delayed and non-delayed labels). We create two\nversions of our approach: DualSGD with Hinge loss (DualSGD-Hinge) and DualSGD with Logistic\nloss (DualSGD-Logit). It is worth mentioning that the Hinge loss is not a smooth function with\nunde\ufb01ned gradient at the point that the classi\ufb01cation con\ufb01dence yf (x) = 1. Following the sub-\ngradient de\ufb01nition, in our experiment, we compute the gradient given the condition that yf (x) < 1,\nand set it to 0 otherwise.\nHyperparameters setting. There are a number of different hyperparameters for all methods. Each\nmethod requires a different set of hyperparameters, e.g., the regularization parameters (\u03bb in DualSGD),\nthe learning rates (\u03b7 in FOGD and NOGD), and the RBF kernel width (\u03b3 in all methods). Thus, for a\nfair comparison, these hyperparameters are speci\ufb01ed using cross-validation on a subset of data.\nIn particular, we further partition the training set into 80% for learning and 20% for valida-\ntion. For large-scale databases, we use only 1% of dataset, so that the searching can \ufb01n-\nish within an acceptable time budget. The hyperparameters are varied in certain ranges and\nselected for the best performance on the validation set. The ranges are given as follows:\nC \u2208{2\u22125, 2\u22123, ..., 215}, \u03bb \u2208{2\u22124/N, 2\u22122/N, ..., 216/N}, \u03b3 \u2208{2\u22128, 2\u22124, 2\u22122, 20, 22, 24, 28}, and\n\u03b7 \u2208{2\u22124, 2\u22123, ..., 2\u22121, 21, 22..., 24} where N is the number of data points. 
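As described above, the non-smooth Hinge loss is handled via a subgradient: the gradient is computed when yf(x) < 1 and set to 0 otherwise. That convention can be sketched in one function (our own illustrative code):

```python
def hinge_subgradient(y, o):
    """Subgradient of l(y, o) = max(0, 1 - y*o) w.r.t. the output o: equals
    -y when the margin y*o is below 1, and is set to 0 otherwise
    (including the non-differentiable point y*o = 1)."""
    return -y if y * o < 1 else 0.0
```

In Algorithm 1, a zero subgradient simply skips the update, so only margin-violating instances can enter the support set.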
The budget size $B$, merging size $k$ and random feature dimension $D$ of DualSGD are selected following the approach described in Section 3.2. For the budget size $\hat{B}$ in NOGD and the Pegasos algorithm, and the feature dimension $\hat{D}$ in FOGD, we use, for each dataset, values identical to those used in Section 7.1.1 of [16].

Table 1: Mistake rate (%) and execution time (seconds). The notation $[k; B; D; \hat{B}; \hat{D}]$ denotes the merging size $k$, the budget sizes $B$ and $\hat{B}$ of DualSGD-based models and other budgeted algorithms, and the numbers of random features $D$ and $\hat{D}$ of DualSGD and FOGD, respectively.

Dataset:         cod-rna [20; 100; 200; 400; 1,600]  |  ijcnn1 [20; 100; 200; 1,000; 4,000]
Algorithm      | Mistake Rate | Time     | Mistake Rate | Time
Perceptron     | 9.79±0.04    | 1,393.56 | 12.85±0.09   | 727.90
OGD            | 7.81±0.03    | 2,804.01 | 10.39±0.06   | 960.44
RBP            | 26.02±0.39   | 85.84    | 15.54±0.21   | 54.29
Forgetron      | 28.56±2.22   | 102.64   | 16.17±0.26   | 60.54
Projectron     | 11.16±3.61   | 97.38    | 12.98±0.23   | 59.37
Projectron++   | 17.97±15.60  | 1,799.93 | 9.97±0.09    | 749.70
BPAS           | 11.97±0.09   | 92.08    | 10.68±0.05   | 55.44
BSGD-M         | 5.33±0.04    | 184.58   | 9.14±0.18    | 1,562.61
BOGD           | 38.13±0.11   | 104.60   | 10.87±0.18   | 55.99
FOGD           | 7.15±0.03    | 53.45    | 9.41±0.03    | 25.93
NOGD           | 7.83±0.06    | 105.18   | 10.43±0.08   | 59.36
DualSGD-Hinge  | 4.92±0.25    | 28.29    | 8.35±0.20    | 12.12
DualSGD-Logit  | 4.83±0.21    | 31.96    | 8.82±0.24    | 13.30

Dataset:         poker [20; 100; 200; 1,000; 4,000]  |  airlines [20; 100; 200; 1,000; 4,000]
Algorithm      | Mistake Rate | Time     | Mistake Rate | Time
FOGD           | 52.28±0.04   | 1,270.75 | 20.98±0.01   | 928.89
NOGD           | 44.90±0.16   | 3,553.50 | 25.56±0.01   | 4,920.33
DualSGD-Hinge  | 46.73±0.22   | 472.21   | 19.28±0.00   | 139.87
DualSGD-Logit  | 46.65±0.14   | 523.23   | 19.28±0.00   | 133.50

Results. Table 1 reports the average classification results and execution times after the methods see all data samples. Note that for the two biggest datasets (poker, airlines), which consist of millions of data points, we only include the fast algorithms FOGD, NOGD and DualSGDs. The other methods would exceed the time limit, which we set to two hours, when running on such data, as they suffer from serious computational issues. From these results, we can draw the key observations below.
First, the budgeted online approaches show their effectiveness with substantially faster computation than the ones without budgets. More specifically, the execution time of our proposed models is about two orders of magnitude (100 times) lower than that of the regular online algorithms (e.g., 28.29 seconds compared with 2,804 seconds on the cod-rna dataset). Moreover, our models are twice as fast as the recent fast algorithm FOGD on the cod-rna and ijcnn1 datasets, and approximately eight and three times faster on the vast-sized poker and airlines data. This is because the DualSGDs maintain a sparse budget of support vectors and a low-dimensional random-feature space, whose size and dimensionality are 10 times and 20 times smaller than those of the other methods.
Second, in terms of classification accuracy, DualSGD-Hinge and DualSGD-Logit outperform the other methods on almost all datasets except poker. In particular, the DualSGD-based methods achieve the best mistake rates of 4.83±0.21, 8.35±0.20 and 19.28±0.00 on the cod-rna, ijcnn1 and airlines data, which are, respectively, 32.4%, 11.3% and 8.8% lower than the error rates of the second-best models, the two recent approaches FOGD and NOGD. On the poker dataset, our methods obtain results fairly comparable to those of NOGD, but still surpass FOGD by a large margin. 
The reason is that the\nDualSGD uses a dual space: a kernel space containing core support vectors and a random feature\nspace keeping the projections of the core vectors that are removed from the budget in kernel space.\nThis would minimize the information loss when the model performs budget maintenance.\nFinally, two versions of DualSGDs demonstrate similar discriminative performances and computa-\ntional complexities wherein the DualSGD-Logit is slightly slower due to the additional exponential\noperators. All of these observations validate the effectiveness and ef\ufb01ciency of our proposed tech-\nnique. Thus, we believe that our approximation machine is a promising technique for building\nscalable online kernel learning algorithms for large-scale classi\ufb01cation tasks.\n3.4 Online Regression\nThe last experiment addresses the online regression problem to evaluate the capabilities of our\napproach with two proposed loss functions: (cid:96)1 and \u03b5-insensitive losses. Incorporating these loss\nfunctions creates two versions: DualSGD-\u03b5, DualSGD-(cid:96)1. We use two datasets: year and airlines\n(delay minutes), and six baselines: RBP, Forgetron, Projectron, BOGD, FOGD and NOGD.\n\n7\n\n\fTable 2: Root mean squared error (RMSE) and execution time (seconds) of 6 baselines and 2 versions\nof our DualSGDs. 
The notation [k | B | D | ˆB | ˆD] carries the same meaning as in Table 1.

                 year [20 | 100 | 200 | 400 | 1,600]    airlines [20 | 100 | 200 | 1,000 | 2,000]
 Algorithm       RMSE           Time                    RMSE           Time
 RBP             0.19±0.00      605.42                  36.51±0.00     3,418.89
 Forgetron       0.19±0.00      904.09                  36.51±0.00     5,774.47
 Projectron      0.14±0.00      605.19                  36.14±0.00     3,834.19
 BOGD            0.20±0.00      596.10                  35.73±0.00     3,058.96
 FOGD            0.16±0.00      76.70                   53.16±0.01     646.15
 NOGD            0.14±0.00      607.37                  34.74±0.00     3,324.38
 DualSGD-ε       0.13±0.00      48.01                   36.20±0.01     457.30
 DualSGD-ℓ1      0.12±0.00      47.29                   36.20±0.01     443.39

Hyperparameter settings. We adopt the same hyperparameter search procedure as for the online classification task in Section 3.3. Furthermore, for the budget size ˆB and the feature dimension ˆD in FOGD, we follow the strategy used in Section 7.1.1 of [16]. More specifically, these hyperparameters are set separately for each dataset as reported in Table 2: they are chosen to be roughly proportional to the number of support vectors produced by the batch SVM algorithm in LIBSVM running on a small subset. The aim is to achieve competitive accuracy by using a relatively larger budget size for these more challenging regression tasks.
Results. Table 2 reports the average regression errors and computation costs after the methods have seen all data samples. From these results, we can draw the following observations.
First, our proposed models enjoy a significant advantage in computational efficiency while achieving better (on the year dataset) or competitive (on the airlines dataset) regression results compared with the other methods. The DualSGD, again, secures the best performance in terms of model sparsity.
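To make the dual-space bookkeeping concrete, the sketch below illustrates the core idea in a simplified form: a kernel budget of support vectors, plus a random Fourier feature space [14] that absorbs the contribution of every evicted vector instead of discarding it. This is our own illustrative implementation, not the authors' code; the class and method names (e.g., `DualSGDSketch`, `add`) are hypothetical, and a simple oldest-first eviction rule is assumed.

```python
import numpy as np

class DualSGDSketch:
    """Illustrative sketch (not the paper's exact algorithm) of the dual-space
    idea: at most `budget` support vectors in kernel space, plus a weight
    vector w over random Fourier features that retains evicted vectors."""

    def __init__(self, dim, budget=100, n_features=200, gamma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.budget, self.gamma = budget, gamma
        # Random Fourier features for the RBF kernel exp(-gamma * ||x - y||^2):
        # omega ~ N(0, 2*gamma*I), z(x) = sqrt(2/D) * cos(omega @ x + b),
        # so that E[z(x) . z(y)] = k(x, y).
        self.omega = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(n_features, dim))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.scale = np.sqrt(2.0 / n_features)
        self.support = []                 # (alpha, x) pairs kept in kernel space
        self.w = np.zeros(n_features)     # auxiliary random-feature weights

    def _z(self, x):
        return self.scale * np.cos(self.omega @ x + self.b)

    def _k(self, x, y):
        return np.exp(-self.gamma * np.sum((x - y) ** 2))

    def predict(self, x):
        # Decision value = exact kernel part over the budget
        # + approximate part carried by the random-feature space.
        return sum(a * self._k(xi, x) for a, xi in self.support) + self.w @ self._z(x)

    def add(self, x, alpha):
        """Admit a new support vector; on overflow, evict the oldest one but
        keep its contribution by projecting it into the random-feature space."""
        self.support.append((alpha, x))
        if len(self.support) > self.budget:
            a_old, x_old = self.support.pop(0)
            self.w += a_old * self._z(x_old)   # retained, not simply removed
```

With a sufficiently large `n_features`, the prediction changes only by the random-feature approximation error when a vector is evicted, in contrast to plain removal, which drops the evicted vector's term entirely.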
Among the baselines, FOGD is the fastest, so its time costs are the most comparable to those of our methods, but its regression performance is worse. The remaining algorithms usually obtain better results, but at the price of scalability.
Finally, comparing the two DualSGD variants, both models demonstrate similar regression capabilities and computational complexities, with DualSGD-ℓ1 being slightly faster due to its simpler gradient computation. Moreover, its regression scores are lower than or equal to those of DualSGD-ε. These observations once again verify the effectiveness and efficiency of our proposed techniques. Therefore, DualSGD is also a promising machine for performing online regression on large-scale datasets.
4 Conclusion
In this paper, we have proposed Dual Space Gradient Descent (DualSGD), which overcomes the computational burden of the projection and merging strategies in Budgeted SGD (BSGD) and the excessive number of random features required by Fourier Online Gradient Descent (FOGD). More specifically, we employ random features to form an auxiliary space for storing the vectors removed during the budget maintenance process, which makes the budget maintenance operations simple and convenient. We have further presented a convergence analysis that covers a wide spectrum of loss functions. Finally, we have conducted extensive experiments on several benchmark datasets to demonstrate the efficiency and accuracy of the proposed method.

References

[1] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.

[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1–27:27, May 2011.

[3] K. Crammer, O. Dekel, J. Keshet, S.
Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. J. Mach. Learn. Res., 7:551–585, 2006.

[4] M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In International Conference on Machine Learning, pages 264–271, 2008.

[5] Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Mach. Learn., 37(3):277–296, December 1999.

[6] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52:2165–2176, August 2004.

[7] Z. Wang, K. Crammer, and S. Vucetic. Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale SVM training. J. Mach. Learn. Res., 13(1):3103–3131, 2012.

[8] O. Dekel, S. Shalev-Shwartz, and Y. Singer. The Forgetron: A kernel-based perceptron on a fixed budget. In Advances in Neural Information Processing Systems, pages 259–266, 2005.

[9] G. Cavallanti, N. Cesa-Bianchi, and C. Gentile. Tracking the best hyperplane with a simple budget perceptron. Machine Learning, 69(2-3):143–167, 2007.

[10] T. Le, V. Nguyen, T. D. Nguyen, and D. Phung. Nonparametric budgeted stochastic gradient descent. In The 19th International Conference on Artificial Intelligence and Statistics, May 2016.

[11] T. Le, P. Duong, M. Dinh, T. D. Nguyen, V. Nguyen, and D. Phung. Budgeted semi-supervised support vector machine. In The 32nd Conference on Uncertainty in Artificial Intelligence, June 2016.

[12] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.

[13] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML 2007, pages 807–814, 2007.

[14] A. Rahimi and B. Recht. Random features for large-scale kernel machines.
In Advances in Neural Information Processing Systems, 2007.

[15] L. Ming, W. Shifeng, and Z. Changshui. On the sample complexity of random Fourier features for online learning: How many random Fourier features do we need? ACM Trans. Knowl. Discov. Data, 8(3):13:1–13:19, June 2014.

[16] J. Lu, S. C. H. Hoi, J. Wang, P. Zhao, and Z.-Y. Liu. Large scale online kernel learning. J. Mach. Learn. Res., 2015.

[17] Z. Wang and S. Vucetic. Online passive-aggressive algorithms on a budget. In AISTATS, volume 9, pages 908–915, 2010.

[18] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.

[19] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, pages 282–290, 2013.

[20] F. Orabona, J. Keshet, and B. Caputo. Bounded kernel-based online learning. J. Mach. Learn. Res., 10:2643–2666, December 2009.

[21] P. Zhao, J. Wang, P. Wu, R. Jin, and S. C. H. Hoi. Fast bounded online gradient descent algorithms for scalable kernel-based online learning. CoRR, 2012.