{"title": "SEBOOST - Boosting Stochastic Learning Using Subspace Optimization Techniques", "book": "Advances in Neural Information Processing Systems", "page_first": 1534, "page_last": 1542, "abstract": "We present SEBOOST, a technique for boosting the performance of existing stochastic optimization methods. SEBOOST applies a secondary optimization process in the subspace spanned by the last steps and descent directions. The method was inspired by the SESOP optimization method for large-scale problems, and has been adapted for the stochastic learning framework. It can be applied on top of any existing optimization method with no need to tweak the internal algorithm. We show that the method is able to boost the performance of different algorithms, and make them more robust to changes in their hyper-parameters. As the boosting steps of SEBOOST are applied between large sets of descent steps, the additional subspace optimization hardly increases the overall computational burden. We introduce two hyper-parameters that control the balance between the baseline method and the secondary optimization process. The method was evaluated on several deep learning tasks, demonstrating promising results.", "full_text": "SEBOOST \u2013 Boosting Stochastic Learning Using\n\nSubspace Optimization Techniques\n\nElad Richardson*1 Rom Herskovitz*1 Boris Ginsburg2 Michael Zibulevsky1\n\n1Technion, Israel Institute of Technology 2Nvidia INC\n\n{eladrich,mzib}@cs.technion.ac.il {fornoch,boris.ginsburg}@gmail.com\n\nAbstract\n\nWe present SEBOOST, a technique for boosting the performance of existing stochas-\ntic optimization methods. SEBOOST applies a secondary optimization process\nin the subspace spanned by the last steps and descent directions. The method\nwas inspired by the SESOP optimization method, and has been adapted for the\nstochastic learning. It can be applied on top of any existing optimization method\nwith no need to tweak the internal algorithm. 
We show that the method is able\nto boost the performance of different algorithms, and make them more robust to\nchanges in their hyper-parameters. As the boosting steps of SEBOOST are applied\nbetween large sets of descent steps, the additional subspace optimization hardly\nincreases the overall computational burden. We introduce hyper-parameters that\ncontrol the balance between the baseline method and the secondary optimization\nprocess. The method was evaluated on several deep learning tasks, demonstrating\nsigni\ufb01cant improvement in performance. Video presentation is given in [15]\n\n1\n\nIntroduction\n\nStochastic Gradient Descent (SGD) based optimization methods are widely used for many different\nlearning problems. Given some objective function that we want to optimize, a vanilla gradient\ndescent method would simply take some \ufb01xed step in the direction of the current gradient. In\nmany learning problems the objective, or loss, function is averaged over the set of given training\nexamples. In that scenario calculating the loss over the entire training set would be expensive, and is\ntherefore approximated on a small batch, resulting in a stochastic algorithm that requires relatively\nfew calculations per step. The simplicity and ef\ufb01ciency of SGD algorithms have made them a\nstandard choice for many learning tasks, and speci\ufb01cally for deep learning [9, 6, 5, 10] . Although\nthe vanilla SGD has no memory of previous steps, they are usually utilized in some way, for example\nusing momentum [13]. Alternatively, the AdaGrad method uses the previous gradients in order to\nnormalize each component in the new gradient adaptively [3], while the ADAM method uses them to\nestimate an adaptive moment [8]. In this work we utilize the knowledge of previous steps in spirit\nof the Sequential Subspace Optimization (SESOP) framework [11]. The nature of SESOP allows\nit to be easily merged with existing algorithms. 
Several such extensions were introduced over the\nyears to different \ufb01elds, such as PCD-SESOP and SSF-SESOP, showing state-of-the-art results in\ntheir matching \ufb01elds [4, 17, 16].\nThe core idea of our method is as follows. At every outer iteration we \ufb01rst perform several steps\nof a baseline stochastic optimization algorithm which are then summed up as an inner cumulative\nstochastic step. Afterwards, we minimize the objective function over the af\ufb01ne subspace spanned\nby the cumulative stochastic step, several previous outer steps and optional other directions. The\nsubspace optimization boosts the performance of the baseline algorithm, therefore our method is\ncalled the Sequential Subspace Optimization Boosting method (SEBOOST).\n\n*Equal contribution\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\f2 The algorithm\n\nAs our algorithm tries to \ufb01nd the balance between SGD and SESOP, we start by a brief review of the\noriginal algorithms, and then move to the SEBOOST algorithm.\n\n2.1 Vanilla SGD\n\nIn many different large-scale optimization problems, applying complex optimization methods is not\npractical. Thus, popular optimization methods for those problems are usually based on a stochastic\nestimation of the gradient. Let minx\u2208Rn f (x) be some minimization problem, and let g(x) be the\ngradient of f (x). The general stochastic approach applies the following optimization rule\n\nxk+1 = xk \u2212 \u03b7g\u2217(xk)\n\nwhere xi is the result of the ith iteration, \u03b7 is the learning rate and g\u2217(xk) is an approximation of\ng(xk) obtained using only a small subset (mini-batch) of the training data. These stochastic descent\nmethods have proved themselves in many different problems, speci\ufb01cally in the context of deep\nlearning algorithms, providing a combination of simplicity and speed. Notice that the vanilla SGD\nalgorithm has no memory of previous iterations. 
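To make the stochastic update rule above concrete, here is a minimal sketch (this is not the paper's Torch7 code; grad_batch, batches and the quadratic test objective are illustrative assumptions):

```python
import numpy as np

def sgd(x0, grad_batch, batches, lr=0.1):
    # grad_batch(x, b): gradient estimate g*(x) computed on mini-batch b
    x = x0.copy()
    for b in batches:
        x = x - lr * grad_batch(x, b)  # x_{k+1} = x_k - eta * g*(x_k)
    return x
```

Note that the loop keeps no state besides the current iterate, reflecting the memoryless nature of vanilla SGD discussed above.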
Different SGD-based optimization methods usually utilize the previous iterations in order to make a more\ninformed descent process.\n\n2.2 Vanilla SESOP\n\nThe SEquential Subspace OPtimization method [11, 16] is an optimization technique for large-scale\noptimization problems. The core idea of SESOP is to perform the optimization of the objective\nfunction in the subspace spanned by the current gradient direction and a set of directions obtained\nfrom the previous optimization steps. Following the notations in Section 2.1, a subspace structure for\nSESOP is usually de\ufb01ned based on the following directions:\n\n1. Gradients: Current gradient and [optionally] older ones {g(xi) : i = k, k \u2212 1, . . . , k \u2212 s1}\n2. Previous directions: {pi = xi \u2212 xi\u22121 : i = k, k \u2212 1, . . . , k \u2212 s2}\n\nIn the SESOP formulation the current gradient and the last step are mandatory, and any other set can\nbe used to enrich the subspace. From a theoretical point of view, one can enrich the subspace by two\nNemirovsky directions: a weighted average of the previous gradients and the direction to the starting\npoint. This provides the optimal worst-case complexity of the method (see also [12]). Denoting Pk as\nthe set of directions at iteration k, the SESOP algorithm solves the minimization problem\n\n\u03b1k = arg min\u03b1 f (xk + Pk\u03b1),  xk+1 = xk + Pk\u03b1k\n\nThus SESOP reduces the optimization problem to the subspace spanned by Pk at each iteration. This\nmeans that instead of solving an optimization problem in Rn, the dimensionality of the subspace is\ngoverned by the size of Pk and can be controlled.\n\n2.3 The SEBOOST algorithm\n\nAs explained in Section 2.1, when dealing with large-scale optimization problems, stochastic learning\nmethods are usually better \ufb01tted to the task than many more involved optimization methods. 
However,\nwhen applied correctly those methods can still be used to boost the optimization process and achieve\nfaster convergence rates. We propose to start with some SGD algorithm as a baseline, and then apply\na SESOP-like optimization method over it in an alternating manner. The subspace for the SESOP\nalgorithm arises from the descent directions of the baseline, utilizing the previous iterations.\nA description of the method is given in Algorithm 1. Note that the subset of the training data used for\nthe secondary optimization in step 7 isn\u2019t necessarily the same as that of the baseline in step 2, as will\nbe shown in Section 3. Also, note that in step 8 the last added direction is changed; that is done in\norder to incorporate the step performed by the secondary optimization into the subspace.\n\nAlgorithm 1 The SEBOOST algorithm\n1: for k = 1, . . . do\n2:   Perform \u2113 steps of the baseline stochastic optimization method to get from x^k_0 to x^k_\u2113\n3:   Add the direction of the cumulative step p = x^k_\u2113 \u2212 x^k_0 to the optimization subspace P\n4:   if the subspace dimension exceeded the limit, dim(P) > M, then\n5:     Remove the oldest direction from the optimization subspace P\n6:   end if\n7:   Perform optimization over the subspace P to get from x^k_\u2113 to x^{k+1}_0\n8:   Change the last added direction to p = x^{k+1}_0 \u2212 x^k_0\n9: end for\n\nIt is clear that SEBOOST offers an attractive balance between the baseline stochastic steps and the\nmore costly subspace optimizations. Firstly, as the number \u2113 of stochastic steps grows, the effect of\nsubspace optimization over the result subsides, where taking \u2113 \u2192 \u221e reduces the algorithm back to\nthe baseline method. Secondly, the dimensionality of the subspace optimization problem is governed\nby the size of P and can be reduced to as few parameters as desired. 
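As an illustration, the outer loop of Algorithm 1 can be sketched as follows. This is a simplified sketch, not the released implementation: plain gradient descent on the subspace coefficients stands in for the CG solver, and a generic gradient oracle stands in for mini-batch gradients.

```python
import numpy as np

def seboost(x0, grad, ell=10, M=5, outer=20, lr=0.05, sub_iters=20, sub_lr=0.1):
    # grad(x): gradient oracle (stochastic in practice); P: list of subspace directions
    x = x0.copy()
    P = []
    for k in range(outer):
        x_start = x.copy()                # x^k_0
        for _ in range(ell):              # ell baseline SGD steps
            x = x - lr * grad(x)
        P.append(x - x_start)             # cumulative stochastic step
        if len(P) > M:                    # keep at most M directions
            P.pop(0)
        Pm = np.stack(P, axis=1)          # directions as columns
        base = x                          # x^k_ell
        alpha = np.zeros(Pm.shape[1])
        for _ in range(sub_iters):        # secondary optimization over alpha
            alpha = alpha - sub_lr * (Pm.T @ grad(base + Pm @ alpha))
        x = base + Pm @ alpha             # x^{k+1}_0
        P[-1] = x - x_start               # replace last direction with the boosted step
    return x
```

The chain rule gives the subspace gradient as P^T g(base + P alpha), so the inner loop descends on the low-dimensional coefficients only; swapping the inner loop for a CG routine recovers the setup used in the experiments.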
Notice also that as SEBOOST is\nadded on top of a baseline stochastic optimization method, it does not require any internal changes to\nbe made to the original algorithm. Thus, it can be applied on top of any such method with minimal\nimplementation cost, while potentially boosting the base method.\n\n2.4 Enriching the subspace\n\nAlthough the core elements of our optimization subspace are the directions of the last M \u2212 1 external\nsteps and the new stochastic cumulative direction, many more elements can be added to enrich the\nsubspace.\n\nAnchor points As only the last (M \u2212 1) directions are saved in our subspace, the subspace has\nknowledge only of the recent history of the optimization process. The subspace might bene\ufb01t from\ndirections dependent on preceding directions as well. For example, one could think of the overall\ndescent achieved by the algorithm p = x^k_0 \u2212 x^0_0 as a possible direction, or the descent over the second\nhalf of the optimization process p = x^k_0 \u2212 x^{k/2}_0.\nWe formulate this idea by de\ufb01ning anchor points. Anchor points are locations chosen throughout\nthe descent process which we \ufb01x and update only rarely. For each anchor point ai the direction\np = x^k_0 \u2212 ai is added to the subspace. Different techniques can be chosen for setting and changing\nthe anchors. In our formulation each point is associated with a parameter ri which describes the\nnumber of boosting steps between each update of the point. After every ri steps the corresponding\npoint ai is initialized back to the current x. That way we can control the number of iterations before\nan anchor point becomes irrelevant and is initialized again. Algorithm 2 shows how the anchor points\ncan be added to Algorithm 1, by incorporating it before step 7.\n\nAlgorithm 2 Controlling anchors in SEBOOST\n1: for i = 1, . . . , #anchors do\n2:   if k % ri == 0 then\n3:     Change the anchor ai to x^k_\u2113\n4:   end if\n5:   Normalize the direction p = x^k_\u2113 \u2212 ai and add it to the subspace\n6: end for\n\nCurrent gradient As in the SESOP formulation, the gradient at the current point can be added to\nthe subspace.\n\nMomentum Similarly to the idea of momentum in SGD methods, one can save a weighted average\nof the previous updates and add it to the optimization subspace. Denoting the current momentum as\nmk and the last step as p = x^{k+1}_0 \u2212 x^k_0, the momentum is updated as mk+1 = \u00b5 \u00b7 mk + p, where \u00b5 is\nsome hyper-parameter, as in regular SGD momentum.\n\nFigure 1: Results for experiment 3.1. The baseline parameters were set as lrSGD = 0.5, lrNAG = 0.1,\nlrAdaGrad = 0.05, which provided good convergence. SEBOOST\u2019s parameters were \ufb01xed at M = 50\nand \u2113 = 100 with 50 function evaluations for the secondary optimization.\n\n3 Experiments\n\nFollowing the recent rise of interest in deep learning tasks we focus our evaluation on different neural\nnetwork problems. We start with a small, yet challenging, regression problem and then proceed\nto the known problems of the MNIST autoencoder and CIFAR-10 classi\ufb01er. For each problem we\ncompare the results of baseline stochastic methods with our boosted variants, showing that SEBOOST\ncan give signi\ufb01cant improvement over the base method. 
Note that the purpose of our work is not to\ndirectly compete with existing methods, but rather to show that SEBOOST can improve each learning\nmethod compared to its original variant, while preserving the original qualities of these algorithms.\nThe chosen baselines were SGD with momentum, Nesterov\u2019s Accelerated Gradient (NAG) [13] and\nAdaGrad [3]. The Conjugate Gradient (CG) [7] was used for the subspace optimization.\nOur algorithm was implemented and evaluated using the Torch7 framework [1], and is publicly\navailable 1. The main hyper-parameters that were altered during the experiments were:\n\n\u2022 lrmethod - The learning rate of a baseline method.\n\u2022 M - Maximal number of old directions.\n\u2022 \u2113 - Number of baseline steps between each subspace optimization.\n\nFor all experiments the weight decay was set at 0.0001 and the momentum was \ufb01xed at 0.9 for SGD\nand NAG. Unless stated otherwise, the number of function evaluations for CG was set at 20. The\nbaseline method used a mini-batch of size 100, while the subspace optimization was applied with\na mini-batch of size 1000. Note that subspace optimization is applied over a signi\ufb01cantly larger\nbatch. That is because while a \u201cbad\u201d stochastic step will be canceled by the next ones, a single\nsecondary step has a bigger effect on the overall result and therefore requires a better approximation of\nthe gradient. As the boosting step is applied only between large sets of baseline steps, the added\ncost does not hinder the algorithm.\nFor each experiment a different architecture will be de\ufb01ned. We will use the notation a \u2192L b to\ndenote a classic linear layer with a inputs and b outputs followed by a non-linear Tanh function.\nNotice that when presenting our results we show two different graphs. The right one always shows\nthe error as a function of the number of passes of the baseline algorithms over the data (i.e. 
epochs),\nwhile the left one shows the error as a function of the actual processor time, taking into account the\nadditional work required by the boosted algorithms.\n\n3.1 Simple regression\n\nWe will start by evaluating our method on a small regression problem. The dataset in question is\na set of 20,000 values simulating some continuous function f : R6 \u2192 R. The dataset was divided\ninto 18,000 training examples and 2,000 test examples. The problem was solved using a tiny neural\nnetwork with the architecture 6 \u2192L 12 \u2192L 8 \u2192L 4 \u2192L 1. Although the network size is very small,\nthe resulting optimization problem remains challenging and gives a clear indication of SEBOOST\u2019s\nbehavior. Figure 1 shows the optimization process for the different methods. In all examples the\nboosted variant converged faster. Note that the different variants of SEBOOST behave differently,\ngoverned by the corresponding baseline.\n\n1https://github.com/eladrich/seboost\n\n3.2 MNIST autoencoder\n\nOne of the classic neural network formulations is that of an autoencoder, a network that tries to learn\nan ef\ufb01cient representation for a given set of data. An autoencoder is usually composed of two parts:\nthe encoder, which takes the input and produces the compact representation, and the decoder, which\ntakes the representation and tries to reconstruct the original input. In our experiment the MNIST\ndataset was used, with 60,000 training images of size 28 \u00d7 28 and 10,000 test images. 
The encoder\nwas de\ufb01ned as three layer network with an architecture of form 784 \u2192L 200 \u2192L 100 \u2192L 64, with\na matching decoder 64 \u2192L 100 \u2192L 200 \u2192L 784.\nFigure 3 shows the optimization process for the autoencoder problem. A similar trend can be seen\nto that of experiment 3.1, SEBOOST is able to signi\ufb01cantly improve SGD and NAG and shows\nsome improvement over AdaGrad, although not as noticeable. A nice byproduct of working with\nan autoencoding problem is that one can visualize the quality of the reconstructions as a function of\nthe iterations. Figure 2 shows the change in reconstructions quality for SGD and SESOP-SGD, and\nshows that the boosting achieved is signi\ufb01cant in terms on the actual results.\n\nOriginal\n\n#10\n\n#30\n\n#100\n\n#200 Original\n\n#10\n\n#30\n\n#100\n\n#200\n\nFigure 2: Reconstruction Results. The \ufb01rst row shows results of the SGD algorithm, while the second\nrow shows results of SESOP-SGD. The last row gives the number of passes over the data.\n\n3.3 CIFAR-10 classi\ufb01er\n\nFor classi\ufb01cation purposes a standard benchmark is the CIFAR-10 dataset. The dataset is composed\nof 60,000 images of size 32 \u00d7 32 from 10 different classes, where each class has 6,000 different\nimages. 50,000 images are used for training and 10,000 for testing. In order to check SEBOOST\u2019s\nability to deal with large and modern networks the ResNet [6] architecture, winner of the ILSVRC\n2015 classi\ufb01cation task, is used.\n\nFigure 3: Results for experiment 3.2. The baseline parameters was set at lrSGD = 0.1, lrN AG = 0.01,\nlrAdaGrad = 0.01. 
SEBOOST\u2019s parameters were \ufb01xed at M = 10 and \u2113 = 200.\n\nFigure 4: Results for experiment 3.3. All baselines were set with lr = 0.1 and a mini-batch of size\n128. SEBOOST\u2019s parameters were \ufb01xed at M = 10 and \u2113 = 391, with a mini-batch of size 1024.\n\nFigure 4 shows the optimization process and the achieved accuracy for a ResNet of depth 32. Note that\nwe did not manually tweak the learning rate as was done in the original paper. While AdaGrad is not\nboosted for this experiment, SGD and NAG achieve signi\ufb01cant boosting and reach a better minimum.\nThe boosting step was applied only once every epoch; applying more frequent boosting steps resulted\nin a less stable optimization and higher minima, while applying infrequent steps also led to higher\nminima. Experiment 3.4 shows similar results for MNIST and discusses them.\n\n3.4 Understanding the hyper-parameters\n\nSEBOOST introduces two hyper-parameters: \u2113, the number of baseline steps between each subspace\noptimization, and M, the number of old directions to use. The purpose of the following two experiments\nis to measure the effect of those parameters on the achieved result and to give some intuition as to\ntheir meaning. All experiments are based on the MNIST autoencoder problem de\ufb01ned in Section 3.2.\nFirst, let us consider the parameter \u2113, which controls the balance between the baseline SGD algorithm\nand the more involved optimization process. Taking small values of \u2113 results in more steps of the\nsecondary optimization process; however, each direction in the subspace is then composed of fewer\nsteps from the stochastic algorithm, making it less stable. 
Furthermore, recalling that our secondary\noptimization is more costly than regular optimization steps, applying it too often would hinder the\nalgorithm\u2019s performance. On the other hand, taking large values of \u2113 weakens the effect of SEBOOST\nover the baseline algorithm.\nFigure 5a shows how \u2113 affects the optimization process. One can see that applying the subspace\noptimization too frequently increases the algorithm\u2019s runtime and reaches a higher minimum than\nthe other variants, as expected. Although taking a large value of \u2113 reaches a better minimum, taking\na value which is too large slows the algorithm. We can see that for this experiment taking \u2113 = 200\ncorrectly balances the trade-offs.\n\n(a)\n\n(b)\n\nFigure 5: Experiment 3.4, analyzing different changes in SEBOOST\u2019s hyper-parameters\n\n(a)\n\n(b)\n\nFigure 6: Experiment 3.5, analyzing different changes in SEBOOST\u2019s subspace\n\nLet us now consider the effect of M, which governs the size of the subspace in which the secondary\noptimization is applied. Although taking large values of M allows us to hold more directions and\napply the optimization in a larger subspace, it also makes the optimization process more involved.\nFigure 5b shows how M affects the optimization process. Interestingly, the lower M is, the faster the\nalgorithm starts descending. 
However, larger M values tend to reach better minima. For M = 20 the\nalgorithm reaches the same minimum as M = 50, but starts the descent process faster, making it a\ngood choice for this experiment.\nTo conclude, the introduced hyper-parameters M and \u2113 affect the overall boosting effect achieved by\nSEBOOST. Both parameters incorporate different trade-offs of the optimization problem and should\nbe considered when using the algorithm. Our own experiments show that a good initialization would\nbe to set \u2113 so the subspace optimization runs about once or twice per epoch, and to set M between\n10 and 20.\n\n3.5 Investigating the subspace\n\nOne of the key components of SEBOOST is the structure of the subspace in which the optimization\nis applied. The purpose of the following two experiments is to see how changes in the baseline\nalgorithm, or the addition of more directions, affect the algorithm. All experiments are based on the\nMNIST autoencoder problem de\ufb01ned in Section 3.2.\nIn the basic formulation of SEBOOST the subspace is composed only from the directions of the\nbaseline algorithm. In Section 3.2 we saw how choosing different baselines affects the algorithm.\nAnother experiment of interest is to see how our algorithm is in\ufb02uenced by changes in the hyper-\nparameters of the baseline algorithm. Figure 6a shows the effect of the learning rate over the baseline\nalgorithms and their boosted variants. It can be seen that the change in the original baseline affects our\nalgorithm; however, the impact is noticeably smaller, showing that the algorithm has some robustness\nto the original learning rate.\nIn Section 2.4 a set of additional directions which can be added to the subspace were de\ufb01ned; these\ndirections can possibly enrich the subspace and improve the optimization process. Figure 6b shows the\nin\ufb02uence of those directions on the overall result. 
In SEBOOST-anchors a set of anchor points was\nadded with r values of 500, 250, 100, 50 and 20. In SEBOOST-momentum a momentum vector\nwith \u00b5 = 0.9 was used. It can be seen that using the proposed anchor directions can signi\ufb01cantly\nboost the algorithm. The momentum direction is less useful, giving a small boost on its own and\nactually slightly hindering the performance when used in conjunction with the anchor directions.\n\n4 Conclusion\n\nIn this paper we presented SEBOOST, a technique for boosting stochastic learning algorithms via a\nsecondary optimization process. The secondary optimization is applied in the subspace spanned by\nthe preceding descent steps, which can be further extended with additional directions. We evaluated\nSEBOOST on different deep learning tasks, showing the achieved results of our methods compared to\ntheir original baselines. We believe that the \ufb02exibility of SEBOOST could make it useful for different\nlearning tasks. One can easily change the frequency of the secondary optimization step, ranging from\nfrequent and more risky steps to the more stable one step per epoch. Changing the baseline algorithm\nand the structure of the subspace allows us to further alter SEBOOST\u2019s behavior.\nAlthough this is not the focus of our work, an interesting research direction for SEBOOST is that of\nparallel computing. Similarly to [2, 14], one can look at a framework composed of a single master\nand a set of workers, where each worker optimizes a local model and the master saves a global set of\nparameters which is based on the workers. 
Inspired by SEBOOST, one can take the descent directions\nfrom each of the workers and apply a subspace optimization in the spanned subspace, allowing the\nmaster to take a more ef\ufb01cient step based on information from each of its workers.\nAnother interesting direction for future work is the investigation of pruning techniques. In our work,\nwhen the subspace is fully occupied the oldest direction is simply removed. One might consider more\nadvanced pruning techniques, such as eliminating the direction which contributed the least to the\nsecondary optimization step, or even randomly removing one of the subspace directions. A good\npruning technique can potentially have a signi\ufb01cant effect on the overall result. These two ideas will\nbe further researched in future work. Overall, we believe SEBOOST provides a promising balance\nbetween popular stochastic descent methods and more involved optimization techniques.\n\nAcknowledgements\n\nThe research leading to these results has received funding from the European Research Council under\nthe European Union\u2019s Seventh Framework Program, ERC Grant agreement no. 320649, and was supported\nby the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).\n\nReferences\n\n[1] Ronan Collobert, Koray Kavukcuoglu, and Cl\u00e9ment Farabet. Torch7: A matlab-like environment\nfor machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.\n\n[2] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew\nSenior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In\nAdvances in Neural Information Processing Systems, pages 1223\u20131231, 2012.\n\n[3] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning\nand stochastic optimization. The Journal of Machine Learning Research, 12:2121\u20132159, 2011.\n\n[4] Michael Elad, Boaz Matalon, and Michael Zibulevsky. 
Coordinate and subspace optimization\nmethods for linear least squares with non-quadratic regularization. Applied and Computational\nHarmonic Analysis, 23(3):346\u2013367, 2007.\n\n[5] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for\naccurate object detection and semantic segmentation. In Proceedings of the IEEE conference on\ncomputer vision and pattern recognition, pages 580\u2013587, 2014.\n\n[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\n\nrecognition. arXiv preprint arXiv:1512.03385, 2015.\n\n[7] Magnus Rudolph Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving\n\nlinear systems, volume 49. NBS, 1952.\n\n[8] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\n\narXiv:1412.6980, 2014.\n\n[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classi\ufb01cation with deep\nconvolutional neural networks. In Advances in neural information processing systems, pages\n1097\u20131105, 2012.\n\n[10] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for seman-\ntic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, pages 3431\u20133440, 2015.\n\n[11] Guy Narkiss and Michael Zibulevsky. Sequential subspace optimization method for large-scale\n\nunconstrained problems. Technion-IIT, Department of Electrical Engineering, 2005.\n\n8\n\n\f[12] Arkadi Nemirovski. Orth-method for smooth convex optimization. Izvestia AN SSSR, Transl.:\n\nEng. Cybern. Soviet J. Comput. Syst. Sci, 2:937\u2013947, 1982.\n\n[13] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initial-\nization and momentum in deep learning. In Proceedings of the 30th international conference on\nmachine learning (ICML-13), pages 1139\u20131147, 2013.\n\n[14] Sixin Zhang, Anna E Choromanska, and Yann LeCun. 
Deep learning with elastic averaging\n\nsgd. In Advances in Neural Information Processing Systems, pages 685\u2013693, 2015.\n\n[15] Michael Zibulevsky. SESOP - Sequential Subspace Optimization framework. Video presenta-\ntions, https://www.youtube.com/playlist?list=PLH39kM3nuavf2Hkr-gBAMBX7EPMB2kUqw.\n\n[16] Michael Zibulevsky. Speeding-up convergence via sequential subspace optimization: Current\n\nstate and future directions. arXiv preprint arXiv:1401.0159, 2013.\n\n[17] Michael Zibulevsky and Michael Elad. L1-l2 optimization in signal and image processing.\n\nSignal Processing Magazine, IEEE, 27(3):76\u201388, 2010.\n\n9\n\n\f", "award": [], "sourceid": 836, "authors": [{"given_name": "Elad", "family_name": "Richardson", "institution": "Technion"}, {"given_name": "Rom", "family_name": "Herskovitz", "institution": "Technion - Israel Institute of Technology"}, {"given_name": "Boris", "family_name": "Ginsburg", "institution": "Nvidia"}, {"given_name": "Michael", "family_name": "Zibulevsky", "institution": "Technion - Israel Institute of Technology"}]}