{"title": "Learning Supervised PageRank with Gradient-Based and Gradient-Free Optimization Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 4914, "page_last": 4922, "abstract": "In this paper, we consider a non-convex loss-minimization problem of learning Supervised PageRank models, which can account for features of nodes and edges. We propose gradient-based and random gradient-free methods to solve this problem. Our algorithms are based on the concept of an inexact oracle and unlike the state-of-the-art gradient-based method we manage to provide theoretically the convergence rate guarantees for both of them. Finally, we compare the performance of the proposed optimization methods with the state of the art applied to a ranking task.", "full_text": "Learning Supervised PageRank with Gradient-Based\n\nand Gradient-Free Optimization Methods\n\nLev Bogolubsky1,2, Gleb Gusev1,5, Andrei Raigorodskii5,2,1,8, Aleksey Tikhonov1, Maksim Zhukovskii1,5\n\n{bogolubsky, gleb57, raigorodsky, altsoph, zhukmax}@yandex-team.ru\n\nYandex1, Moscow State University2, Buryat State University8\n\nPavel Dvurechensky3,4, Alexander Gasnikov4,5\n\nWeierstrass Institute3, Institute for Information Transmission Problems RAS4,\n\nMoscow Institute of Physics and Technology5\n\npavel.dvurechensky@wias-berlin.de, gasnikov@yandex.ru\n\nYurii Nesterov6,7\n\nCenter for Operations Research and Econometrics6,\n\nHigher School of Economics7\n\nyurii.nesterov@uclouvain.be\n\nAbstract\n\nIn this paper, we consider a non-convex loss-minimization problem of learning\nSupervised PageRank models, which can account for features of nodes and edges.\nWe propose gradient-based and random gradient-free methods to solve this problem.\nOur algorithms are based on the concept of an inexact oracle and unlike the state-of-\nthe-art gradient-based method we manage to provide theoretically the convergence\nrate guarantees for both of them. 
Finally, we compare the performance of the proposed optimization methods with the state of the art applied to a ranking task.

1 INTRODUCTION

The most acknowledged methods of measuring the importance of nodes in graphs are based on random walk models. In particular, PageRank [18], HITS [11], and their variants [8, 9, 19] are originally based on a discrete-time Markov random walk on a link graph. Despite the undeniable advantages of PageRank and its mentioned modifications, these algorithms miss important aspects of the graph that are not described by its structure. In contrast, a number of approaches allow accounting for different properties of nodes and edges by encoding them in restart and transition probabilities (see [3, 4, 6, 10, 12, 20, 21]). These properties may include, e.g., statistics of users' interactions with the nodes (in web graphs [12] or graphs of social networks [2]), types of edges (such as URL redirecting in web graphs [20]), or histories of changes of nodes and edges [22]. In the general ranking framework called Supervised PageRank [21], the weights of nodes and edges in a graph are linear combinations of their features with coefficients as the model parameters. The existing optimization method [21] for learning these parameters and the optimization methods proposed in this paper have two levels. On the lower level, the following problem is solved: estimate the value of the loss function (in the case of a zero-order oracle) and its derivatives (in the case of a first-order oracle) for a given parameter vector. On the upper level, the estimates obtained on the lower level (which we also call inexact oracle information) are used to tune the parameters by an iterative algorithm.
Following [6], the authors of Supervised PageRank consider a non-convex loss-minimization problem for learning the parameters and solve it by a two-level gradient-based method.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

On the lower level of this algorithm, an estimate of the stationary distribution of the considered Markov random walk is obtained by the classical power method, and estimates of the derivatives with respect to the parameters of the random walk are obtained by the power method introduced in [23, 24]. On the upper level, the obtained gradient of the stationary distribution is exploited by a gradient descent algorithm. As both power methods give imprecise values of the stationary distribution and its derivatives, there was no proof of convergence of the state-of-the-art gradient-based method to a stationary point.

The considered non-convex loss-minimization problem [21] cannot be solved by existing optimization methods such as [16] and [7] due to the presence of constraints on the parameter vector and the impossibility of calculating the exact value of the loss function. Moreover, standard global optimization methods cannot be applied, because they require unbiased estimates of the loss function.

In our paper, we propose two two-level methods to solve the problem of [21]. On the lower level of these methods, we use the linearly convergent method [17] to calculate an approximation to the stationary distribution of the Markov random walk. We show that this method allows approximating the value of the loss function to any given accuracy and has the lowest proved complexity bound among the methods surveyed in [5]. We develop a gradient method for general constrained non-convex optimization problems with an inexact oracle and estimate its convergence rate to a stationary point of the problem. We exploit this gradient method on the upper level of the two-level algorithm for learning Supervised PageRank.
Our contribution to the gradient-free methods framework consists in adapting the approach of [16] to the case of constrained optimization problems in which the value of the function is calculated with some known accuracy. We prove a convergence theorem for this method and exploit it on the upper level of the second two-level algorithm.

Another contribution consists in investigating, both for the gradient and the gradient-free method, the trade-off between the accuracy of the lower-level algorithm, which is controlled by the number of iterations of the method in [17] and its generalization (for derivative estimation), and the computational complexity of the two-level algorithm as a whole. Finally, we estimate the complexity of the whole two-level algorithms for solving the loss-minimization problem with a given accuracy.

In the experiments, we apply our algorithms to learning Supervised PageRank on a real ranking task. Summing up, both two-level methods, unlike the state-of-the-art method [21], have theoretical guarantees on the convergence rate and outperform it in ranking quality in the experiments. The main advantages of the first, gradient-based algorithm are that its convergence guarantees do not require convexity and that it has fewer input parameters than the gradient-free one. The main advantage of the second, gradient-free algorithm is that it avoids calculating the derivative of each element of a large matrix.

2 MODEL DESCRIPTION

We consider the following random walk on a directed graph Γ = (V, E) introduced in [21]. Assume that each node i ∈ V and each edge i → j ∈ E is represented by a vector of features V_i ∈ R^{m1}_+ and a vector of features E_{ij} ∈ R^{m2}_+ respectively. A surfer starts from a random page v0 of a seed set U ⊂ V. The restart probability that v0 = i, for i ∈ U, equals

[π0]_i = ⟨φ1, V_i⟩ / Σ_{l∈U} ⟨φ1, V_l⟩,    (2.1)

and [π0]_i = 0 for i ∈ V \ U, where φ1 ∈ R^{m1} is a parameter, which conducts the random walk. We assume that Σ_{l∈U} ⟨φ1, V_l⟩ is non-zero.

At each step, the surfer makes a restart with probability α ∈ (0, 1) (originally [18], α = 0.15) or traverses an outgoing edge (makes a transition) with probability 1 − α. In the former case, the surfer chooses a vertex according to the distribution π0. In the latter case, the transition probability of traversing an edge i → j ∈ E is

[P]_{i,j} = ⟨φ2, E_{ij}⟩ / Σ_{l: i→l} ⟨φ2, E_{il}⟩,    (2.2)

where φ2 ∈ R^{m2} is a parameter and the current position i has non-zero outdegree, and [P(φ)]_{i,j} = [π0(φ)]_j for all j ∈ V if the outdegree of i is zero (thus the surfer always makes a restart in this case). We assume that Σ_{l: i→l} ⟨φ2, E_{il}⟩ is non-zero for all i with non-zero outdegree.

By Equations 2.1 and 2.2, the total probability of choosing vertex j ∈ V conditioned on the surfer being at vertex i equals α[π0(φ)]_j + (1 − α)[P(φ)]_{i,j}, where φ = (φ1, φ2)^T and we use π0(φ) and P(φ) to express the dependence of π0, P on the parameters. The stationary distribution π(φ) ∈ R^p of the described Markov process is a solution of the system

π = απ0(φ) + (1 − α)P^T(φ)π.    (2.3)

In this paper, we learn an algorithm which ranks nodes i according to the scores [π(φ)]_i. Let Q be a set of queries, with a set of nodes V_q ⊂ V associated to each query q. For example, vertices in V_q may represent web pages visited by users after submitting query q. For each q ∈ Q, some nodes of V_q are manually judged with relevance labels 1, . . . , ℓ.
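As an illustration, the construction above can be sketched in a few lines of plain Python. All numbers below are made up (a toy three-node graph with hypothetical features, and the seed set taken to be all of V); the stationary distribution is obtained by fixed-point iteration of (2.3), which converges since the iteration contracts with factor 1 − α.

```python
def model(V, E, phi1, phi2):
    # Restart distribution (2.1): [pi0]_i proportional to <phi1, V_i>; here U = V.
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    p = len(V)
    w = [dot(phi1, V[i]) for i in range(p)]
    total = sum(w)
    pi0 = [wi / total for wi in w]
    # Transition matrix (2.2): row-normalized <phi2, E_ij>;
    # a zero-outdegree row is replaced by pi0 (forced restart).
    P = [[0.0] * p for _ in range(p)]
    for (i, j), e in E.items():
        P[i][j] = dot(phi2, e)
    for i in range(p):
        s = sum(P[i])
        P[i] = [x / s for x in P[i]] if s > 0 else list(pi0)
    return pi0, P

def stationary(pi0, P, alpha=0.15, iters=200):
    # Fixed-point iteration of (2.3): pi <- alpha*pi0 + (1-alpha)*P^T pi.
    p = len(pi0)
    pi = list(pi0)
    for _ in range(iters):
        pi = [alpha * pi0[j] + (1 - alpha) * sum(P[i][j] * pi[i] for i in range(p))
              for j in range(p)]
    return pi

# Toy graph with hypothetical features: 3 nodes, m1 = m2 = 2.
V = [[1.0, 2.0], [2.0, 1.0], [1.0, 1.0]]
E = {(0, 1): [1.0, 0.5], (1, 2): [0.5, 1.0], (2, 0): [1.0, 1.0]}
pi0, P = model(V, E, [1.0, 1.0], [1.0, 1.0])
pi = stationary(pi0, P)
```

Since every row of P sums to one, each iterate remains a probability distribution, and the limit solves (2.3) exactly.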
Our goal is to learn the parameter vector φ of a ranking algorithm πq = πq(φ) which minimizes the discrepancy of its ranking scores [πq]_i, i ∈ Vq, from the assigned labels. We consider the square loss function [12, 21, 22]

f(φ) = (1/|Q|) Σ_{q=1}^{|Q|} ∥(A_q π_q(φ))_+∥₂².    (2.4)

Each row of the matrix A_q ∈ R^{r_q × p_q} corresponds to a pair of pages i1, i2 ∈ V_q such that the label of i1 is strictly greater than the label of i2 (we denote by r_q the number of all such pairs from V_q and set p_q := |V_q|). The i1-th element of this row equals −1, the i2-th element equals 1, and all other elements equal 0. The vector x_+ has components [x_+]_i = max{x_i, 0}.

To make the ranking scores (2.3) query-dependent, we assume that π is defined on a query-dependent graph Γ_q = (V_q, E_q) with query-dependent feature vectors V^q_i, i ∈ V_q, and E^q_{ij}, i → j ∈ E_q. For example, these features may reflect different aspects of query-page relevance. For a given q ∈ Q, we consider all the objects related to the graph Γ_q introduced above: U_q := U, π^0_q := π0, P_q := P, π_q := π. In this way, the ranking scores π_q depend on the query via the query-dependent features, but the parameters α and φ of the model are not query-dependent. In what follows, we use the following notations throughout the paper: n_q := |U_q|, m = m1 + m2, r = max_{q∈Q} r_q, p = max_{q∈Q} p_q, n = max_{q∈Q} n_q, s = max_{q∈Q} s_q, where s_q = max_{i∈V_q} |{j : i → j ∈ E_q}|.

In order to guarantee that the probabilities in (2.1) and (2.2) are correctly defined, we need to appropriately choose a set Φ of possible values of the parameters φ. We choose some φ̂ and R > 0 such that Φ = {φ ∈ R^m : ∥φ − φ̂∥₂ ≤ R} lies in the set R^m_++ of vectors with positive components.¹ In this paper, we solve the following loss-minimization problem:

min_{φ∈Φ} f(φ),  Φ = {φ ∈ R^m : ∥φ − φ̂∥₂ ≤ R}.    (2.5)

3 NUMERICAL CALCULATION OF f(φ) AND ∇f(φ)

Our goal is to provide methods for solving Problem 2.5 with guarantees on the rate of convergence and complexity bounds. The calculation of the values of f(φ) and its gradient ∇f(φ) is problematic, since it requires calculating those for |Q| vectors π_q(φ) defined by Equation 2.3. While the exact values are impossible to obtain in general, existing methods provide estimates of π_q(φ) and its derivatives dπ_q(φ)/dφ^T in an iterative way with a trade-off between time and accuracy. To be able to guarantee convergence of our optimization algorithm in this inexact oracle setting, we consider numerical methods that calculate approximations of π_q(φ) and its derivatives with any required accuracy.
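To make the loss (2.4) concrete, the following sketch evaluates it for toy data. The scores and labels are hypothetical; the pair matrix A_q is applied implicitly: every pair (i1, i2) with label(i1) > label(i2) contributes max(0, [π]_{i2} − [π]_{i1})², i.e., a penalty whenever a less relevant page is scored above a more relevant one.

```python
def query_loss(pi, labels):
    # One term of the loss (2.4) for a single query: each row of A_q encodes
    # a pair (i1, i2) with label(i1) > label(i2); the row of (A_q pi) equals
    # pi[i2] - pi[i1], and only its positive part is penalized.
    loss = 0.0
    for i1, l1 in enumerate(labels):
        for i2, l2 in enumerate(labels):
            if l1 > l2:
                loss += max(0.0, pi[i2] - pi[i1]) ** 2
    return loss

def f(pis, all_labels):
    # The full loss (2.4): the average of the per-query losses over |Q| queries.
    return sum(query_loss(pi, lab) for pi, lab in zip(pis, all_labels)) / len(pis)

# Hypothetical ranking scores and labels for two queries: the first query is
# ranked perfectly, the second has two misordered pairs.
pis = [[0.5, 0.3, 0.2], [0.25, 0.25, 0.5]]
all_labels = [[5, 3, 1], [5, 3, 1]]
loss = f(pis, all_labels)
```

A perfectly ordered query contributes zero, so only the second query is penalized here.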
We have analysed the state-of-the-art methods summarized in the review [5] and the power method used in [18, 2, 21], and have found that the method of [17] is the most suitable. It constructs a sequence π_k and outputs π̃_q(φ, N) by the following rule (the integer N > 0 is a parameter):

π_0 = π^0_q(φ),  π_{k+1} = P^T_q(φ) π_k,  π̃_q(φ, N) = (α / (1 − (1 − α)^{N+1})) Σ_{k=0}^{N} (1 − α)^k π_k.    (3.1)

¹ As the probabilities [π^0_q(φ)]_i, i ∈ V_q, and [P_q(φ)]_{ĩ,i}, ĩ → i ∈ E_q, are scale-invariant (π^0_q(λφ) = π^0_q(φ), P_q(λφ) = P_q(φ)), in our experiments we consider the set Φ = {φ ∈ R^m : ∥φ − e_m∥₂ ≤ 0.99}, where e_m ∈ R^m is the vector of all ones, which has a large intersection with the simplex {φ ∈ R^m_++ : ∥φ∥₁ = 1}.

Lemma 1. Assume that, for some δ1 > 0, Method 3.1 with N = ⌈(1/α) ln(8r/δ1)⌉ − 1 is used to calculate the vector π̃_q(φ, N) for every q ∈ Q. Then f̃(φ, δ1) = (1/|Q|) Σ_{q=1}^{|Q|} ∥(A_q π̃_q(φ, N))_+∥₂² satisfies |f̃(φ, δ1) − f(φ)| ≤ δ1. Moreover, the calculation of f̃(φ, δ1) requires not more than |Q|(3mps + 3psN + 6r) a.o.

The proof of Lemma 1 is in the Supplementary Materials.

Let p_i(φ) be the i-th column of the matrix P^T_q(φ). Our generalization of the method [17] for the calculation of dπ_q(φ)/dφ^T for any q ∈ Q is the following. Choose some non-negative integer N1 and calculate π̃_q(φ, N1) using (3.1). Choose some N2 ≥ 0, calculate Π_k, k = 0, ..., N2, and Π̃_q(φ, N2) by

Π_0 = α dπ^0_q(φ)/dφ^T + (1 − α) Σ_{i=1}^{p_q} (dp_i(φ)/dφ^T) [π̃_q(φ, N1)]_i,  Π_{k+1} = P^T_q(φ) Π_k,    (3.2)

Π̃_q(φ, N2) = (1 / (1 − (1 − α)^{N2+1})) Σ_{k=0}^{N2} (1 − α)^k Π_k.    (3.3)

In what follows, we use the following norm on the space of matrices A ∈ R^{n1×n2}: ∥A∥₁ = max_{j=1,...,n2} Σ_{i=1}^{n1} |a_{ij}|.

Lemma 2. Let β1 be some explicitly computable constant (see Supplementary Materials). Assume that Method 3.1 with N1 = ⌈(1/α) ln(24β1r/(αδ2))⌉ − 1 is used for every q ∈ Q to calculate the vector π̃_q(φ, N1), and Method 3.2, 3.3 with N2 = ⌈(1/α) ln(8β1r/(αδ2))⌉ − 1 is used for every q ∈ Q to calculate the matrix Π̃_q(φ, N2) (3.3). Then the vector g̃(φ, δ2) = (2/|Q|) Σ_{q=1}^{|Q|} (Π̃_q(φ, N2))^T A^T_q (A_q π̃_q(φ, N1))_+ satisfies ∥g̃(φ, δ2) − ∇f(φ)∥_∞ ≤ δ2. Moreover, the calculation of g̃(φ, δ2) requires not more than |Q|(10mps + 3psN1 + 3mpsN2 + 7r) a.o.

The proof of Lemma 2 can be found in the Supplementary Materials.

4 RANDOM GRADIENT-FREE OPTIMIZATION METHODS

In this section, we first describe the general framework of random gradient-free methods with inexact oracle and then apply it to Problem 2.5.
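The lower-level Method (3.1) of the previous section can be sketched as follows. The restart distribution and transition matrix below are hypothetical stand-ins (in the paper they come from the parameterization (2.1)-(2.2)); the output is a normalized truncation of the Neumann series α Σ_k (1−α)^k (P^T)^k π0 that solves (2.3), so the approximation error decays geometrically in N.

```python
def nn_approx(pi0, P, alpha, N):
    # Method (3.1): iterate pi_{k+1} = P^T pi_k and return the weighted average
    # pi~ = alpha / (1 - (1-alpha)^(N+1)) * sum_{k=0}^{N} (1-alpha)^k pi_k.
    p = len(pi0)
    pi_k = list(pi0)
    acc = [0.0] * p
    weight = 1.0  # holds (1 - alpha)^k
    for _ in range(N + 1):
        for j in range(p):
            acc[j] += weight * pi_k[j]
        weight *= 1.0 - alpha
        pi_k = [sum(P[i][j] * pi_k[i] for i in range(p)) for j in range(p)]
    c = alpha / (1.0 - (1.0 - alpha) ** (N + 1))
    return [c * a for a in acc]

# Hypothetical restart distribution and transition matrix for a 3-node cycle.
alpha = 0.15
pi0 = [0.5, 0.3, 0.2]
P = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
pi_tilde = nn_approx(pi0, P, alpha, N=100)
```

Because each π_k is a probability distribution and the weights are renormalized, π̃ sums to one for every N, which is what makes the per-accuracy choice of N in Lemma 1 possible.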
Lemma 1 allows us to control the accuracy of the inexact zero-order oracle and hence to apply random gradient-free methods with inexact oracle.

4.1 GENERAL FRAMEWORK

Below we extend the framework of random gradient-free methods [1, 16, 7] to the situation of a uniformly bounded error of unknown nature in the value of the objective function in a general optimization problem. Unlike [16], we consider a constrained optimization problem and a randomization on a Euclidean sphere, which seems to give better large-deviation bounds and does not need the assumption that the objective function can be calculated at any point of R^m.

Let E be an m-dimensional vector space and E* be its dual. In this subsection, we consider a general function f(·) : E → R and denote its argument by x or y to avoid confusion with other sections. We denote the value of a linear function g ∈ E* at x ∈ E by ⟨g, x⟩. We choose some norm ∥·∥ in E and say that f ∈ C^{1,1}_L(∥·∥) iff |f(x) − f(y) − ⟨∇f(y), x − y⟩| ≤ (L/2)∥x − y∥² for all x, y ∈ E. The problem of our interest is to find min_{x∈X} f(x), where f ∈ C^{1,1}_L(∥·∥), X is a closed convex set, and there exists a number D ∈ (0, +∞) such that diam X := max_{x,y∈X} ∥x − y∥ ≤ D. Also we assume that the inexact zero-order oracle for f(x) returns a value f̃(x, δ) = f(x) + δ̃(x), where δ̃(x) is the error satisfying, for some known δ > 0, |δ̃(x)| ≤ δ for all x ∈ X. Let x* ∈ arg min_{x∈X} f(x) and denote f* = min_{x∈X} f(x).

Unlike [16], we define the biased gradient-free oracle g_τ(x, δ) = (m/τ)(f̃(x + τξ, δ) − f̃(x, δ))ξ, where ξ is a random vector uniformly distributed over the unit sphere S = {t ∈ R^m : ∥t∥₂ = 1} and τ is a smoothing parameter.

Algorithm 1 Gradient-type method

Input: Point x0 ∈ X, stepsize h > 0, number of steps M.
Set k = 0.
repeat
  Generate ξ_k and calculate the corresponding g_τ(x_k, δ).
  Calculate x_{k+1} = Π_X(x_k − h g_τ(x_k, δ)) (Π_X(·) is the Euclidean projection onto the set X).
  Set k = k + 1.
until k > M
Output: The point y_M = arg min_x {f(x) : x ∈ {x0, . . . , x_M}}.

Theorem 1. Let f ∈ C^{1,1}_L(∥·∥₂) and convex. Assume that x* ∈ int X, and the sequence x_k is generated by Algorithm 1 with h = 1/(8mL). Then for any M ≥ 0, we have

E_{Ξ_{M−1}} f(y_M) − f* ≤ 8mLD²/(M + 1) + τ²L(m + 8)/8 + δmD/(4τ) + δ²m/(Lτ²).

Here Ξ_k = (ξ0, . . . , ξ_k) is the history of realizations of the vector ξ.

The full proof of the theorem is in the Supplementary Materials.

4.2 SOLVING THE LEARNING PROBLEM

Now, we apply the results of Subsection 4.1 to solve Problem 2.5. Note that the presence of constraints and oracle inexactness do not allow us to directly apply the results of [16]. We assume that there is a local minimum φ*, and that Φ is a small vicinity of φ* in which f(φ) (2.4) is convex (generally speaking, it is nonconvex).
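For intuition, the gradient-free oracle and Algorithm 1 above can be sketched as follows. The target function, the feasible ball, the noise level δ, and all constants are made up for the toy run; in particular, the objective is a simple convex quadratic (L = 2, m = 2) observed through a zero-order oracle with a bounded error of magnitude 1e-6, and h = 1/(8mL) as in Theorem 1.

```python
import math
import random

def g_tau(f_tilde, x, tau, rng):
    # Biased gradient-free oracle: (m/tau) * (f~(x + tau*xi) - f~(x)) * xi,
    # with xi drawn uniformly from the unit Euclidean sphere.
    m = len(x)
    xi = [rng.gauss(0.0, 1.0) for _ in range(m)]
    n = math.sqrt(sum(v * v for v in xi))
    xi = [v / n for v in xi]
    c = (m / tau) * (f_tilde([a + tau * b for a, b in zip(x, xi)]) - f_tilde(x))
    return [c * v for v in xi]

def project_ball(x, center, R):
    # Euclidean projection onto X = {x : ||x - center||_2 <= R}.
    d = [a - c for a, c in zip(x, center)]
    n = math.sqrt(sum(v * v for v in d))
    return list(x) if n <= R else [c + R * v / n for c, v in zip(center, d)]

def algorithm1(f_tilde, x0, center, R, h, tau, M, seed=0):
    # Algorithm 1: projected steps along the gradient-free oracle; the output
    # is the best point visited (compared here via the inexact oracle values).
    rng = random.Random(seed)
    x, best = list(x0), list(x0)
    for _ in range(M):
        g = g_tau(f_tilde, x, tau, rng)
        x = project_ball([a - h * b for a, b in zip(x, g)], center, R)
        if f_tilde(x) < f_tilde(best):
            best = list(x)
    return best

# Toy run on a hypothetical quadratic with bounded oracle error delta = 1e-6.
noise = random.Random(1)
f_tilde = lambda x: sum((v - 0.5) ** 2 for v in x) + 1e-6 * noise.uniform(-1.0, 1.0)
x_best = algorithm1(f_tilde, [1.0, 1.0], center=[1.0, 1.0], R=1.0,
                    h=1.0 / 32, tau=1e-2, M=400)
```

Note the trade-off visible in Theorem 1: shrinking τ reduces the smoothing bias τ²L(m+8)/8 but inflates the noise term δmD/(4τ), so τ must be tuned to the oracle accuracy δ.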
We choose the desired accuracy ε of the approximation of f* (the optimal value) in the sense that E_{Ξ_{M−1}} f(y_M) − f* ≤ ε. In accordance with Theorem 1, ε determines the number of steps M of Algorithm 1, the value of τ, and the value of the required accuracy δ of the inexact zero-order oracle. The value δ, by Lemma 1, gives the number of steps N of Method 3.1 required to calculate a δ-approximation f̃(φ, δ) of f(φ). Then the inexact zero-order oracle f̃(φ, δ) is used to make a step of Algorithm 1. Theorem 1 and the choice of the feasible set Φ as a Euclidean ball make it natural to choose the ∥·∥₂-norm in the space R^m of the parameter φ. It is easy to see that in this norm diam Φ ≤ 2R. Algorithm 2 in the Supplementary Materials is a formal record of these ideas.

The most computationally hard operations on each iteration of the main cycle of this method are the calculations of f̃(φ_k + τξ_k, δ) and f̃(φ_k, δ). Using Lemma 1, we obtain the complexity of each iteration and the following result, which gives the complexity of Algorithm 2.

Theorem 2. Assume that the set Φ in (2.5) is chosen in a way such that f(φ) is convex on Φ and some φ* ∈ arg min_{φ∈Φ} f(φ) belongs also to int Φ. Then the mean total number of arithmetic operations of Algorithm 2 for the accuracy ε (i.e., for the inequality E_{Ξ_{M−1}} f(φ̂_M) − f(φ*) ≤ ε to hold) is not more than

768mps|Q| (LR²/ε) ( m + (1/α) ln( 128√2 mrR √(L(m + 8)) / ε^{3/2} ) + 6r ).

5 GRADIENT-BASED OPTIMIZATION METHODS

In this section, we first develop a general framework of gradient methods with inexact oracle for non-convex problems from a rather general class and then apply it to the particular Problem 2.5. Lemma 1 and Lemma 2 allow us to control the accuracy of the inexact first-order oracle and hence to apply the proposed framework.

5.1 GENERAL FRAMEWORK

In this subsection, we generalize the approach of [7] to constrained non-convex optimization problems. Our main contribution consists in developing this framework for an inexact first-order oracle and an unknown "Lipschitz constant" of this oracle.

We consider a composite optimization problem of the form min_{x∈X} {ψ(x) := f(x) + h(x)}, where X ⊂ E is a closed convex set and h(x) is a simple convex function, e.g. ∥x∥₁. We assume that f(x) is
This is a generalization\nof the concept of (\u03b4, L)-oracle considered in [25] for convex problems.\nWe choose a prox-function d(x) which is continuously differentiable and 1-strongly convex on X\nwith respect to (cid:107) \u00b7 (cid:107). This means that for any x, y \u2208 X d(y) \u2212 d(x) \u2212 (cid:104)\u2207d(x), y \u2212 x(cid:105) \u2265 1\n2(cid:107)y \u2212 x(cid:107)2.\nWe de\ufb01ne also the corresponding Bregman distance V (x, z) = d(x) \u2212 d(z) \u2212 (cid:104)\u2207d(z), x \u2212 z(cid:105).\n\n(cid:107)x \u2212 y(cid:107)2 + \u03b4.\n\nAlgorithm 2 Adaptive projected gradient algorithm\n\nInput: Point x0 \u2208 X, number L0 > 0.\nSet k = 0, z = +\u221e.\nrepeat\n\nSet Mk = Lk, \ufb02ag = 0.\nrepeat\n\n16Mk\n\n. Calculate \u02dcf (xk, \u03b4) and \u02dcg(xk, \u03b4).\n\nSet \u03b4 = \u03b5\nFind wk = arg minx\u2208Q {(cid:104)\u02dcg(xk, \u03b4), x(cid:105) + MkV (x, xk) + h(x)} and calculate \u02dcf (wk, \u03b4).\nIf the inequality \u02dcf (wk, \u03b4) \u2264 \u02dcf (xk, \u03b4) + (cid:104)\u02dcg(xk, \u03b4), wk \u2212 xk(cid:105) + Mk\nset \ufb02ag = 1. Otherwise set Mk = 2Mk.\n\n2 (cid:107)wk \u2212 xk(cid:107)2 + \u03b5\n\n8Mk\n\nholds,\n\nuntil \ufb02ag = 1\nSet xk+1 = wk, Lk+1 = Mk\n2 .\nIf (cid:107)Mk(xk \u2212 xk+1)(cid:107) < z, set z = (cid:107)Mk(xk \u2212 xk+1)(cid:107), K = k.\nSet k = k + 1.\n\nuntil z \u2264 \u03b5\nOutput: The point xK+1.\n\nTheorem 3. Assume that f (x) is endowed with the inexact \ufb01rst-order oracle in a sense (5.1) and\nthat there exists a number \u03c8\u2217 > \u2212\u221e such that \u03c8(x) \u2265 \u03c8\u2217 for all x \u2208 X. Then after M iterations of\nAlgorithm 2 it holds that (cid:107)MK(xK \u2212 xK+1)(cid:107)2 \u2264 4L(\u03c8(x0)\u2212\u03c8\u2217)\n2 . 
Moreover, the total number of\ninexact oracle calls is not more than 2M + 2 log2\n\n+ \u03b5\n\nM +1\n\n.\n\n2L\nL0\n\nThe full proof of the theorem is in Supplementary Materials.\n\n5.2 SOLVING THE LEARNING PROBLEM\n\n\u221a\n\nIn this subsection, we return to Problem 2.5 and apply the results of the previous subsection. Note\nthat we can not directly apply the results of [7] due to the inexactness of the oracle. For this problem,\nh(\u00b7) \u2261 0. It is easy to show that in 1-norm diam\u03a6 \u2264 2R\nm. For any \u03b4 > 0, Lemma 1 with \u03b41 = \u03b4\n2\nallows us to obtain \u02dcf (\u03d5, \u03b41) such that inequality | \u02dcf (\u03d5, \u03b41) \u2212 f (\u03d5)| \u2264 \u03b41 holds and Lemma 2 with\nm allows us to obtain \u02dcg(\u03d5, \u03b42) such that inequality (cid:107)\u02dcg(\u03d5, \u03b42) \u2212 \u2207f (\u03d5)(cid:107)\u221e \u2264 \u03b42 holds.\n\u221a\n\u03b42 = \u03b4\n4R\nSimilar to [25], since f \u2208 C 1,1\nL ((cid:107) \u00b7 (cid:107)2), these two inequalities lead to Inequality 5.1 for \u02dcf (\u03d5, \u03b41) in\nthe role of \u02dcf (x, \u03b4), \u02dcg(\u03d5, \u03b42) in the role of \u02dcg(x, \u03b4) and (cid:107) \u00b7 (cid:107)2 in the role of (cid:107) \u00b7 (cid:107).\nWe choose the desired accuracy \u03b5 for approximating the stationary point of Problem 2.5. This\naccuracy gives the required accuracy \u03b4 of the inexact \ufb01rst-order oracle for f (\u03d5) on each step of the\ninner cycle of the Algorithm 2. Knowing the value \u03b41 = \u03b4\n2 and using Lemma 1, we choose the number\nof steps N of Method 3.1 and thus approximate f (\u03d5) with the required accuracy \u03b41 by \u02dcf (\u03d5, \u03b41).\n\u221a\nKnowing the value \u03b42 = \u03b4\nm and using Lemma 2, we choose the number of steps N1 of Method 3.1\n4R\nand the number of steps N2 of Method 3.2, 3.3 and obtain the approximation \u02dcg(\u03d5, \u03b42) of \u2207f (\u03d5) with\nthe required accuracy \u03b42. 
Then we use the inexact first-order oracle (f̃(φ, δ1), g̃(φ, δ2)) to perform a step of Algorithm 2. Since Φ is a Euclidean ball, it is natural to set E = R^m and ∥·∥ = ∥·∥₂, and to choose the prox-function d(φ) = (1/2)∥φ∥₂². Then the Bregman distance is V(φ, ω) = (1/2)∥φ − ω∥₂². Algorithm 4 in the Supplementary Materials is a formal record of the above ideas.

The most computationally consuming operations of the inner cycle of Algorithm 4 are the calculations of f̃(φ_k, δ1), f̃(ω_k, δ1) and g̃(φ_k, δ2). Using Lemma 1 and Lemma 2, we obtain the complexity of each iteration. From Theorem 3 we obtain the following result, which gives the complexity of Algorithm 4.

Theorem 4. The total number of arithmetic operations in Algorithm 4 for the accuracy ε (i.e., for the inequality ∥M_K(φ_K − φ_{K+1})∥₂² ≤ ε to hold) is not more than

( 8L(f(φ0) − f*)/ε + log₂(2L/L0) ) · ( 7r|Q| + (6mps|Q|/α) ln( 1024β1rRL/(αε) ) ).

6 EXPERIMENTAL RESULTS

In this section, we compare our gradient-free and gradient-based methods with the state-of-the-art gradient-based method [21] on the web page ranking problem. In the next subsection, we describe the dataset. In Section 6.2, we report the results of the experiments.

6.1 DATA

We consider the user web browsing graph Γ_q = (V_q, E_q), q ∈ Q, introduced in [12]. Unlike a link graph, a user browsing graph is query-dependent. The set of vertices V_q consists of all different pages visited by users during their sessions started from q. The set of directed edges E_q represents all the ordered pairs of neighboring elements (ĩ, i) from such sessions.
We add a page i to the seed set U_q if and only if there is a session where i is the first page visited after submitting query q.

All experiments are performed with data of a popular commercial search engine, Yandex². We chose a random set of 600 queries Q and collected user sessions started with them. There are ≈ 11.7K vertices and ≈ 7.5K edges in the graphs Γ_q, q ∈ Q, in total. For each query, a set of pages was labelled by professional assessors with the standard 5 relevance grades (≈ 1.7K labeled query-document pairs in total). We divide our data into two parts: on the first part Q1 (50% of the set of queries Q) we train the parameters, and on the second part Q2 we test the algorithms. For each q ∈ Q and i ∈ V_q, the vector V^q_i of size m1 = 26 encodes features of the query-document pair (q, i). The vector E^q_{ĩi} of m2 = 52 features of an edge ĩ → i ∈ E_q is obtained as the concatenation of V^q_ĩ and V^q_i.

To study the dependency between the efficiency of the algorithms and the sizes of the graphs, we sort the sets Q1, Q2 in ascending order of the sizes of the respective graphs. The sets Q^1_j, Q^2_j, Q^3_j contain the first (in terms of this order) 100, 200, 300 elements respectively, for j ∈ {1, 2}.

6.2 PERFORMANCES OF THE OPTIMIZATION ALGORITHMS

We optimized the parameters φ by three methods: our gradient-free method GFN (Algorithm 2), the gradient-based method GBN (Algorithm 4), and the state-of-the-art gradient-based method GBP. The values of the hyperparameters are the following: the Lipschitz constant L = 10⁻⁴ in GFN (and L0 = 10⁻⁴ in GBN), the accuracy ε = 10⁻⁶ (in both GBN and GFN), and the radius R = 0.99 (in both GBN and GFN). On all sets of queries, we compare the final values of the loss function for GBN when L0 ∈ {10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1}; the differences are less than 10⁻⁷.
We choose L in GFN to be equal to L0 (we show how the choice of L influences the output of the gradient-free algorithm in the supplementary materials, Figure 2). Moreover, we evaluate both our gradient-based and gradient-free algorithms for different values of the accuracies. The outputs of the algorithms differ insignificantly on all test sets Q^i_2, i ∈ {1, 2, 3}, when ε ≤ 10⁻⁶. On the lower level of the state-of-the-art gradient-based algorithm, the stochastic matrix and its derivative are raised to the power 100. We evaluate GBP for different values of the step size (50, 100, 200, 500). We stop the GBP algorithm when the decrease of the loss function between the current step and the next step is less than 10⁻⁵ on the test sets.

² yandex.com

In Table 1, we present the performances of the optimization algorithms in terms of the loss function f (2.4). We also compare the algorithms with the untuned Supervised PageRank (φ = φ0 = e_m). In Figure 1, we give the outputs of the optimization algorithms on each iteration of the upper levels of the learning processes on the test set Q^3_2; similar results were obtained for the sets Q^1_2, Q^2_2.

Meth.      | Q^1_2 loss | steps | Q^2_2 loss | steps | Q^3_2 loss | steps
PR         | .00357     | 0     | .00354     | 0     | .0033      | 0
GBN        | .00279     | 12    | .00305     | 12    | .00295     | 12
GFN        | .00274     | 106   | .00297     | 106   | .00292     | 106
GBP 50s.   | .00282     | 16    | .00307     | 31    | .00295     | 40
GBP 100s.  | .00282     | 8     | .00307     | 16    | .00295     | 20
GBP 200s.  | .00283     | 4     | .00308     | 7     | .00295     | 9
GBP 500s.  | .00283     | 2     | .00308     | 2     | .00295     | 3

Table 1: Comparison of the algorithms on the test sets.

Figure 1: Values of the loss function on each iteration of the optimization algorithms on the test set Q^3_2.

GFN significantly outperforms the state-of-the-art algorithms on all test sets.
GBN significantly outperforms the state-of-the-art algorithm on Q^1_2 (we obtain the p-values of the paired t-tests for all the above differences on the test sets of queries; all these values are less than 0.005). Moreover, GBN requires fewer iterations of the upper level (until it stops) than GBP with step sizes 50 and 100 on Q^2_2, Q^3_2.
Finally, we show that the Nesterov–Nemirovski method converges to the stationary distribution faster than the power method (in the supplementary materials, Figure 2, we demonstrate the dependencies of the value of the loss function on Q^1_1 for both methods of computing the untuned Supervised PageRank ϕ = ϕ_0 = e_m).

7 CONCLUSION

We propose a gradient-free optimization method for general convex problems with an inexact zero-order oracle and an adaptive gradient method for possibly nonconvex general composite optimization problems with an inexact first-order oracle. For both methods, we provide a convergence rate analysis. We also apply our new methods to the known problem of learning a web-page ranking algorithm. Our new algorithms not only outperform the existing algorithms, but are also guaranteed to solve this learning problem. In practice, this means that these algorithms can increase the reliability and speed of a search engine. Also, to the best of our knowledge, this is the first time that the ideas of random gradient-free and gradient optimization methods are combined with an efficient method for huge-scale optimization using the concept of an inexact oracle.

Acknowledgments. The research by P. Dvurechensky and A. Gasnikov presented in Section 4 of this paper was conducted in IITP RAS and supported by the Russian Science Foundation grant (project 14-50-00150); the research presented in Section 5 was supported by RFBR.

References
[1] A. Agarwal, O. Dekel and L.
Xiao, Optimal algorithms for online convex optimization with multi-point bandit feedback, 2010, 23rd Annual Conference on Learning Theory (COLT).
[2] L. Backstrom and J. Leskovec, Supervised random walks: predicting and recommending links in social networks, 2011, WSDM.
[3] Na Dai and Brian D. Davison, Freshness Matters: In Flowers, Food, and Web Authority, 2010, SIGIR.
[4] N. Eiron, K. S. McCurley and J. A. Tomlin, Ranking the web frontier, 2004, WWW.
[5] A. Gasnikov and D. Dmitriev, Efficient randomized algorithms for PageRank problem, Comp. Math. & Math. Phys., 2015, 55(3): 1–18.
[6] B. Gao, T.-Y. Liu, W. W. Huazhong, T. Wang and H. Li, Semi-supervised ranking on very large graphs with rich metadata, 2011, KDD.
[7] S. Ghadimi and G. Lan, Stochastic first- and zeroth-order methods for nonconvex stochastic programming, SIAM Journal on Optimization, 2014, 23(4): 2341–2368.
[8] T. H. Haveliwala, Efficient computation of PageRank, Stanford University, 1999.
[9] T. H. Haveliwala, Topic-Sensitive PageRank, 2002, WWW.
[10] G. Jeh and J. Widom, Scaling Personalized Web Search, 2003, WWW.
[11] J. M. Kleinberg, Authoritative sources in a hyperlinked environment, 1998, SODA.
[12] Y. Liu, B. Gao, T.-Y. Liu, Y. Zhang, Z. Ma, S. He and H. Li, BrowseRank: Letting Web Users Vote for Page Importance, 2008, SIGIR.
[13] J. Matyas, Random optimization, Automation and Remote Control, 1965, 26: 246–253.
[14] Yu. Nesterov, Introductory Lectures on Convex Optimization, Springer, New York, 2004.
[15] Yu. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM Journal on Optimization, 2012, 22(2): 341–362.
[16] Yu. Nesterov and V. Spokoiny, Random Gradient-Free Minimization of Convex Functions, Foundations of Computational Mathematics, 2015, 1–40.
[17] Yu. Nesterov and A. Nemirovski, Finding the stationary states of Markov chains by iterative methods, Applied Mathematics and Computation, 2015, 255: 58–65.
[18] L. Page, S. Brin, R. Motwani and T. Winograd, The PageRank citation ranking: Bringing order to the web, Stanford InfoLab, 1999.
[19] M. Richardson and P. Domingos, The intelligent surfer: Probabilistic combination of link and content information in PageRank, 2002, NIPS.
[20] M. Zhukovskii, G. Gusev and P. Serdyukov, URL Redirection Accounting for Improving Link-Based Ranking Methods, 2013, ECIR.
[21] M. Zhukovskii, G. Gusev and P. Serdyukov, Supervised Nested PageRank, 2014, CIKM.
[22] M. Zhukovskii, A. Khropov, G. Gusev and P. Serdyukov, Fresh BrowseRank, 2013, SIGIR.
[23] A. L. Andrew, Convergence of an iterative method for derivatives of eigensystems, Journal of Computational Physics, 1978, 26: 107–112.
[24] A. Andrew, Iterative computation of derivatives of eigenvalues and eigenvectors, IMA Journal of Applied Mathematics, 1979, 24(2): 209–218.
[25] O. Devolder, F. Glineur and Yu. Nesterov, First-order methods of smooth convex optimization with inexact oracle, Mathematical Programming, 2013, 146(1): 37–75.
[26] Yu. Nesterov and B. T. Polyak, Cubic regularization of Newton method and its global performance, Mathematical Programming, 2006, 108(1): 177–205.
[27] Yu. Nesterov, Gradient methods for minimizing composite functions, Mathematical Programming, 2012, 140(1): 125–161.