{"title": "Beyond the Birkhoff Polytope: Convex Relaxations for Vector Permutation Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 2168, "page_last": 2176, "abstract": "The Birkhoff polytope (the convex hull of the set of permutation matrices), which is represented using $\\Theta(n^2)$ variables and constraints, is frequently invoked in formulating relaxations of optimization problems over permutations. Using a recent construction of Goemans (2010), we show that when optimizing over the convex hull of the permutation vectors (the permutahedron), we can reduce the number of variables and constraints to $\\Theta(n \\log n)$ in theory and $\\Theta(n \\log^2 n)$ in practice. We modify the recent convex formulation of the 2-SUM problem introduced by Fogel et al. (2013) to use this polytope, and demonstrate how we can attain results of similar quality in significantly less computational time for large $n$. To our knowledge, this is the first usage of Goemans' compact formulation of the permutahedron in a convex optimization problem. We also introduce a simpler regularization scheme for this convex formulation of the 2-SUM problem that yields good empirical results.", "full_text": "Beyond the Birkhoff Polytope: Convex Relaxations for Vector Permutation Problems\n\nCong Han Lim\n\nDepartment of Computer Sciences\nUniversity of Wisconsin - Madison\n\nMadison, WI 53706\n\nconghan@cs.wisc.edu\n\nStephen J. Wright\n\nDepartment of Computer Sciences\nUniversity of Wisconsin - Madison\n\nMadison, WI 53706\n\nswright@cs.wisc.edu\n\nAbstract\n\nThe Birkhoff polytope (the convex hull of the set of permutation matrices), which\nis represented using $\\Theta(n^2)$ variables and constraints, is frequently invoked in formulating\nrelaxations of optimization problems over permutations. 
Using a recent\nconstruction of Goemans [1], we show that when optimizing over the convex hull\nof the permutation vectors (the permutahedron), we can reduce the number of\nvariables and constraints to $\\Theta(n \\log n)$ in theory and $\\Theta(n \\log^2 n)$ in practice. We\nmodify the recent convex formulation of the 2-SUM problem introduced by Fogel\net al. [2] to use this polytope, and demonstrate how we can attain results of similar\nquality in significantly less computational time for large n. To our knowledge, this\nis the first usage of Goemans\u2019 compact formulation of the permutahedron in a convex\noptimization problem. We also introduce a simpler regularization scheme for\nthis convex formulation of the 2-SUM problem that yields good empirical results.\n\n1 Introduction\n\nA typical workflow for converting a discrete optimization problem over the set of permutations of n\nobjects into a continuous relaxation is as follows: (1) use permutation matrices to represent permutations;\n(2) relax to the convex hull of the set of permutation matrices (the Birkhoff polytope); (3)\nrelax other constraints to ensure convexity/continuity. Instances of this procedure appear in [3, 2].\nRepresentation of the Birkhoff polytope requires $\\Theta(n^2)$ variables, significantly more than the n\nvariables required to represent the permutation directly. The increase in dimension is unappealing,\nespecially if we are only interested in optimizing over permutation vectors, as opposed to permutations\nof a more complex object, such as a graph. The obvious alternative of using a relaxation based\non the convex hull of the set of permutations (the permutahedron) is computationally infeasible,\nbecause the permutahedron has exponentially many facets (whereas the Birkhoff polytope has only\n$n^2$ facets). 
We can achieve a better trade-off between the number of variables and facets by using\nsorting networks to construct polytopes that can be linearly projected to recover the permutahedron.\nThis construction, introduced by Goemans [1], can have as few as $\\Theta(n \\log n)$ facets, which is optimal\nup to constant factors. In this paper, we use a relaxation based on these polytopes, which we\ncall \u201csorting network polytopes.\u201d\nWe apply the sorting network polytope to the noisy seriation problem, defined as follows. Given\na noisy similarity matrix A, recover a symmetric row/column ordering of A for which the entries\ngenerally decrease with distance from the diagonal. Fogel et al. [2] introduced a convex relaxation\nof the 2-SUM problem to solve the noisy seriation problem. They proved that the solution to the 2-SUM\nproblem recovers the exact solution of the seriation problem in the \u201cnoiseless\u201d case (in which\nan ordering exists that ensures monotonic decrease of similarity measures with distance from the\ndiagonal). They further show that the formulation allows side information about the ordering to be\nincorporated, and is more robust to noise than a spectral formulation of the 2-SUM problem described\nby Atkins et al. [4]. The formulation in [2] makes use of the Birkhoff polytope. We propose\ninstead a formulation based on the sorting network polytope. Performing convex optimization over\nthe sorting network polytope requires different techniques from those described in [2]. In addition,\nwe describe a new regularization scheme, applicable both to our formulation and that of [2], that is\nmore natural for the 2-SUM problem and has good practical performance.\nThe paper is organized as follows. We begin by describing polytopes for representing permutations\nin Section 2. 
In Section 3, we introduce the seriation problem and the 2-SUM problem, describe two\ncontinuous relaxations for the latter (one of which uses the sorting network polytope), and introduce\nour regularization scheme for strengthening the relaxations. Issues that arise in using the sorting\nnetwork polytope are discussed in Section 4. In Section 5, we provide experimental results showing\nthe effectiveness of our approach. The extended version of this paper [5] includes some additional\ncomputational results, along with several proofs. It also describes an efficient algorithm for taking a\nconditional gradient step for the convex formulation, for the case in which the formulation contains\nno side information.\n\n2 Permutahedron, Birkhoff Polytope, and Sorting Networks\n\nWe use n throughout the paper to refer to the length of the permutation vectors. $\\pi_I^n = (1, 2, \\ldots, n)^T$\ndenotes the identity permutation. (When the size n can be inferred from the context, we write the\nidentity permutation as $\\pi_I$.) $P^n$ denotes the set of all permutation vectors of length n. We use\n$\\pi \\in P^n$ to denote a generic permutation, and denote its components by $\\pi(i)$, i = 1, 2, . . . , n. We\nuse $\\mathbf{1}$ to denote the vector of length n whose components are all 1.\nDefinition 2.1. The permutahedron $\\mathrm{PH}^n$, the convex hull of $P^n$, is defined as follows:\n\n$$\\mathrm{PH}^n := \\left\\{ x \\in \\mathbb{R}^n \\;\\middle|\\; \\sum_{i=1}^{n} x_i = \\frac{n(n+1)}{2}, \\ \\sum_{i \\in S} x_i \\le \\sum_{i=1}^{|S|} (n+1-i) \\text{ for all } S \\subset [n] \\right\\}.$$\n\nThe permutahedron $\\mathrm{PH}^n$ has $2^n - 2$ facets, which prevents us from using it in optimization problems\ndirectly. (We should note however that the permutahedron is a submodular polyhedron and hence\nadmits efficient algorithms for certain optimization problems.) Relaxations are commonly derived\nfrom the set of permutation matrices (the set of n \u00d7 n matrices containing zeros and ones, with a\nsingle one in each row and column) and its convex hull instead.\nDefinition 2.2. 
The convex hull of the set of n \u00d7 n permutation matrices is the Birkhoff polytope $B^n$,\nwhich is the set of all doubly-stochastic n \u00d7 n matrices $\\{X \\in \\mathbb{R}^{n \\times n} \\mid X \\ge 0, \\ X\\mathbf{1} = \\mathbf{1}, \\ X^T\\mathbf{1} = \\mathbf{1}\\}$.\nThe Birkhoff polytope has been widely used in the machine learning and computer vision communities\nfor various permutation problems (see for example [2], [3]). The permutahedron can be\nrepresented as the projection of the Birkhoff polytope from $\\mathbb{R}^{n \\times n}$ to $\\mathbb{R}^n$ by $x_i = \\sum_{j=1}^{n} j \\cdot X_{ij}$. The\nBirkhoff polytope is sometimes said to be an extended formulation of the permutahedron.\nA natural question to ask is whether a more compact extended formulation exists for the permutahedron.\nGoemans [1] answered this question in the affirmative by constructing one with $\\Theta(n \\log n)$\nconstraints and variables, which is optimal up to constant factors. His construction is based on sorting\nnetworks, a collection of wires and binary comparators that sorts a list of numbers. Figure 1\ndisplays a sorting network on 4 variables. (See [6] for further information on sorting networks.)\nGiven a sorting network on n inputs with m comparators (we will subsequently always use m to\nrefer to the number of comparators), an extended formulation for the permutahedron with O(m)\nvariables and constraints can be constructed as follows [1]. Referring to the notation in the right\nsubfigure in Figure 1, we introduce a set of constraints for each comparator k = 1, 2, . . . , m to\nindicate the relationships between the two inputs and the two outputs of each comparator:\n\n$$x^k_{(\\text{in, top})} + x^k_{(\\text{in, bot})} = x^k_{(\\text{out, top})} + x^k_{(\\text{out, bot})}, \\quad x^k_{(\\text{out, top})} \\le x^k_{(\\text{in, top})}, \\quad \\text{and} \\quad x^k_{(\\text{out, top})} \\le x^k_{(\\text{in, bot})}. \\qquad (1)$$\n\nNote that these constraints require the sum of the two inputs to be the same as the sum of the two\noutputs, but the inputs can be closer together than the outputs. Let $x^{\\text{in}}_i$ and $x^{\\text{out}}_i$, i = 1, 2, . . . , n,\ndenote the x variables corresponding to the ith input and ith output of the entire sorting network,\nrespectively. 
We introduce the additional constraints\n\n$$x^{\\text{out}}_i = i, \\quad \\text{for } i \\in [n]. \\qquad (2)$$\n\nFigure 1: A bitonic sorting network on 4 variables (left) and the k-th comparator (right). The input\nto the sorting network is on the left and the output is on the right. At each comparator, we take the\ntwo input values and sort them such that the smaller value is the one at the top in the output. Sorting\ntakes place progressively as we move from left to right through the network, sorting pairs of values\nas we encounter comparators.\n\nThe details of this construction depend on the particular choice of sorting network (see Section 4),\nbut we will refer to it generically as the sorting network polytope $\\mathrm{SN}^n$. Each element in this\npolytope can be viewed as a concatenation of two vectors: the subvector associated with the network\ninputs $x^{\\text{in}} = (x^{\\text{in}}_1, x^{\\text{in}}_2, \\ldots, x^{\\text{in}}_n)$, and the rest of the coordinates $x^{\\text{rest}}$, which includes all the internal\nvariables as well as the outputs. The following theorem attests to the fact that any input vector $x^{\\text{in}}$\nthat is part of a feasible vector for the entire network is a point in the permutahedron:\nTheorem 2.3 (Goemans [1]). The set $\\{x^{\\text{in}} \\mid (x^{\\text{in}}, x^{\\text{rest}}) \\in \\mathrm{SN}^n\\}$ is the permutahedron $\\mathrm{PH}^n$.\n\n3 Convex Relaxations of 2-SUM via Sorting Network Polytope\n\nIn this section we will briefly describe the seriation problem, and some of the continuous relaxations\nof the combinatorial 2-SUM problem that can be used to solve this problem.\n\nThe Noiseless Seriation Problem. 
The term seriation generally refers to data analysis techniques\nthat arrange objects in a linear ordering in a way that fits available information and thus reveals\nthe underlying structure of the system [7]. We adopt here the definition of the seriation problem from\n[4]. Suppose we have n objects arranged along a line, and a similarity function that decreases with\ndistance between objects in the line. The similarity matrix is the symmetric n \u00d7 n matrix whose\n(i, j) entry is the similarity measure between the ith and jth objects in the linear arrangement. This\nsimilarity matrix is an R-matrix, according to the following definition.\nDefinition 3.1. A symmetric matrix A is a Robinson matrix (R-matrix) if for all points (i, j) where\ni > j, we have $A_{ij} \\le \\min(A_{(i-1),j}, A_{i,(j+1)})$. A symmetric matrix A is a pre-R matrix if $\\Pi^T A \\Pi$ is\nR for some permutation $\\Pi$.\n\nIn other words, a symmetric matrix is an R-matrix if the entries are nonincreasing as we move away\nfrom the diagonal in either the horizontal or vertical direction. The goal of the noiseless seriation\nproblem is to recover the ordering of the variables along the line from the pairwise similarity data,\nwhich is equivalent to finding the permutation that recovers an R-matrix from a pre-R-matrix.\nThe seriation problem was introduced in the archaeology literature [8], and has applications across\na wide range of areas including clustering [9], shotgun DNA sequencing [2], and taxonomy [10].\nR-matrices are useful in part because of their relation to the consecutive-ones property in a matrix\nof zeros and ones, where the ones in each column form a contiguous block. A matrix M with the\nconsecutive-ones property gives rise to an R-matrix $M M^T$.\n\nNoisy Seriation, 2-SUM and Continuous Relaxations. 
Given a binary symmetric matrix A, the\n2-SUM problem can be expressed as follows:\n\n$$\\min_{\\pi \\in P^n} \\ \\sum_{i=1}^{n} \\sum_{j=1}^{n} A_{ij} (\\pi(i) - \\pi(j))^2. \\qquad (3)$$\n\nA slightly simpler but equivalent formulation, defined via the Laplacian $L_A = \\mathrm{diag}(A\\mathbf{1}) - A$, is\n\n$$\\min_{\\pi \\in P^n} \\ \\pi^T L_A \\pi. \\qquad (4)$$\n\nThe seriation problem is closely related to the combinatorial 2-SUM problem, and Fogel et al. [2]\nproved that if A is a pre-R-matrix such that each row/column has unique entries, then the solution to\nthe 2-SUM problem also solves the noiseless seriation problem. In another relaxation of the 2-SUM\nproblem, Atkins et al. [4] demonstrate that finding the eigenvector associated with the second smallest\neigenvalue of $L_A$ (the Fiedler vector; the eigenvalue itself is known as the Fiedler value) solves the\nnoiseless seriation problem. Hence, the 2-SUM problem provides a good\nmodel for the noisy seriation problem, where the similarity matrices are close to, but not exactly,\npre-R matrices.\nThe 2-SUM problem is known to be NP-hard [11], so we seek efficient relaxations. We describe\nbelow two continuous relaxations that are computationally practical. (Other relaxations of these\nproblems require solution of semidefinite programs and are intractable in practice for large n.)\nThe spectral formulation of [4] seeks the Fiedler value by searching over the space orthogonal to the\nvector $\\mathbf{1}$, which is the eigenvector that corresponds to the zero eigenvalue. The Fiedler value is the\noptimal objective value of the following problem:\n\n$$\\min_{y \\in \\mathbb{R}^n} \\ y^T L_A y \\quad \\text{such that} \\quad y^T \\mathbf{1} = 0, \\ \\|y\\|_2 = 1. \\qquad (5)$$\n\nThis problem is non-convex, but its solution can be found efficiently from an eigenvalue decomposition\nof $L_A$. 
With Fiedler vector y, one can obtain a candidate solution to the 2-SUM problem\nby picking the permutation $\\pi \\in P^n$ to have the same ordering as the elements of y. The spectral\nformulation (5) is a continuous relaxation of the 2-SUM problem (4).\nThe second relaxation of (4), described by Fogel et al. [2], makes use of the Birkhoff polytope $B^n$.\nThe basic version of the formulation is\n\n$$\\min_{\\Pi \\in B^n} \\ \\pi_I^T \\Pi^T L_A \\Pi \\pi_I \\qquad (6)$$\n\n(recall that $\\pi_I$ is the identity permutation $(1, 2, \\ldots, n)^T$), which is a convex quadratic program over\nthe $n^2$ components of $\\Pi$. Fogel et al. augment and enhance this formulation as follows.\n\n\u2022 Introduce a \u201ctiebreaking\u201d constraint $e_1^T \\Pi \\pi_I + 1 \\le e_n^T \\Pi \\pi_I$ to resolve ambiguity about the\ndirection of the ordering, where $e_k = (0, \\ldots, 0, 1, 0, \\ldots, 0)^T$ with the 1 in position k.\n\u2022 Average over several perturbations of $\\pi_I$ to improve robustness of the solution.\n\u2022 Add a penalty to maximize the Frobenius norm of the matrix $\\Pi$, which pushes the solution\ncloser to a vertex of the Birkhoff polytope.\n\u2022 Incorporate additional ordering constraints of the form $x_i - x_j \\le \\delta_k$, to exploit prior\nknowledge about the ordering.\n\nWith these modifications, the problem to be solved is\n\n$$\\min_{\\Pi \\in B^n} \\ \\frac{1}{p} \\mathrm{Trace}(Y^T \\Pi^T L_A \\Pi Y) - \\frac{\\mu}{p} \\|P\\Pi\\|_F^2 \\quad \\text{such that} \\quad D\\Pi\\pi_I \\le \\delta, \\qquad (7)$$\n\nwhere each column of $Y \\in \\mathbb{R}^{n \\times p}$ is a slightly perturbed version of a permutation (see footnote 1), $\\mu$ is the regularization\ncoefficient, the constraint $D\\Pi\\pi_I \\le \\delta$ contains the ordering information and tiebreaking\nconstraints, and the operator $P = I - \\frac{1}{n}\\mathbf{1}\\mathbf{1}^T$ is the projection of $\\Pi$ onto elements orthogonal to\nthe all-ones matrix. 
The penalization is applied to $\\|P\\Pi\\|_F^2$ rather than to $\\|\\Pi\\|_F^2$ directly, thus ensuring\nthat the program remains convex if the regularization factor is sufficiently small (for which\na sufficient condition is $\\mu < \\lambda_2(L_A)\\lambda_1(Y Y^T)$). We will refer to this regularization scheme as\nthe matrix-based regularization, and to the formulation (7) as the matrix-regularized Birkhoff-based\nconvex formulation.\nFigure 2 illustrates the permutahedron in the case of n = 3, and compares minimization of the objective\n$y^T L_A y$ over the permutahedron (as attempted by the convex formulation) with minimization of\nthe same objective over the constraints in the spectral formulation (5). The spectral method returns\ngood solutions when the noise is low, and it is computationally efficient since there are many fast\nalgorithms and software for obtaining selected eigenvectors. However, the Birkhoff-based convex\nformulation can return a solution that is significantly better in situations with high noise or significant\nadditional ordering information. For the rest of this section, we will focus on the convex\nformulation.\n\nFootnote 1: In [2], each column of Y is said to contain a perturbation of $\\pi_I$, but in a response to referees of their paper,\nthe authors say that they used sorted uniform random vectors instead in the revised version.\n\nFigure 2: A geometric interpretation of spectral and convex formulation solutions on the 3-permutahedron.\nThe left image shows the 3-permutahedron in 3D space and the dashed line shows\nthe eigenvector $\\mathbf{1}$ corresponding to the zero eigenvalue. The right image shows the projection of\nthe 3-permutahedron along the trivial eigenvector together with the elliptical level curves of the\nobjective function $y^T L_A y$. Points on the circumscribed circle have an $\\ell_2$-norm equal to that of a\npermutation, and the objective is minimized over this circle at the point denoted by a cross. 
The\nvertical line in the right figure enforces the tiebreaking constraint that 1 must appear before 3 in the\nordering; the red dot indicates the minimizer of the objective over the resulting triangular feasible\nregion. Without the tiebreaking constraint, the minimizer is at the center of the permutahedron.\n\nA Compact Convex Relaxation via the Permutahedron/Sorting Network Polytope and a New\nRegularization Scheme. We consider now a different relaxation for the 2-SUM problem (4). Taking\nthe convex hull of $P^n$ directly, we obtain\n\n$$\\min_{x \\in \\mathrm{PH}^n} \\ x^T L_A x. \\qquad (8)$$\n\nThis is essentially a permutahedron-based version of (6). In fact, the two problems are equivalent, except\nthat formulation (8) is more compact when we enforce $x \\in \\mathrm{PH}^n$ via the sorting network constraints\n\n$$x \\in \\{x^{\\text{in}} \\mid (x^{\\text{in}}, x^{\\text{rest}}) \\in \\mathrm{SN}^n\\},$$\n\nwhere $\\mathrm{SN}^n$ incorporates the comparator constraints (1) and the output constraints (2). This formulation\ncan be enhanced and augmented in a similar fashion to (6). The tiebreaking constraint\nfor this formulation can be expressed simply as $x_1 + 1 \\le x_n$, since $x^{\\text{in}}$ consists of the subvector\n$(x_1, x_2, \\ldots, x_n)$. (In both (8) and (6), having at least one additional constraint is necessary to\nremove the trivial solution given by the center of the permutahedron or Birkhoff polytope; see Figure 2.)\nThis constraint is the strongest inequality that will not eliminate any permutation (assuming\nthat a permutation and its reverse are equivalent); we include a proof of this fact in [5].\nIt is also helpful to introduce a penalty to force the solution x to be closer to a permutation, that is, a\nvertex of the permutahedron. To this end, we introduce a vector-based regularization scheme. The\nfollowing statement is an immediate consequence of strict convexity of norms.\nProposition 3.2. Let $v \\in \\mathbb{R}^n$, and let X be the convex hull of all permutations of v. 
Then, the points\nin X with the highest $\\ell_p$ norm, for 1 < p < \u221e, are precisely the permutations of v.\nIt follows that adding a penalty to encourage $\\|x\\|_2$ to be large might improve solution quality. However,\ndirectly penalizing the negative of the 2-norm of x would destroy convexity, since $L_A$ has a\nzero eigenvalue. Instead we penalize $Px$, where $P = I - \\frac{1}{n}\\mathbf{1}\\mathbf{1}^T$ projects onto the subspace orthogonal\nto the trivial eigenvector $\\mathbf{1}$. (Note that this projection of the permutahedron still satisfies the\nassumptions of Proposition 3.2.) When we include a penalty on $\\|Px\\|_2^2$ in the formulation (8) along\nwith side constraints $Dx \\le \\delta$ on the ordering, we obtain the objective $x^T L_A x - \\mu\\|Px\\|_2^2$, which\nleads to\n\n$$\\min_{x \\in \\mathrm{PH}^n} \\ x^T (L_A - \\mu P) x \\quad \\text{such that} \\quad Dx \\le \\delta. \\qquad (9)$$\n\nThis objective is convex when $\\mu \\le \\lambda_2(L_A)$, a looser condition on $\\mu$ than is the case in matrix-based\nregularization. We will refer to (9) as the regularized permutahedron-based convex formulation.\nVector-based regularization can also be incorporated into the Birkhoff-based convex formulation.\nInstead of maximizing the $\\|P\\Pi\\|_F^2$ term in formulation (7) to force the solution to be closer to a permutation,\nwe could maximize $\\|P\\Pi Y\\|_F^2$. The vector-regularized version of (6) with side constraints\ncan be written as follows:\n\n$$\\min_{\\Pi \\in B^n} \\ \\frac{1}{p} \\mathrm{Trace}(Y^T \\Pi^T (L_A - \\mu P) \\Pi Y) \\quad \\text{such that} \\quad D\\Pi\\pi_I \\le \\delta. \\qquad (10)$$\n\nWe refer to this formulation as the vector-regularized Birkhoff-based convex formulation. Vector-based\nregularization is in some sense more natural than the regularization in (7). It acts directly\non the set that we are optimizing over, rather than an expanded set. The looser condition $\\mu \\le \\lambda_2(L_A)$\nallows for stronger regularization. 
Experiments reported in [5] show that the vector-based\nregularization produces permutations that are consistently better than those obtained from the matrix-based\nregularization.\nThe regularized permutahedron-based convex formulation is a convex QP with O(m) variables and\nconstraints, where m is the number of comparators in its sorting network, while the Birkhoff-based\none is a convex QP with $O(n^2)$ variables. The one feature in the Birkhoff-based formulations that\nthe permutahedron-based formulations do not have is the ability to average the solution over multiple\nvectors by choosing p > 1 columns in the matrix $Y \\in \\mathbb{R}^{n \\times p}$. However, our experiments suggested\nthat the best solutions were obtained for p = 1, so this consideration was not important in practice.\n\n4 Key Implementation Issues\n\nChoice of Sorting Network. There are numerous possible choices of the sorting network, from\nwhich the constraints in formulation (9) are derived. The asymptotically most compact option is\nthe AKS sorting network, which contains $\\Theta(n \\log n)$ comparators. This network was introduced in\n[12] and subsequently improved by others, but is impractical because of its difficulty of construction\nand the large constant factor in the complexity expression. We opt instead for more elegant\nnetworks with slightly worse asymptotic complexity. Batcher [13] introduced two sorting networks\nwith $\\Theta(n \\log^2 n)$ size (the odd-even sorting network and the bitonic sorting network) that are\npopular and practical. The sorting network polytope based on these can be generated with a simple\nrecursive algorithm in $\\Theta(n \\log^2 n)$ time.\n\nObtaining Permutations from a Point in the Permutahedron. Solution of the permutahedron-based\nrelaxation yields a point x in the permutahedron, but we need techniques to convert this point\ninto a valid permutation, which is a candidate solution for the 2-SUM problem (3). 
The most obvious\nrecovery technique is to choose this permutation $\\pi$ to have the same ordering as the elements of\nx, that is, $x_i < x_j$ implies $\\pi(i) < \\pi(j)$, for all i, j \u2208 {1, 2, . . . , n}. We could also sample multiple\npermutations, by applying Gaussian noise to the components of x prior to taking the ordering to produce\n$\\pi$. (We used i.i.d. noise with variance 0.5.) The 2-SUM objective (3) can be evaluated for each\npermutation so obtained, with the best one being reported as the overall solution. This inexpensive\nrandomized recovery procedure can be repeated many times, and it yields significantly better results\nthan the single \u201cobvious\u201d ordering.\n\nSolving the Convex Formulation. On our test machine using the Gurobi interior point solver,\nwe were able to solve instances of the permutahedron-based convex formulation (9) of size up to\naround n = 10000. As in [2], first-order methods can be employed when the scale is larger. In [5],\nwe provide an optimal O(n log n) algorithm for taking a conditional gradient step, in the case in which only the tiebreaking\nconstraint is present, with no additional ordering constraints.\n\n5 Experiments\n\nWe compare the run time and solution quality of algorithms on the two classes of convex formulations\n(Birkhoff-based and permutahedron-based) with various parameters. Summary results are\npresented in this section. Additional results, including more extensive experiments comparing the\neffects of different parameters on the solution quality, appear in [5].\n\nExperimental Setup. The experiments were run on an Intel Xeon X5650 (24 cores @ 2.66 GHz)\nserver with 128 GB of RAM, using MATLAB 7.13, CVX 2.0 ([14],[15]), and Gurobi 5.5 [16]. We\ntested four formulation-algorithm-implementation variants, as follows. 
(i) Spectral method using the\nMATLAB eigs function, (ii) MATLAB/Gurobi on the permutahedron-based convex formulation,\n(iii) MATLAB/Gurobi on the Birkhoff-based convex formulation with p = 1 (that is, formulation\n(7) with $Y = \\pi_I$), and (iv) Experimental MATLAB code provided to us by the authors of [2]\nimplementing FISTA, for solving the matrix-regularized Birkhoff-based convex formulation (7),\nwith projection steps solved using block coordinate ascent on the dual problem. This is the current\nstate-of-the-art algorithm for large instances of the Birkhoff-based convex formulation; we refer\nto it as RQPS (for \u201cRegularized QP for Seriation\u201d). We report run time data using wall clock time\nreported by Gurobi, and MATLAB timings for RQPS, excluding all preprocessing time. We used the\nbitonic sorting network by Batcher [13] for experiments with the permutahedron-based formulation.\n\nLinear Markov Chain. The Markov chain reordering problem [2] involves recovering the ordering\nof a simple Markov chain with Gaussian noise from disordered samples. The Markov chain\nconsists of random variables $X_1, X_2, \\ldots, X_n$ such that $X_i = bX_{i-1} + \\epsilon_i$, where b is a positive\nconstant and $\\epsilon_i \\sim N(0, \\sigma^2)$. A sample covariance matrix taken over multiple independent samples\nof the Markov chain with permuted labels is used as the similarity matrix in the 2-SUM problem.\nWe use this problem for two different comparisons. First, we compare the solution quality and\nrunning time of the RQPS algorithm of [2] with the Gurobi interior-point solver on the regularized\npermutahedron-based convex formulation, to demonstrate the performance of the formulation and\nalgorithm introduced in this paper compared with the prior state of the art. 
Second, we apply Gurobi\nto both the permutahedron-based and Birkhoff-based formulations with p = 1, with the goal of\ndiscovering which formulation is more efficient in practice.\nFor both sets of experiments, we fixed b = 0.999 and $\\sigma = 0.5$ and generated 50 chains to form a sample\ncovariance matrix. We chose n \u2208 {500, 2000, 5000} to see how algorithm performance scales\nwith n. For each n, we performed 10 independent runs, each based on a different set of samples of the\nMarkov chain (and hence a different sample covariance matrix). We added n ordering constraints\nfor each run. Each ordering constraint is of the form $x_i + \\pi(j) - \\pi(i) \\le x_j$, where $\\pi$ is the (known)\npermutation that recovers the original matrix, and i, j \u2208 [n] is a pair randomly chosen but satisfying\n$\\pi(j) - \\pi(i) > 0$. We used a regularization parameter of $\\mu = 0.9\\lambda_2(L_A)$ on all formulations.\n\nRQPS and the Permutahedron-Based Formulation. We compare the RQPS code for the matrix-regularized\nBirkhoff-based convex formulation (7) to the regularized permutahedron-based convex\nformulation, solved with Gurobi. We fixed a time limit for each value of n, and ran the RQPS\nalgorithm until the limit was reached. At fixed time intervals, we queried the current solution and\nsampled permutations from that point.\n\nFigure 3: Plot of 2-SUM objective over time (in seconds) for n \u2208 {500, 2000, 5000}. We choose the\nrun (out of ten) that shows the best results for RQPS relative to the interior-point algorithm for the\nregularized permutahedron-based formulation. We test four different variants of RQPS. The curves\nrepresent performance of the RQPS code for varying values of p (1 for red/green and n for blue/cyan)\nand the cap on the maximum number of iterations for the projection step (10 for red/blue and 100 for\ngreen/cyan). 
The white square represents the spectral solution, and the magenta diamond represents\nthe solution returned by Gurobi for the permutahedron-based formulation. The horizontal axis in\neach graph is positioned at the 2-SUM objective corresponding to the permutation that recovers the\noriginal labels for the sample covariance matrix.\n\nFor RQPS, with a cap of 10 iterations within each projection step, the objective tends to descend\nrapidly to a certain level, then fluctuates around that level (or gets slightly worse) for the rest of the\nrunning time. For a limit of 100 iterations, there is less fluctuation in 2-SUM value, but it takes some\ntime to produce a solution as good as the previous case. In contrast to experience reported in [2],\nvalues of p greater than 1 do not seem to help; our runs for p = n plateaued at higher values of the\n2-SUM objective than those with p = 1.\nIn most cases, the regularized permutahedron-based formulation gives a better solution value than\nthe RQPS method, but there are occasional exceptions to this trend. For example, in the third run for\nn = 500 (the left plot in Figure 3), one variant of RQPS converges to a solution that is significantly\nbetter. Despite its very fast runtimes, the spectral method does not yield solutions of competitive\nquality, because it cannot make use of the side constraint information.\n\nDirect Comparison of Birkhoff and Permutahedron Formulations. For the second set of experiments,\nwe compare the convergence rate of the objective value in the Gurobi interior-point solver\napplied to two equivalent formulations: the vector-regularized Birkhoff-based convex formulation\n(10) with p = 1 and the regularized permutahedron-based convex formulation (9). 
For each choice\nof input matrix and sampled ordering information, we ran the Gurobi interior-point method on both formulations. In Figure 4,\nwe plot at each iteration the difference between the primal objective and a baseline objective value v.\n\nFigure 4: Plot of the difference of the 2-SUM objective from the baseline objective over time (in\nseconds) for n \u2208 {2000, 5000}. The red curve represents performance of the permutahedron-based\nformulation; the blue curve represents the performance of the Birkhoff-based formulation. We display\nthe best run (out of ten) for the Birkhoff-based formulation for each n. When n = 2000, the\npermutahedron-based formulation converges slightly faster in most cases. However, once we scale\nup to n = 5000, the permutahedron-based formulation converges significantly faster in all tests.\n\nOur comparisons show that the permutahedron-based formulation tends to yield better solutions in\nfaster times than Birkhoff-based formulations, regardless of the algorithm used to solve the latter.\nThe advantage of the permutahedron-based formulation is more pronounced when n is large.\n\n6 Future Work and Acknowledgements\n\nWe hope that this paper spurs further interest in using sorting networks in the context of other more\ngeneral classes of permutation problems, such as graph matching or ranking. A direct adaptation of\nthis approach is inadequate, since the permutahedron does not uniquely describe a convex combination\nof permutations, which is how the Birkhoff polytope is used in many such problems. However,\nwhen the permutation problem has a solution in the Birkhoff polytope that is close to an actual permutation,\nwe should expect the loss of information when projecting this point in the Birkhoff\npolytope to the permutahedron to be insignificant.\nWe thank Okan Akalin and Taedong Kim for helpful comments and suggestions for the experiments.\nWe thank the anonymous referees for feedback that improved the paper\u2019s presentation. 
We also thank\nthe authors of [2] for sharing their experimental code, and Fajwel Fogel for helpful discussions.\nLim\u2019s work on this project was supported in part by NSF Awards DMS-0914524 and DMS-1216318,\nand a grant from ExxonMobil. Wright\u2019s work was supported in part by NSF Award DMS-1216318,\nONR Award N00014-13-1-0129, DOE Award DE-SC0002283, AFOSR Award FA9550-13-1-0138,\nand Subcontract 3F-30222 from Argonne National Laboratory.\n\nReferences\n\n[1] M. Goemans, \u201cSmallest compact formulation for the permutahedron,\u201d Working Paper, 2010.\n\n[2] F. Fogel, R. Jenatton, F. Bach, and A. D\u2019Aspremont, \u201cConvex Relaxations for Permutation\nProblems,\u201d in Advances in Neural Information Processing Systems, 2013, pp. 1016\u20131024.\n\n[3] M. Fiori, P. Sprechmann, J. Vogelstein, P. Muse, and G. Sapiro, \u201cRobust Multimodal Graph\nMatching: Sparse Coding Meets Graph Matching,\u201d in Advances in Neural Information Processing\nSystems, 2013, pp. 127\u2013135.\n\n[4] J. E. Atkins, E. G. Boman, and B. Hendrickson, \u201cA Spectral Algorithm for Seriation and the\nConsecutive Ones Problem,\u201d SIAM Journal on Computing, vol. 28, no. 1, pp. 297\u2013310, Jan.\n1998.\n\n[5] C. H. Lim and S. J. Wright, \u201cBeyond the Birkhoff Polytope: Convex Relaxations for Vector\nPermutation Problems,\u201d arXiv:1407.6609, 2014.\n\n[6] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson, Introduction to Algorithms, 2nd ed.\nMcGraw-Hill Higher Education, 2001.\n\n[7] I. Liiv, \u201cSeriation and matrix reordering methods: An historical overview,\u201d Statistical Analysis\nand Data Mining, vol. 3, no. 2, pp. 70\u201391, 2010.\n\n[8] W. S. Robinson, \u201cA Method for Chronologically Ordering Archaeological Deposits,\u201d American\nAntiquity, vol. 16, no. 4, p. 293, Apr. 1951.\n\n[9] C. Ding and X. 
He, \u201cLinearized cluster assignment via spectral ordering,\u201d in Twenty-first international\nconference on Machine learning - ICML \u201904. New York, New York, USA: ACM\nPress, Jul. 2004, p. 30.\n\n[10] R. Sokal and P. H. A. Sneath, Principles of Numerical Taxonomy. London: W. H. Freeman,\n1963.\n\n[11] A. George and A. Pothen, \u201cAn Analysis of Spectral Envelope Reduction via Quadratic Assignment\nProblems,\u201d SIAM Journal on Matrix Analysis and Applications, vol. 18, no. 3, pp.\n706\u2013732, Jul. 1997.\n\n[12] M. Ajtai, J. Koml\u00f3s, and E. Szemer\u00e9di, \u201cAn O(n log n) sorting network,\u201d in Proceedings of the\nfifteenth annual ACM symposium on Theory of computing - STOC \u201983. New York, New York,\nUSA: ACM Press, Dec. 1983, pp. 1\u20139.\n\n[13] K. E. Batcher, \u201cSorting networks and their applications,\u201d in Proceedings of the April 30\u2013May\n2, 1968, spring joint computer conference on - AFIPS \u201968 (Spring). New York, New York,\nUSA: ACM Press, Apr. 1968, p. 307.\n\n[14] M. Grant and S. Boyd, \u201cCVX: Matlab software for disciplined convex programming, version\n2.0,\u201d http://cvxr.com/cvx, Aug. 2012.\n\n[15] \u2014\u2014, \u201cGraph implementations for nonsmooth convex programs,\u201d in Recent Advances in\nLearning and Control, ser. Lecture Notes in Control and Information Sciences, V. Blondel,\nS. Boyd, and H. Kimura, Eds. Springer-Verlag Limited, 2008, pp. 95\u2013110,\nhttp://stanford.edu/~boyd/graph_dcp.html.\n\n[16] Gurobi Optimizer Reference Manual, Gurobi Optimization, Inc., 2014. [Online]. Available:\nhttp://www.gurobi.com", "award": [], "sourceid": 1143, "authors": [{"given_name": "Cong Han", "family_name": "Lim", "institution": "University of Wisconsin - Madison"}, {"given_name": "Stephen", "family_name": "Wright", "institution": "UW-Madison"}]}