{"title": "Prismatic Algorithm for Discrete D.C. Programming Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 2106, "page_last": 2114, "abstract": "In this paper, we propose the first exact algorithm for minimizing the difference of two submodular functions (D.S.), i.e., the discrete version of the D.C. programming problem. The developed algorithm is a branch-and-bound-based algorithm which responds to the structure of this problem through the relationship between submodularity and convexity. The D.S. programming problem covers a broad range of applications in machine learning because it generalizes the optimization of a wide class of set functions. We empirically investigate the performance of our algorithm, and illustrate the difference between exact and approximate solutions respectively obtained by the proposed and existing algorithms in feature selection and discriminative structure learning.", "full_text": "Prismatic Algorithm for Discrete D.C. Programming Problem

Yoshinobu Kawahara* and Takashi Washio

The Institute of Scientific and Industrial Research (ISIR)
Osaka University
8-1 Mihogaoka, Ibaraki-shi, Osaka 567-0047 JAPAN
{kawahara,washio}@ar.sanken.osaka-u.ac.jp

Abstract

In this paper, we propose the first exact algorithm for minimizing the difference of two submodular functions (D.S.), i.e., the discrete version of the D.C. programming problem. The developed algorithm is a branch-and-bound-based algorithm which responds to the structure of this problem through the relationship between submodularity and convexity. The D.S. programming problem covers a broad range of applications in machine learning. In fact, it generalizes any set-function optimization.
We empirically investigate the performance of our algorithm, and illustrate the difference between exact and approximate solutions respectively obtained by the proposed and existing algorithms in feature selection and discriminative structure learning.

1 Introduction

Combinatorial optimization techniques have been actively applied to many machine learning applications, where submodularity often plays an important role in developing algorithms [10, 16, 27, 14, 15, 19, 1]. In fact, many fundamental problems in machine learning can be formulated as submodular optimization. One important category is the D.S. programming problem, i.e., the problem of minimizing the difference of two submodular functions. This is a natural formulation of many machine learning problems, such as learning graph matching [3], discriminative structure learning [21], feature selection [1] and energy minimization [24].

In this paper, we propose a prismatic algorithm for the D.S. programming problem, which is a branch-and-bound-based algorithm responding to the specific structure of this problem. To the best of our knowledge, this is the first exact algorithm for the D.S. programming problem (although there exists an approximate algorithm for this problem [21]). As is well known, the branch-and-bound method is one of the most successful frameworks in mathematical programming and has been incorporated into commercial software such as CPLEX [13, 12]. We develop the algorithm based on the analogy with the D.C. programming problem through the continuous relaxation of solution spaces and objective functions with the help of the Lovász extension [17, 11, 18]. The algorithm is implemented as an iterative calculation of binary-integer linear programs (BILP).

Also, we discuss applications of the D.S.
programming problem in machine learning and investigate empirically the performance of our method and the difference between exact and approximate solutions through feature selection and discriminative structure-learning problems.

The remainder of this paper is organized as follows. In Section 2, we give the formulation of the D.S. programming problem and then describe its applications in machine learning. In Section 3, we give an outline of the proposed algorithm for this problem. Then, in Section 4, we explain the details of its basic operations. Finally, we give several empirical examples using artificial and real-world datasets in Section 5, and conclude the paper in Section 6.

*http://www.ar.sanken.osaka-u.ac.jp/~kawahara/

Preliminaries and Notation: A set function f is called submodular if f(A) + f(B) ≥ f(A ∪ B) + f(A ∩ B) for all A, B ⊆ N, where N = {1, ..., n} [5, 7]. Throughout this paper, we denote by ^f the Lovász extension of f, i.e., the continuous function ^f : R^n → R defined by

^f(p) = Σ_{j=1}^{m−1} (^p_j − ^p_{j+1}) f(U_j) + ^p_m f(U_m),

where U_j = {i ∈ N : p_i ≥ ^p_j} and ^p_1 > ··· > ^p_m are the m distinct elements of p [17, 18]. Also, we denote by I_A ∈ {0, 1}^n the characteristic vector of a subset A ⊆ N, i.e., I_A = Σ_{i∈A} e_i, where e_i is the i-th unit vector. Note that, through the definition of the characteristic vector, any subset A ⊆ N is in one-to-one correspondence with a vertex of the n-dimensional cube D := {x ∈ R^n : 0 ≤ x_i ≤ 1 (i = 1, ..., n)}. Finally, we denote by (A, t)(T) the set of all pairs of a subset and a real value whose corresponding vectors (I_A, t) lie inside or on the surface of a polytope T ⊂ R^{n+1}.

2 The D.S. Programming Problem and its Applications

Let f and g be submodular functions.
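To make the Lovász extension above concrete, here is a minimal Python sketch of its sorted-level evaluation (an illustration with our own naming, not code from the paper; f is any set function on subsets of {0, ..., n−1}, and p is assumed to have nonnegative entries, as in the characteristic-vector setting used here):

```python
def lovasz_extension(f, p):
    """Evaluate the Lovasz extension f_hat at p, using
    f_hat(p) = sum_{j<m} (p_(j) - p_(j+1)) f(U_j) + p_(m) f(U_m),
    where U_j collects the indices of the j largest distinct levels of p."""
    order = sorted(range(len(p)), key=lambda i: -p[i])  # indices by decreasing p_i
    value, U = 0.0, set()
    for j, i in enumerate(order):
        U.add(i)  # grow the current level set U_j
        nxt = p[order[j + 1]] if j + 1 < len(order) else 0.0
        if j + 1 < len(order) and p[i] == nxt:
            continue  # same distinct level: keep accumulating U before weighting
        value += (p[i] - nxt) * f(frozenset(U))
    return value
```

On a characteristic vector I_A this reduces to f(A), consistent with the extension property; e.g. `lovasz_extension(len, [1, 0, 1])` evaluates the cardinality function at I_{{0,2}} and returns 2.0.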
In this paper, we address an exact algorithm to solve the D.S. programming problem, i.e., the problem of minimizing the difference of two submodular functions:

min_{A⊆N} f(A) − g(A).   (1)

As is well known, any real-valued function whose second partial derivatives are continuous everywhere can be represented as the difference of two convex functions [12]. Analogously, Problem (1) generalizes any set-function optimization problem. Problem (1) covers a broad range of applications in machine learning [21, 24, 3, 1]. Here, we give a few examples.

Feature selection using structured-sparsity-inducing norms: Sparse methods for supervised learning, where we aim at finding good predictors from as few variables as possible, have attracted much interest from the machine learning community. This combinatorial problem is known to be a submodular maximization problem with a cardinality constraint for commonly used measures such as least-squared errors [4, 14]. As is well known, if we replace the cardinality function with its convex envelope, such as the l1-norm, this can be turned into a convex optimization problem. Recently, it has been reported that submodular functions in place of the cardinality give a wider family of polyhedral norms and may incorporate prior knowledge or structural constraints in sparse methods [1]. Then, the objective (to be minimized) becomes the sum of a loss function (often supermodular) and a submodular regularization term.

Discriminative structure learning: It has been reported that a discriminatively structured Bayesian classifier often outperforms a generatively structured one [21, 22]. One commonly used metric for discriminative structure learning is the EAR (explaining away residual) [2].
EAR is defined as the difference between the conditional mutual information of two variables given the class C and their unconditional mutual information, i.e., I(X_i; X_j | C) − I(X_i; X_j). In structure learning, we repeatedly try to find a subset of variables that minimizes this kind of measure. Since the (symmetric) mutual information is a submodular function, this problem obviously leads to the D.S. programming problem [21].

Energy minimization in computer vision: In computer vision, an image is often modeled with a Markov random field, where each node represents a pixel. Let G = (V, E) be the undirected graph, where a label x_s ∈ L is assigned to each node. Then, many tasks in computer vision can be naturally formulated in terms of energy minimization, where the energy function has the form E(x) = Σ_{p∈V} θ_p(x_p) + Σ_{(p,q)∈E} θ_{pq}(x_p, x_q), where θ_p and θ_{pq} are univariate and pairwise potentials. For a pairwise potential with binarized energy (i.e., L = {0, 1}), submodularity is defined as θ_{pq}(1, 1) + θ_{pq}(0, 0) ≤ θ_{pq}(1, 0) + θ_{pq}(0, 1) (see, for example, [26]). Based on this, any energy function in computer vision can be written with a submodular function E1(x) and a supermodular function E2(x) as E(x) = E1(x) + E2(x) (e.g., [24]). Moreover, in the case of binarized energy, even if such an explicit decomposition is not known, a (non-unique) decomposition into submodular and supermodular functions can always be given [25].

3 Prismatic Algorithm for the D.S. Programming Problem

By introducing an additional variable t (∈ R), Problem (1) can be converted into the equivalent problem with a supermodular objective function and a submodular feasible set, i.e.,

min_{A⊆N, t∈R} t − g(A)   s.t.
f(A) − t ≤ 0.   (2)

Obviously, if (A*, t*) is an optimal solution of Problem (2), then A* is an optimal solution of Problem (1) and t* = f(A*). The proposed algorithm is a realization of the branch-and-bound scheme which responds to this specific structure of the problem.

To this end, we first define a prism T(S) ⊂ R^{n+1} by

T = {(x, t) ∈ R^n × R : x ∈ S},

where S is an n-simplex. S is obtained from the n-dimensional cube D at the initial iteration (as described in Section 4.1), or by the subdivision operation described in the later part of this section (the details are given in Section 4.2). The prism T has n + 1 edges that are vertical lines (i.e., lines parallel to the t-axis), which pass through the n + 1 vertices of S, respectively [11].

Figure 1: Illustration of the prismatic algorithm.

Our algorithm is an iterative procedure which mainly consists of two parts, branching and bounding, as in other branch-and-bound frameworks [13]. In branching, subproblems are constructed by dividing the feasible region of a parent problem. In bounding, we judge whether an optimal solution exists in the region of a subproblem and its descendants by calculating a lower bound for the subproblem and comparing it with an upper bound for the original problem. Some more details
Some more details\nfor branching and bounding are described as follows.\nBranching: The branching operation in our method is carried out using the property of a simplex.\nThat is, since, in a n-simplex, any r + 1 vertices are not on a r \u2212 1-dimensional hyperplane for\ni=1 Si, where p \u2265 2 and Si are n-simplices such\nr \u2264 n, any n-simplex can be divided as S =\nthat each pair of simplices Si; Sj(i \u0338= j) intersects at most in common boundary points (the way of\ni=1 Ti, where Ti = {(x; t) \u2208\nconstructing such partition is explained in Section 4.2). Then, T =\nRn \u00d7 R : x \u2208 Si}, is a natural prismatic partition of T induced by the above simplical partition.\nBounding: For the bounding operation on Sk (resp., Tk) at the iteration k, we consider a polyhe-\ndral convex set Pk such that Pk \u2283 ~D, where ~D = {(x; t) \u2208 Rn \u00d7 R : x \u2208 D; ^f (x) \u2264 t} is the\nregion corresponding to the feasible set of Problem (2). At the \ufb01rst iteration, such P is obtained as\nwhere ~t is a real number satisfying ~t \u2264 min{f (A) : A \u2286 N}. Here, ~t can be determined by using\nsome existing submodular minimization solver [23, 8]. Or, at later iterations, more re\ufb01ned Pk, such\nthat P0 \u2283 P1 \u2283 \u00b7\u00b7\u00b7 \u2283 ~D, is constructed as described in Section 4.4.\nAs described in Section 4.3, a lower bound (cid:12)(Tk) of t \u2212 g(A) on the current prism Tk can be\ncalculated through the binary-integer linear programming (BILP) (or the linear programming (LP))\nusing Pk, obtained as described above. Let (cid:11) be the lowest function value (i.e., an upper bound of\nt \u2212 g(A) on ~D) found so far. Then, if (cid:12)(Tk) \u2265 (cid:11), we can conclude that there is no feasible solution\nwhich gives a function value better than (cid:11) and can remove Tk without loss of optimality.\nThe pseudo-code of the proposed algorithm is described in Algorithm 1. 
In the following section, we explain the details of the operations involved in this algorithm.

4 Basic Operations

Obviously, the procedure described in Section 3 involves the following basic operations:

1. Construction of the first prism: a prism needs to be constructed from a hypercube at first.
2. Subdivision process: a prism is divided into a finite number of sub-prisms at each iteration.
3. Bound estimation: for each prism generated throughout the algorithm, a lower bound for the objective function t − g(A) over the part of the feasible set contained in this prism is computed.
4. Construction of cutting planes: throughout the algorithm, a sequence of polyhedral convex sets P0, P1, ... is constructed such that P0 ⊃ P1 ⊃ ··· ⊃ ~D; each set Pj is generated by a cutting plane that cuts off a part of Pj−1.
5. Deletion of non-optimal prisms: at each iteration, we try to delete prisms that contain no feasible solution better than the one obtained so far.

1   Construct a simplex S0 ⊃ D, its corresponding prism T0 and a polyhedral convex set P0 ⊃ ~D.
2   Let α0 be the best objective function value known in advance. Then, solve the BILP (5) corresponding to α0 and T0, and let β0 = β(T0, P0, α0) and (¯A0, ¯t0) be the point satisfying β0 = ¯t0 − g(¯A0).
3   Set R0 ← T0.
4   while Rk ≠ ∅
5     Select a prism T*k ∈ Rk satisfying βk = β(T*k), and let (¯vk, ¯tk) ∈ T*k.
6     if (¯vk, ¯tk) ∈ ~D then
7       Set Pk+1 = Pk.
8     else
9       Construct lk(x, t) according to (8), and set Pk+1 = {(x, t) ∈ Pk : lk(x, t) ≤ 0}.
10    Subdivide T*k = T(S*k) into a finite number of subprisms Tk,j (j ∈ Jk) (cf. Section 4.2).
11    For each j ∈ Jk, solve the BILP (5) with respect to Tk,j, Pk+1 and αk.
12    Delete all Tk,j (j ∈ Jk) satisfying (DR1) or (DR2). Let Mk denote the collection of remaining prisms Tk,j (j ∈ Jk), and for each T ∈ Mk set β(T) = max{β(T*k), β(T, Pk+1, αk)}.
13    Let Fk be the set of new feasible points detected while solving the BILPs in Step 11, and set αk+1 = min{αk, min{t − g(A) : (A, t) ∈ Fk}}.
14    Delete all T ∈ Mk satisfying β(T) ≥ αk+1, and let Rk be (Rk−1 \ T*k) ∪ Mk.
15    Set βk+1 ← min{β(T) : T ∈ Mk} and k ← k + 1.

Algorithm 1: Pseudo-code of the prismatic algorithm for the D.S. programming problem.

4.1 Construction of the first prism

The initial simplex S0 ⊃ D (which yields the initial prism T0 ⊃ ~D) can be constructed as follows. Let v and Av be a vertex of D and its corresponding subset of N, respectively, i.e., v = Σ_{i∈Av} e_i. Then, the initial simplex S0 ⊃ D can be constructed by

S0 = {x ∈ R^n : x_i ≤ 1 (i ∈ Av), x_i ≥ 0 (i ∈ N \ Av), aᵀx ≤ γ},

where a = Σ_{i∈N\Av} e_i − Σ_{i∈Av} e_i and γ = |N \ Av|.
The n + 1 vertices of S0 are v and the n points where the hyperplane {x ∈ R^n : aᵀx = γ} intersects the edges of the cone {x ∈ R^n : x_i ≤ 1 (i ∈ Av), x_i ≥ 0 (i ∈ N \ Av)}. Note that this is just one option; any n-simplex S ⊃ D is available.

4.2 Subdivision of a prism

Let Sk and Tk be the simplex and prism at the k-th iteration of the algorithm, respectively. We denote Sk as Sk = [v_k^1, ..., v_k^{n+1}] := conv{v_k^1, ..., v_k^{n+1}}, the convex hull of its vertices v_k^1, ..., v_k^{n+1}. Then, any r ∈ Sk can be represented as

r = Σ_{i=1}^{n+1} λ_i v_k^i,   Σ_{i=1}^{n+1} λ_i = 1,   λ_i ≥ 0 (i = 1, ..., n + 1).

Suppose that r ≠ v_k^i (i = 1, ..., n + 1). For each i satisfying λ_i > 0, let S_k^i be the subsimplex of Sk defined by

S_k^i = [v_k^1, ..., v_k^{i−1}, r, v_k^{i+1}, ..., v_k^{n+1}].   (3)

Then, the collection {S_k^i : λ_i > 0} defines a partition of Sk, i.e., we have ∪_{λ_i>0} S_k^i = Sk, int S_k^i ∩ int S_k^j = ∅ for i ≠ j [12]. In a natural way, the prisms T(S_k^i) generated by the simplices S_k^i defined in Eq. (3) form a partition of Tk. This subdivision process of prisms is exhaustive, i.e., for every nested (decreasing) sequence of prisms {Tq} generated by this process, we have ∩_{q=0}^∞ Tq = τ, where τ is a line perpendicular to R^n (a vertical line) [11]. Although several subdivision processes can be applied, we use a classical bisection one, i.e., each simplex is divided into subsimplices by choosing in Eq.
(3) the point

r = (v_k^{i1} + v_k^{i2})/2,

where ‖v_k^{i1} − v_k^{i2}‖ = max{‖v_k^i − v_k^j‖ : i, j ∈ {0, ..., n}, i ≠ j} (see Figure 1).

4.3 Lower bounds

Again, let Sk and Tk be the simplex and prism at the k-th iteration of the algorithm, respectively. And, let α be an upper bound of t − g(A), which is the smallest value of t − g(A) attained at a feasible point known so far in the algorithm. Moreover, let Pk be a polyhedral convex set which contains ~D and is represented as

Pk = {(x, t) ∈ R^n × R : A_k x + a_k t ≤ b_k},   (4)

where A_k is a real (m × n)-matrix and a_k, b_k ∈ R^m.¹ Now, a lower bound β(Tk, Pk, α) of t − g(A) over Tk ∩ ~D can be computed as follows.

First, let v_k^i (i = 1, ..., n + 1) denote the vertices of Sk, and define I(Sk) = {i ∈ {1, ..., n + 1} : v_k^i ∈ B^n} and

μ = min{α, min{^f(v_k^i) − ^g(v_k^i) : i ∈ I(Sk)}},  if I(Sk) ≠ ∅;
μ = α,                                               if I(Sk) = ∅.

For each i = 1, ..., n + 1, consider the point (v_k^i, t_k^i) where the edge of Tk passing through v_k^i intersects the level set {(x, t) : t − ^g(x) = μ}, i.e.,

t_k^i = ^g(v_k^i) + μ   (i = 1, ..., n + 1).

Then, let us denote the uniquely defined hyperplane through the points (v_k^i, t_k^i) by H = {(x, t) ∈ R^n × R : pᵀx − t = γ}, where p ∈ R^n and γ ∈ R.
Consider the upper and lower halfspaces generated by H, i.e., H⁺ = {(x, t) ∈ R^n × R : pᵀx − t ≤ γ} and H⁻ = {(x, t) ∈ R^n × R : pᵀx − t ≥ γ}. If Tk ∩ ~D ⊆ H⁺, then we see from the supermodularity of g(A) (the concavity of ^g(x)) that

min{t − g(A) : (A, t) ∈ (A, t)(Tk ∩ ~D)} ≥ min{t − g(A) : (A, t) ∈ (A, t)(Tk ∩ H⁺)}
                                         ≥ min{t − ^g(x) : (x, t) ∈ Tk ∩ H⁺}
                                         = t_k^i − ^g(v_k^i) (i = 1, ..., n + 1) = μ.

Otherwise, we shift the hyperplane H (downward with respect to t) until it reaches a point z = (x*, t*) (∈ Tk ∩ Pk ∩ H⁻, x* ∈ B^n), where (x*, t*) is a point with the largest distance to H and, since x* ∈ B^n, the corresponding pair (A, t) is in (A, t)(Tk ∩ Pk ∩ H⁻). Let ¯H denote the resulting supporting hyperplane, and denote by ¯H⁺ the upper halfspace generated by ¯H. Moreover, for each i = 1, ..., n + 1, let z^i = (v_k^i, ¯t_k^i) be the point where the edge of Tk passing through v_k^i intersects ¯H. Then, it follows that (A, t)(Tk ∩ ~D) ⊂ (A, t)(Tk ∩ Pk) ⊂ (A, t)(Tk ∩ ¯H⁺), and hence

min{t − g(A) : (A, t) ∈ (A, t)(Tk ∩ ~D)} ≥ min{t − g(A) : (A, t) ∈ (A, t)(Tk ∩ ¯H⁺)} = min{¯t_k^i − ^g(v_k^i) : i = 1, ..., n + 1}.

Now, the above consideration leads to the following BILP in (λ, x, t):

max_{λ,x,t}  Σ_{i=1}^{n+1} t_k^i λ_i − t
s.t.  A_k x + a_k t ≤ b_k,  x = Σ_{i=1}^{n+1} λ_i v_k^i,  x ∈ B^n,
      Σ_{i=1}^{n+1} λ_i = 1,  λ_i ≥ 0 (i = 1, ..., n + 1),   (5)

where A_k, a_k and b_k are given in Eq. (4).

Proposition 1. (a) If the system (5) has no solution, then the intersection (A, t)(Tk ∩ ~D) is empty.
(b) Otherwise, let (λ*, x*, t*) be an optimal solution of BILP (5) and c* its optimal value. Then, the following statements hold:
(b1) If c* ≤ 0, then (A, t)(Tk ∩ ~D) ⊂ (A, t)(H⁺).
(b2) If c* > 0, then z = (Σ_{i=1}^{n+1} λ_i* v_k^i, Σ_{i=1}^{n+1} t_k^i λ_i* − c*) and z^i = (v_k^i, ¯t_k^i) with ¯t_k^i = t_k^i − c* = ^g(v_k^i) + μ − c* (i = 1, ..., n + 1).

Proof. First, we prove part (a). Since every point in Sk is uniquely representable as x = Σ_{i=1}^{n+1} λ_i v_k^i, we see from Eq. (4) that the set (A, t)(Tk ∩ Pk) coincides with the feasible set of problem (5). Therefore, if the system (5) has no solution, then (A, t)(Tk ∩ Pk) = ∅, and hence (A, t)(Tk ∩ ~D) = ∅ (because ~D ⊂ Pk). Next, we move to part (b). Since the equation of H is pᵀx − t = γ, determining the hyperplane ¯H and the point z amounts to solving the binary-integer linear programming problem

max pᵀx − t   s.t. (x, t) ∈ Tk ∩ Pk, x ∈ B^n.   (6)

Here, we note that the objective of the above can be represented as

pᵀx − t = pᵀ(Σ_{i=1}^{n+1} λ_i v_k^i) − t = Σ_{i=1}^{n+1} λ_i pᵀv_k^i − t.

On the other hand, since (v_k^i, t_k^i) ∈ H, we have pᵀv_k^i − t_k^i = γ (i = 1, ..., n + 1), and hence

pᵀx − t = Σ_{i=1}^{n+1} λ_i (γ + t_k^i) − t = Σ_{i=1}^{n+1} t_k^i λ_i − t + γ.

Thus, the two BILPs (5) and (6) are equivalent. And, if γ* denotes the optimal objective function value in Eq. (6), then γ* = c* + γ. If γ* ≤ γ, then it follows from the definition of H⁺ that ¯H is obtained by a parallel shift of H in the direction of H⁺. Therefore, c* ≤ 0 implies (A, t)(Tk ∩ Pk) ⊂ (A, t)(H⁺), and hence (A, t)(Tk ∩ ~D) ⊂ (A, t)(H⁺).
Since ¯H = {(x, t) ∈ R^n × R : pᵀx − t = γ*} and H = {(x, t) ∈ R^n × R : pᵀx − t = γ}, we see that for each intersection point (v_k^i, ¯t_k^i) (resp. (v_k^i, t_k^i)) of the edge of Tk passing through v_k^i with ¯H (resp. H), we have pᵀv_k^i − ¯t_k^i = γ* and pᵀv_k^i − t_k^i = γ, respectively. This implies that ¯t_k^i = t_k^i + γ − γ* = t_k^i − c*, and (using t_k^i = ^g(v_k^i) + μ) that ¯t_k^i = ^g(v_k^i) + μ − c*.
From the above, we see that, in the case (b1), μ constitutes a lower bound of t − g(A) whereas, in the case (b2), such a lower bound is given by min{¯t_k^i − ^g(v_k^i) : i = 1, ..., n + 1} = μ − c*. Thus, Proposition 1 provides the lower bound

β(Tk, Pk, α) = +∞,     if BILP (5) has no feasible point;
               μ,      if c* ≤ 0;
               μ − c*, if c* > 0.   (7)

As stated in Section 4.5, Tk can be deleted from further consideration when βk = +∞ or μ.

¹Note that Pk is updated at each iteration and does not depend on Sk, as described in Section 4.4.

4.4 Outer approximation

The polyhedral convex set Pk ⊃ ~D used in the preceding section is updated at each iteration, i.e., a sequence P0, P1, ... is constructed such that P0 ⊃ P1 ⊃ ··· ⊃ ~D. The update from Pk to Pk+1 (k = 0, 1, ...) is done in a way which is standard for pure outer approximation methods [12].
That is, a certain linear inequality lk(x, t) ≤ 0 is added to the constraint set defining Pk, i.e., we set

Pk+1 = Pk ∩ {(x, t) ∈ R^n × R : lk(x, t) ≤ 0}.

The function lk(x, t) is constructed as follows. At iteration k, we have a lower bound βk of t − g(A) as defined in Eq. (7), and a point (¯vk, ¯tk) satisfying ¯tk − ^g(¯vk) = βk. We update the outer approximation only in the case (¯vk, ¯tk) ∉ ~D. Then, we can set

lk(x, t) = s_kᵀ[(x, t) − zk] + (^f(x_k*) − t_k*),   (8)

where sk is a subgradient of ^f(x) − t at zk. The subgradient can be calculated as, for example, stated in [9] (see also [7]).

Proposition 2. The hyperplane {(x, t) ∈ R^n × R : lk(x, t) = 0} strictly separates zk from ~D, i.e., lk(zk) > 0 and lk(x, t) ≤ 0 for all (x, t) ∈ ~D.

Proof. Since we assume that zk ∉ ~D, we have lk(zk) = ^f(x_k*) − t_k* > 0.
And, the latter inequality is an immediate consequence of the definition of a subgradient.

4.5 Deletion rules

At each iteration of the algorithm, we try to delete certain subprisms that contain no optimal solution. To this end, we adopt the following two deletion rules:

(DR1) Delete Tk if BILP (5) has no feasible solution.
(DR2) Delete Tk if the optimal value c* of BILP (5) satisfies c* ≤ 0.

The feasibility of these rules can be seen from Proposition 1, as in the D.C. programming problem [11]. That is, (DR1) follows from Proposition 1 because in this case Tk ∩ ~D = ∅, i.e., the prism Tk is infeasible, and (DR2) follows from Proposition 1 and from the definition of μ because the current best feasible solution cannot be improved in Tk.

5 Experimental Results

We first provide illustrations of the proposed algorithm and its solutions on toy examples from feature selection in Section 5.1, and then apply the algorithm to an application of discriminative structure learning using UCI repository data in Section 5.2. The experiments below were run on a 2.8 GHz 64-bit workstation using Matlab and IBM ILOG CPLEX ver. 12.1.

5.1 Application to feature selection

Figure 2: Training errors, test errors and computational time versus λ for the prismatic algorithm and the supermodular-submodular procedure.

p    n    k   exact (PRISM)   SSP            greedy         lasso
120  150  5   1.8e-4 (192.6)  1.9e-4 (0.93)  1.8e-4 (0.45)  1.9e-4 (0.78)
120  150  10  2.0e-4 (262.7)  2.4e-4 (0.81)  2.3e-4 (0.56)  2.4e-4 (0.84)
120  150  20  7.3e-4 (339.2)  7.8e-4 (1.43)  8.3e-4 (0.59)  7.7e-4 (0.91)
120  150  40  1.7e-3 (467.6)  2.1e-3 (1.17)  2.9e-3 (0.63)  1.9e-3 (0.87)

Table 1: Normalized mean-square prediction errors of training and test data by the prismatic algorithm, the supermodular-submodular procedure, the greedy algorithm and the lasso. The numbers in parentheses are computational time in seconds.
We compared the performance and solutions of the proposed prismatic algorithm (PRISM), the supermodular-submodular procedure (SSP) [21], the greedy method and the LASSO. To this end, we generated data as follows: given p, n and k, the design matrix X ∈ R^{n×p} is a matrix of i.i.d. Gaussian components. A feature set J of cardinality k is chosen at random and the weights on the selected features are sampled from a standard multivariate Gaussian distribution. The weights on the other features are 0. We then take y = Xw + n^{−1/2}‖Xw‖_2 ε, where w is the vector of weights on the features and ε is a standard Gaussian vector. In the experiment, we used the trace norm of the submatrix corresponding to J, X_J, i.e., tr(X_Jᵀ X_J)^{1/2}. Thus, our problem is min_{w∈R^p} (1/(2n))‖y − Xw‖_2² + λ · tr(X_Jᵀ X_J)^{1/2}, where J is the support of w. Or equivalently, min_{A⊆V} g(A) + λ · tr(X_Aᵀ X_A)^{1/2}, where g(A) := min_{w_A∈R^{|A|}} ‖y − X_A w_A‖². Since the first term is a supermodular function [4] and the second is a submodular function, this problem is a D.S. programming problem.

First, the graphs in Figure 2 show the training errors, test errors and computational time versus λ for PRISM and SSP (for p = 120, n = 150 and k = 10). The values in the graphs are averaged over 20 datasets. For the test errors, we generated another 100 data points from the same model and applied the estimated model to them. For all methods, we tried several possible regularization parameters. From the graphs, we can see the following. First, exact solutions (by PRISM) always outperform approximate ones (by SSP). This shows the significance of optimizing the submodular norm: we can obtain better solutions (in the sense of prediction error) by optimizing the objective with the submodular norm more exactly.
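For reference, the objective in this experiment can be written down directly; the sketch below evaluates g(A) + λ·tr((X_Aᵀ X_A)^{1/2}) and minimizes it by exhaustive enumeration, which is viable only for tiny p (an illustration of the objective itself, standing in for neither PRISM nor SSP; function names are our own):

```python
import itertools
import numpy as np

def ds_objective(X, y, A, lam):
    """g(A) + lam * tr((X_A^T X_A)^{1/2}): least-squares residual of the
    column submatrix X_A plus lam times its nuclear norm (sum of singular
    values); A is a collection of column indices (possibly empty)."""
    A = sorted(A)
    if not A:
        return float(y @ y)                     # empty support: residual is ||y||^2
    XA = X[:, A]
    w, *_ = np.linalg.lstsq(XA, y, rcond=None)  # g(A) = min_w ||y - X_A w||^2
    r = y - XA @ w
    nuc = np.linalg.svd(XA, compute_uv=False).sum()
    return float(r @ r) + lam * float(nuc)

def brute_force_minimizer(X, y, lam):
    """Exhaustive minimum over all 2^p subsets (tiny p only)."""
    p = X.shape[1]
    subsets = (frozenset(A) for r in range(p + 1)
               for A in itertools.combinations(range(p), r))
    return min(((ds_objective(X, y, A, lam), A) for A in subsets),
               key=lambda t: t[0])
```

With a small λ and noiseless data generated from a sparse weight vector, the minimizing subset recovers the true support, since the residual term dominates missing features and the nuclear-norm term penalizes superfluous ones.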
Data           Attr.  Class  exact (PRISM)  approx. (SSP)  generative
Chess          36     2      96.6 (±0.69)   94.4 (±0.71)   92.3 (±0.79)
German         20     2      70.0 (±0.43)   69.9 (±0.43)   69.1 (±0.49)
Census-income  40     2      73.2 (±0.64)   71.2 (±0.74)   70.3 (±0.74)
Hepatitis      19     2      86.9 (±1.89)   84.3 (±2.31)   84.2 (±2.11)

Table 2: Empirical accuracy of the classifiers in [%] with standard deviation, for the TANs discriminatively learned with PRISM or SSP and generatively learned with a submodular minimization solver.

And, our algorithm took longer especially when λ is smaller. This would be because a smaller λ basically yields a larger subset as the solution. Also, Table 1 shows the normalized mean prediction errors of the prismatic algorithm, the supermodular-submodular procedure, the greedy method and the lasso for several k. The values are averaged over 10 datasets.
This result also seems to show that optimizing the objective with the submodular norm exactly is significant in terms of prediction error.

5.2 Application to discriminative structure learning

Our second application is discriminative structure learning using the UCI machine learning repository.² Here, we used CHESS, GERMAN, CENSUS-INCOME (KDD) and HEPATITIS, which have two classes. The Bayesian network topology used was the tree-augmented naive Bayes (TAN) [22]. We estimated TANs from data in both generative and discriminative manners. To this end, we used the procedure described in [20] with a submodular minimization solver (for the generative case), and the procedure of [21] combined with our prismatic algorithm (PRISM) or the supermodular-submodular procedure (SSP) (for the discriminative case). Once the structures had been estimated, the parameters were learned by the maximum likelihood method.

Table 2 shows the empirical accuracy of the classifiers in [%] with standard deviation for these datasets. We used the train/test scheme described in [6, 22], and removed instances with missing values. The results seem to show that optimizing the EAR measure more exactly can improve the performance of classification (which would mean that the EAR is a significant measure for discriminative structure learning in the sense of classification).

6 Conclusions

In this paper, we proposed a prismatic algorithm for the D.S. programming problem (1), which is the first exact algorithm for this problem and is a branch-and-bound method responding to the structure of this problem. We developed the algorithm based on the analogy with the D.C. programming problem through the continuous relaxation of solution spaces and objective functions with the help of the Lovász extension.
We applied the proposed algorithm to several feature selection and discriminative structure learning settings using artificial and real-world datasets.
The D.S. programming problem addressed in this paper covers a broad range of applications in machine learning. In future work, we will develop specializations of the presented framework that exploit the specific structure of each problem. Also, it would be interesting to investigate extending our method to enumerate solutions, which could make the framework more useful in practice.

Acknowledgments

This research was supported in part by JST PRESTO PROGRAM (Synthesis of Knowledge for Information Oriented Society), JST ERATO PROGRAM (Minato Discrete Structure Manipulation System Project) and KAKENHI (22700147). Also, we are very grateful to the reviewers for helpful comments.

² http://archive.ics.uci.edu/ml/index.html

References
[1] F. Bach. Structured sparsity-inducing norms through submodular functions. In Advances in Neural Information Processing Systems 23, pages 118–126, 2010.
[2] J. A. Bilmes. Dynamic Bayesian multinets. In Proc. of the 16th Conf. on Uncertainty in Artificial Intelligence (UAI'00), pages 38–45, 2000.
[3] T. S. Caetano, J. J. McAuley, L. Cheng, Q. V. Le, and A. J. Smola. Learning graph matching. IEEE Trans. on Pattern Analysis and Machine Intelligence, 31(6):1048–1058, 2009.
[4] A. Das and D. Kempe. Algorithms for subset selection in linear regression. In Proc. of the 40th Annual ACM Symp. on Theory of Computing (STOC'08), pages 45–54, 2008.
[5] J. Edmonds. Submodular functions, matroids, and certain polyhedra. In R. Guy, H. Hanani, N. Sauer, and J. Schönheim, editors, Combinatorial Structures and Their Applications, pages 69–87, 1970.
[6] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29:131–163, 1997.
[7] S. Fujishige. 
Submodular Functions and Optimization. Elsevier, 2nd edition, 2005.
[8] S. Fujishige, T. Hayashi, and S. Isotani. The minimum-norm-point algorithm applied to submodular function minimization and linear programming. Technical report, Research Institute for Mathematical Sciences, Kyoto University, 2006.
[9] E. Hazan and S. Kale. Beyond convexity: online submodular minimization. In Advances in Neural Information Processing Systems 22, pages 700–708, 2009.
[10] S. Hoi, R. Jin, J. Zhu, and M. Lyu. Batch mode active learning and its application to medical image classification. In Proc. of the 23rd Int'l Conf. on Machine Learning (ICML'06), pages 417–424, 2006.
[11] R. Horst, T. Q. Phong, Ng. V. Thoai, and J. de Vries. On solving a D.C. programming problem by a sequence of linear programs. Journal of Global Optimization, 1:183–203, 1991.
[12] R. Horst and H. Tuy. Global Optimization (Deterministic Approaches). Springer, 3rd edition, 1996.
[13] T. Ibaraki. Enumerative approaches to combinatorial optimization. In J.C. Baltzer and A.G. Basel, editors, Annals of Operations Research, volumes 10 and 11. 1987.
[14] Y. Kawahara, K. Nagano, K. Tsuda, and J. A. Bilmes. Submodularity cuts and applications. In Advances in Neural Information Processing Systems 22, pages 916–924. MIT Press, 2009.
[15] A. Krause and V. Cevher. Submodular dictionary selection for sparse representation. In Proc. of the 27th Int'l Conf. on Machine Learning (ICML'10), pages 567–574. Omnipress, 2010.
[16] A. Krause, H. B. McMahan, C. Guestrin, and A. Gupta. Robust submodular observation selection. Journal of Machine Learning Research, 9:2761–2801, 2008.
[17] L. Lovász. Submodular functions and convexity. In A. Bachem, M. Grötschel, and B. Korte, editors, Mathematical Programming – The State of the Art, pages 235–257. 1983.
[18] K. Murota. Discrete Convex Analysis. 
Monographs on Discrete Mathematics and Applications. SIAM, 2003.
[19] K. Nagano, Y. Kawahara, and S. Iwata. Minimum average cost clustering. In Advances in Neural Information Processing Systems 23, pages 1759–1767, 2010.
[20] M. Narasimhan and J. A. Bilmes. PAC-learning bounded tree-width graphical models. In Proc. of the 20th Ann. Conf. on Uncertainty in Artificial Intelligence (UAI'04), pages 410–417, 2004.
[21] M. Narasimhan and J. A. Bilmes. A submodular-supermodular procedure with applications to discriminative structure learning. In Proc. of the 21st Ann. Conf. on Uncertainty in Artificial Intelligence (UAI'05), pages 404–412, 2005.
[22] F. Pernkopf and J. A. Bilmes. Discriminative versus generative parameter and structure learning of Bayesian network classifiers. In Proc. of the 22nd Int'l Conf. on Machine Learning (ICML'05), pages 657–664, 2005.
[23] M. Queyranne. Minimizing symmetric submodular functions. Math. Prog., 82(1):3–12, 1998.
[24] C. Rother, T. Minka, A. Blake, and V. Kolmogorov. Cosegmentation of image pairs by histogram matching: incorporating a global constraint into MRFs. In Proc. of the 2006 IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition (CVPR'06), pages 993–1000, 2006.
[25] A. Shekhovtsov. Supermodular decomposition of structural labeling problem. Control Systems and Computers, 20(1):39–48, 2006.
[26] A. Shekhovtsov, V. Kolmogorov, P. Kohli, V. Hlaváč, C. Rother, and P. Torr. LP-relaxation of binarized energy minimization. Technical Report CTU-CMP-2007-27, Czech Technical University, 2007.
[27] M. Thoma, H. Cheng, A. Gretton, J. Han, H. P. Kriegel, A. J. Smola, L. Song, P. S. Yu, X. Yan, and K. Borgwardt. Near-optimal supervised feature selection among frequent subgraphs. In Proc. of the 2009 SIAM Conf. 
on Data Mining (SDM'09), pages 1076–1087, 2009.