{"title": "The Concave-Convex Procedure (CCCP)", "book": "Advances in Neural Information Processing Systems", "page_first": 1033, "page_last": 1040, "abstract": null, "full_text": "The Concave-Convex Procedure (CCCP) \n\nA. L. Yuille and Anand Rangarajan * \nSmith-Kettlewell Eye Research Institute, \n\n2318 Fillmore Street, \n\nSan Francisco, CA 94115, USA. \n\nTel. (415) 345-2144. Fax. (415) 345-8455. \n\nEmail yuille@ski.org \n\n* Prof. Anand Rangarajan. Dept. of CISE, Univ. of Florida Room 301, CSE \nBuilding Gainesville, FL 32611-6120 Phone: (352) 392 1507 Fax: (352) 392 1220 \ne-mail: anand@cise.ufl.edu \n\nAbstract \n\nWe introduce the Concave-Convex procedure (CCCP) which con(cid:173)\nstructs discrete time iterative dynamical systems which are guar(cid:173)\nanteed to monotonically decrease global optimization/energy func(cid:173)\ntions. It can be applied to (almost) any optimization problem and \nmany existing algorithms can be interpreted in terms of CCCP. In \nparticular, we prove relationships to some applications of Legendre \ntransform techniques. We then illustrate CCCP by applications to \nPotts models, linear assignment, EM algorithms, and Generalized \nIterative Scaling (GIS). CCCP can be used both as a new way to \nunderstand existing optimization algorithms and as a procedure for \ngenerating new algorithms. \n\n1 \n\nIntroduction \n\nThere is a lot of interest in designing discrete time dynamical systems for inference \nand learning (see, for example, [10], [3], [7], [13]). \n\nThis paper describes a simple geometrical Concave-Convex procedure (CCCP) for \nconstructing discrete time dynamical systems which can be guaranteed to decrease \nalmost any global optimization/energy function (see technical conditions in sec(cid:173)\ntion (2)). \n\nWe prove that there is a relationship between CCCP and optimization techniques \nbased on introducing auxiliary variables using Legendre transforms. 
We distinguish between Legendre min-max and Legendre minimization. In the former, see [6], the introduction of auxiliary variables converts the problem to a min-max problem where the goal is to find a saddle point. By contrast, in Legendre minimization, see [8], the problem remains a minimization one (and so it becomes easier to analyze convergence). CCCP relates to Legendre minimization only and gives a geometrical perspective which complements the algebraic manipulations presented in [8]. \n\nCCCP can be used both as a new way to understand existing optimization algorithms and as a procedure for generating new algorithms. We illustrate this by giving examples from Potts models, EM, linear assignment, and Generalized Iterative Scaling. Recently, CCCP has also been used to construct algorithms to minimize the Bethe/Kikuchi free energy [13]. \n\nWe introduce CCCP in section (2) and relate it to Legendre transforms in section (3). Then we give examples in section (4). \n\n2 The Concave-Convex Procedure (CCCP) \n\nThe key results of CCCP are summarized by Theorems 1, 2, and 3. \n\nTheorem 1 shows that any function, subject to weak conditions, can be expressed as the sum of a convex and a concave part (this decomposition is not unique). This implies that CCCP can be applied to (almost) any optimization problem. \n\nTheorem 1. Let E(x) be an energy function with bounded Hessian ∂²E(x)/∂x∂x. Then we can always decompose it into the sum of a convex function and a concave function. \n\nProof. Select any convex function F(x) with positive definite Hessian whose eigenvalues are bounded below by ε > 0. Then there exists a positive constant λ such that the Hessian of E(x) + λF(x) is positive definite and hence E(x) + λF(x) is convex. Hence we can express E(x) as the sum of a convex part, E(x) + λF(x), and a concave part, -λF(x). \n\nFigure 1: Decomposing a function into convex and concave parts. 
The original function (Left Panel) can be expressed as the sum of a convex function (Centre Panel) and a concave function (Right Panel). (Figure courtesy of James M. Coughlan). \n\nOur main result is given by Theorem 2, which defines the CCCP procedure and proves that it converges to a minimum or saddle point of the energy. \n\nTheorem 2. Consider an energy function E(x) (bounded below) of form E(x) = E_vex(x) + E_cave(x), where E_vex(x), E_cave(x) are convex and concave functions of x respectively. Then the discrete iterative CCCP algorithm x^t → x^{t+1} given by: \n\n∇E_vex(x^{t+1}) = -∇E_cave(x^t), (1) \n\nis guaranteed to monotonically decrease the energy E(x) as a function of time and hence to converge to a minimum or saddle point of E(x). \n\nProof. The convexity and concavity of E_vex(.) and E_cave(.) mean that E_vex(x_2) ≥ E_vex(x_1) + (x_2 - x_1)·∇E_vex(x_1) and E_cave(x_4) ≤ E_cave(x_3) + (x_4 - x_3)·∇E_cave(x_3), for all x_1, x_2, x_3, x_4. Now set x_1 = x^{t+1}, x_2 = x^t, x_3 = x^t, x_4 = x^{t+1}. Using the algorithm definition (i.e. ∇E_vex(x^{t+1}) = -∇E_cave(x^t)) we find that E_vex(x^{t+1}) + E_cave(x^{t+1}) ≤ E_vex(x^t) + E_cave(x^t), which proves the claim. \n\nWe can get a graphical illustration of this algorithm by the reformulation shown in figure (2) (suggested by James M. Coughlan). Think of decomposing the energy function E(x) into E_1(x) - E_2(x), where both E_1(x) and E_2(x) are convex. (This is equivalent to decomposing E(x) into a convex term E_1(x) plus a concave term -E_2(x).) The algorithm proceeds by matching points on the two terms which have the same tangents. For an input x_0 we calculate the gradient ∇E_2(x_0) and find the point x_1 such that ∇E_1(x_1) = ∇E_2(x_0). We next determine the point x_2 such that ∇E_1(x_2) = ∇E_2(x_1), and repeat. 
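As a concrete numerical illustration (a toy example of our own, not from the paper), take E(x) = x^4 - 2x^2 with the decomposition E_vex(x) = x^4 and E_cave(x) = -2x^2. The implicit update (1) becomes 4x_{t+1}^3 = 4x_t, i.e. x_{t+1} = x_t^{1/3}, and the energy decreases monotonically to the minimum at x = 1:

```python
# CCCP on E(x) = x**4 - 2*x**2, split as E_vex = x**4 (convex),
# E_cave = -2*x**2 (concave).  The update grad E_vex(x_new) = -grad E_cave(x)
# gives 4*x_new**3 = 4*x, i.e. x_new = cbrt(x).

def energy(x):
    return x**4 - 2 * x**2

def cccp_step(x):
    # Solve 4 * x_new**3 = 4 * x in closed form (real cube root).
    return x ** (1.0 / 3.0) if x >= 0 else -((-x) ** (1.0 / 3.0))

x = 2.0
energies = [energy(x)]
for _ in range(50):
    x = cccp_step(x)
    energies.append(energy(x))

# Monotone decrease, converging to the minimum at x = 1 (E = -1).
assert all(e1 <= e0 + 1e-12 for e0, e1 in zip(energies, energies[1:]))
print(x, energies[-1])
```

The inner assertion checks exactly the monotonicity guaranteed by Theorem 2; the iterates 2, 2^{1/3}, 2^{1/9}, ... approach the stationary point x = 1 geometrically.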
\n\nFigure 2: A CCCP algorithm illustrated for Convex minus Convex. We want to minimize the function in the Left Panel. We decompose it (Right Panel) into a convex part (top curve) minus a convex term (bottom curve). The algorithm iterates by matching points on the two curves which have the same tangent vectors, see text for more details. The algorithm rapidly converges to the solution at x = 5.0. \n\nWe can extend Theorem 2 to allow for linear constraints on the variables x, for example Σ_i c_i^μ x_i = α^μ, where the {c_i^μ}, {α^μ} are constants. This follows directly because properties such as convexity and concavity are preserved when linear constraints are imposed. We can change to new coordinates defined on the hyperplane defined by the linear constraints. Then we apply Theorem 1 in this coordinate system. \n\nObserve that Theorem 2 defines the update as an implicit function of x^{t+1}. In many cases, as we will show, it is possible to solve for x^{t+1} directly. In other cases we may need an algorithm, or inner loop, to determine x^{t+1} from ∇E_vex(x^{t+1}). In these cases we will need the following theorem, where we re-express CCCP in terms of minimizing a time sequence of convex update energy functions E^{t+1}(x^{t+1}) to obtain the updates x^{t+1} (i.e. at the tth iteration of CCCP we need to minimize the energy E^{t+1}(x^{t+1})). We include linear constraints in Theorem 3. \n\nTheorem 3. Let E(x) = E_vex(x) + E_cave(x), where x is required to satisfy the linear constraints Σ_i c_i^μ x_i = α^μ, where the {c_i^μ}, {α^μ} are constants. Then the update rule for x^{t+1} can be formulated as minimizing a time sequence of convex update energy functions E^{t+1}(x^{t+1}): \n\nE^{t+1}(x^{t+1}) = E_vex(x^{t+1}) + x^{t+1}·∇E_cave(x^t) + Σ_μ λ^μ {Σ_i c_i^μ x_i^{t+1} - α^μ}, (2) \n\nwhere the Lagrange parameters {λ^μ} impose the linear constraints. \n\nProof. Direct calculation. 
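When the implicit update has no closed form, Theorem 3 says each CCCP step is itself an unconstrained (or linearly constrained) convex minimization. A minimal sketch of such an inner loop, for the same toy decomposition E_vex(x) = x^4, E_cave(x) = -2x^2 used above (our own illustration): the surrogate gradient grad E_vex(x) + grad E_cave(x_t) is monotone in x, so bisection suffices.

```python
# Each CCCP step minimizes the convex surrogate
#   E_surr(x) = E_vex(x) + x * grad_E_cave(x_t),
# whose unique stationary point solves grad_E_vex(x) = -grad_E_cave(x_t).
# Toy decomposition (not from the paper): E_vex = x**4, E_cave = -2*x**2.

def grad_vex(x):
    return 4 * x**3

def grad_cave(x):
    return -4 * x

def inner_minimize(g_cave, lo=-10.0, hi=10.0, iters=200):
    # Bisection on the surrogate gradient grad_vex(x) + g_cave,
    # which is strictly increasing in x.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad_vex(mid) + g_cave < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = 2.0
for _ in range(60):
    x = inner_minimize(grad_cave(x))

print(x)  # close to the minimum at x = 1
```

In practice the paper suggests conjugate gradient descent (or CCCP itself) for this inner loop; bisection is used here only because the example is one-dimensional.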
\n\nThe convexity of E^{t+1}(x^{t+1}) implies that there is a unique minimum corresponding to x^{t+1}. This means that if an inner loop is needed to calculate x^{t+1} then we can use standard techniques such as conjugate gradient descent (or even CCCP). \n\n3 Legendre Transformations \n\nThe Legendre transform can be used to reformulate optimization problems by introducing auxiliary variables [6]. The idea is that some of the formulations may be more effective (and computationally cheaper) than others. We will concentrate on Legendre minimization, see [7] and [8], instead of the Legendre min-max emphasized in [6]. An advantage of Legendre minimization is that mathematical convergence proofs can be given. (For example, [8] proved convergence results for the algorithm implemented in [7].) \n\nIn Theorem 4 we show that Legendre minimization algorithms are equivalent to CCCP. The CCCP viewpoint emphasizes the geometry of the approach and complements the algebraic manipulations given in [8]. (Moreover, our results of the previous section show the generality of CCCP while, by contrast, the Legendre transform methods have been applied only on a case by case basis.) \n\nDefinition 1. Let F(x) be a convex function. For each value y let F*(y) = min_x{F(x) + y·x}. Then F*(y) is concave and is the Legendre transform of F(x). Moreover, F(x) = max_y{F*(y) - y·x}. \n\nProperty 1. F(.) and F*(.) are related by ∂F*/∂y (y) = {∂F/∂x}^{-1}(-y) and -∂F/∂x (x) = {∂F*/∂y}^{-1}(x). (By {∂F*/∂y}^{-1}(x) we mean the value y such that ∂F*/∂y (y) = x.) \n\nTheorem 4. Let E_1(x) = f(x) + g(x) and E_2(x, y) = f(x) + x·y + h(y), where f(.), h(.) are convex functions and g(.) is concave. Then applying CCCP to E_1(x) is equivalent to minimizing E_2(x, y) with respect to x and y alternately (for suitable choices of g(.) and h(.)). \n\nProof. We can write E_1(x) = f(x) + min_y{g*(y) + x·y}, where g*(.) is the Legendre transform of g(.) (identify g(.) 
with F*(.) and g*(.) with F(.) in Definition 1). Thus minimizing E_1(x) with respect to x is equivalent to minimizing E_2(x, y) = f(x) + x·y + g*(y) with respect to x and y. (Equivalently, setting h(y) = g*(y) in the expression for E_2(x, y) and minimizing over y recovers the cost function E_1(x) = f(x) + g(x).) Alternating minimization over x and y gives: (i) ∂f/∂x = -y, to determine x_{t+1} in terms of y_t, and (ii) ∂g*/∂y = -x, to determine y_t in terms of x_t, which, by Property 1 of the Legendre transform, is equivalent to setting y = ∂g/∂x. Combining these two stages gives CCCP: \n\n∂f/∂x (x_{t+1}) = -∂g/∂x (x_t). \n\n4 Examples of CCCP \n\nWe now illustrate CCCP by giving four examples: (i) discrete time dynamical systems for the mean field Potts model, (ii) an EM algorithm for the elastic net, (iii) a discrete (Sinkhorn) algorithm for solving the linear assignment problem, and (iv) the Generalized Iterative Scaling (GIS) algorithm for parameter estimation. \n\nExample 1. Discrete Time Dynamical Systems for the Mean Field Potts Model. These attempt to minimize discrete energy functions of form E[V] = Σ_{i,j,a,b} T_{ijab} V_{ia} V_{jb} + Σ_{ia} θ_{ia} V_{ia}, where the {V_{ia}} take discrete values {0, 1} with linear constraints Σ_a V_{ia} = 1, ∀ i. \n\nDiscussion. Mean field algorithms minimize a continuous effective energy E_eff[S; T] to obtain a minimum of the discrete energy E[V] in the limit as T → 0. The {S_{ia}} are continuous variables in the range [0, 1] and correspond to (approximate) estimates of the mean states of the {V_{ia}}. As described in [12], to ensure that the minima of E[V] and E_eff[S; T] all coincide (as T → 0) it is sufficient that T_{ijab} be negative definite. Moreover, this can be attained by adding a term -K Σ_{ia} V_{ia}² to E[V] (for sufficiently large K) without altering the structure of the minima of E[V]. Hence, without loss of generality, we can consider Σ_{i,j,a,b} T_{ijab} V_{ia} V_{jb} to be a concave function. 
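The resulting constrained CCCP update for such effective energies, derived below as equation (4), is a softmax across the states of each site. A small numerical sketch with synthetic couplings (our own construction, not from the paper): the matrix T is made symmetric negative definite, as the discussion requires, so the quadratic-plus-linear part is concave and the entropy part is convex, and the energy is guaranteed to decrease at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites, n_states = 6, 4
N = n_sites * n_states  # flattened index k = (i, a)

# Symmetric negative-definite couplings, linear terms theta, temperature temp.
A = rng.normal(size=(N, N))
T = -(A @ A.T) / N - 0.1 * np.eye(N)
theta = rng.normal(size=N)
temp = 1.0

def energy(S):
    # quadratic (concave) + linear + entropy (convex) terms
    return S @ T @ S + theta @ S + temp * np.sum(S * np.log(S))

S = np.full(N, 1.0 / n_states)  # uniform start; each site sums to 1
prev = energy(S)
for _ in range(100):
    # CCCP step: softmax of -(2*T@S + theta)/temp within each site
    logits = (-1.0 / temp) * (2 * T @ S + theta)
    Z = logits.reshape(n_sites, n_states)
    Z = np.exp(Z - Z.max(axis=1, keepdims=True))  # stabilized softmax
    S = (Z / Z.sum(axis=1, keepdims=True)).ravel()
    e = energy(S)
    assert e <= prev + 1e-9  # monotone decrease guaranteed by Theorem 2
    prev = e

print(np.round(S.reshape(n_sites, n_states), 3))
```

The per-site normalization plays the role of solving for the Lagrange multipliers; the in-loop assertion checks the monotonicity guarantee numerically.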
\n\nWe impose the linear constraints by adding a Lagrange multiplier term Σ_i p_i {Σ_a S_{ia} - 1} to the energy, where the {p_i} are the Lagrange multipliers. The effective energy becomes: \n\nE_eff[S] = Σ_{i,j,a,b} T_{ijab} S_{ia} S_{jb} + Σ_{ia} θ_{ia} S_{ia} + T Σ_{ia} S_{ia} log S_{ia} + Σ_i p_i {Σ_a S_{ia} - 1}. (3) \n\nWe can then incorporate the Lagrange multiplier term into the convex part. This gives: E_vex[S] = T Σ_{ia} S_{ia} log S_{ia} + Σ_i p_i {Σ_a S_{ia} - 1} and E_cave[S] = Σ_{i,j,a,b} T_{ijab} S_{ia} S_{jb} + Σ_{ia} θ_{ia} S_{ia}. Taking derivatives yields: ∂E_vex/∂S_{ia} [S] = T{1 + log S_{ia}} + p_i and ∂E_cave/∂S_{ia} [S] = 2 Σ_{j,b} T_{ijab} S_{jb} + θ_{ia}. Applying CCCP by setting ∂E_vex/∂S_{ia} (S^{t+1}) = -∂E_cave/∂S_{ia} (S^t) gives T{1 + log S_{ia}(t+1)} + p_i = -2 Σ_{j,b} T_{ijab} S_{jb}(t) - θ_{ia}. We solve for the Lagrange multipliers {p_i} by imposing the constraints Σ_a S_{ia}(t+1) = 1, ∀ i. This gives a discrete update rule: \n\nS_{ia}(t+1) = e^{(-1/T){2 Σ_{j,b} T_{ijab} S_{jb}(t) + θ_{ia}}} / Σ_c e^{(-1/T){2 Σ_{j,b} T_{ijcb} S_{jb}(t) + θ_{ic}}}. (4) \n\nAlgorithms of this type were derived in [10], [3] using different design principles. \n\nOur second example relates to the ubiquitous EM algorithm. In general EM and CCCP give different algorithms but in some cases they are identical. The EM algorithm seeks to estimate a variable f* = argmax_f log Σ_{l} P(f, l), where {f}, {l} are variables that depend on the specific problem formulation. It was shown in [4] that this is equivalent to minimizing the following effective energy with respect to the variables f and P(l): E_eff[f, P(l)] = -Σ_l P(l) log P(f, l) + Σ_l P(l) log P(l). To apply CCCP to an effective energy like this we need either: (a) to decompose E_eff[f, P(l)] into convex and concave functions of f, P(l), or (b) to eliminate either variable and obtain a convex-concave decomposition in the remaining variable (cf. Theorem 4). We illustrate (b) for the elastic net [2]. (See Yuille and Rangarajan, in preparation, for an illustration of (a).) \n\nExample 2. 
The elastic net attempts to solve the Travelling Salesman Problem (TSP) by finding the shortest tour through a set of cities at positions {x_i}. The elastic net is represented by a set of nodes at positions {y_a} with variables {S_{ia}} that determine the correspondence between the cities and the nodes of the net. Let E_eff[S, y] be the effective energy for the elastic net; then the {y_a} variables can be eliminated and the resulting E_S[S] can be minimized using CCCP. (Note that the standard elastic net only enforces the second set of linear constraints.) \n\nDiscussion. The elastic net energy function can be expressed as [11]: \n\nE_eff[S, y] = Σ_{ia} S_{ia} |x_i - y_a|² + γ Σ_{a,b} A_{ab} y_a · y_b + T Σ_{i,a} S_{ia} log S_{ia}, (5) \n\nwhere we impose the conditions Σ_a S_{ia} = 1, ∀ i and Σ_i S_{ia} = 1, ∀ a. \n\nThe EM algorithm can be applied to estimate the {y_a}. Alternatively we can solve for the {y_a} variables to obtain y_b = Σ_{ia} P_{ab} S_{ia} x_i, where {P_{ab}} = {δ_{ab} + 2γ A_{ab}}^{-1}. We substitute this back into E_eff[S, y] to get a new energy E_S[S] given by: \n\nE_S[S] = -Σ_{i,j,a,b} S_{ia} S_{jb} P_{ab} x_i · x_j + T Σ_{i,a} S_{ia} log S_{ia} (plus terms linear in S). (6) \n\nOnce again this is a sum of a concave and a convex part (the first term is concave because of the minus sign and the fact that {P_{ab}} and x_i · x_j are both positive semidefinite). We can now apply CCCP and obtain the standard EM algorithm for this problem. (See Yuille and Rangarajan, in preparation, for more details.) \n\nOur final example is a discrete iterative algorithm to solve the linear assignment problem. This algorithm was reported by Kosowsky and Yuille in [5], where it was also shown to correspond to the well-known Sinkhorn algorithm [9]. We now show that both Kosowsky and Yuille's linear assignment algorithm, and hence Sinkhorn's algorithm, are examples of CCCP (after a change of variables). \n\nExample 3. The linear assignment problem seeks to find the permutation matrix {Π_{ia}} which minimizes the energy E[Π] = Σ_{ia} Π_{ia} A_{ia}, where {A_{ia}} is a set of assignment values. 
As shown in [5] this is equivalent to minimizing the (convex) E_P[P] energy given by E_P[P] = Σ_a P_a + (1/β) Σ_i log Σ_a e^{-β(A_{ia} + P_a)}, where the solution is given by Π*_{ia} = e^{-β(A_{ia} + P_a)} / Σ_b e^{-β(A_{ib} + P_b)} rounded off to the nearest integer (for sufficiently large β). The iterative algorithm to minimize E_P[P] (which can be re-expressed as Sinkhorn's algorithm, see [5]) is of form: \n\nP_a^{t+1} = (1/β) log Σ_i { e^{-β A_{ia}} / Σ_b e^{-β(A_{ib} + P_b^t)} }, (7) \n\nand can be re-expressed as CCCP. \n\nDiscussion. By performing the change of coordinates β P_a = -log r_a, ∀ a (for r_a > 0, ∀ a) we can re-express the E_P[P] energy as: \n\nE_r[r] = -(1/β) Σ_a log r_a + (1/β) Σ_i log Σ_a e^{-β A_{ia}} r_a. (8) \n\nObserve that the first term of E_r[r] is convex and the second term is concave (this can be verified by calculating the Hessian). Applying CCCP gives the update rule: \n\n1/r_a^{t+1} = Σ_i e^{-β A_{ia}} / {Σ_b e^{-β A_{ib}} r_b^t}, (9) \n\nwhich corresponds to equation (7). \n\nExample 4. The Generalized Iterative Scaling (GIS) Algorithm [1] for estimating parameters in parallel. \n\nDiscussion. The GIS algorithm is designed to estimate the parameter λ of a distribution P(x; λ) = e^{λ·φ(x)} / Z[λ] so that Σ_x P(x; λ) φ(x) = h, where h are observation data (with components indexed by μ). It is assumed that φ_μ(x) ≥ 0, ∀ μ, x, h_μ ≥ 0, ∀ μ, Σ_μ φ_μ(x) = 1, ∀ x, and Σ_μ h_μ = 1. (All estimation problems of this type can be transformed into this form [1].) \n\nDarroch and Ratcliff [1] prove that the following GIS algorithm is guaranteed to converge to the value λ* that minimizes the (convex) cost function E(λ) = log Z[λ] - λ·h and hence satisfies Σ_x P(x; λ*) φ(x) = h. The GIS algorithm is given by: \n\nλ^{t+1} = λ^t - log h_t + log h, (10) \n\nwhere h_t = Σ_x P(x; λ^t) φ(x) (evaluate log h componentwise: (log h)_μ = log h_μ). To show that GIS can be reformulated as CCCP, we introduce a new variable β = e^λ (componentwise). 
We reformulate the problem in terms of minimizing the cost function E_β[β] = log Z[log β] - h·(log β). A straightforward calculation shows that -h·(log β) is a convex function of β with first derivative -h/β (where the division is componentwise). The first derivative of log Z[log β] is (1/β) Σ_x φ(x) P(x; log β) (evaluated componentwise). To show that log Z[log β] is concave requires computing its Hessian and applying the Cauchy-Schwarz inequality, using the fact that Σ_μ φ_μ(x) = 1, ∀ x and that φ_μ(x) ≥ 0, ∀ μ, x. We can therefore apply CCCP to E_β[β], which yields 1/β^{t+1} = 1/β^t × 1/h × h_t (componentwise), which is GIS (by taking logs and using log β = λ). \n\n5 Conclusion \n\nCCCP is a general principle which can be used to construct discrete time iterative dynamical systems for almost any energy minimization problem. It gives a geometric perspective on Legendre minimization (though not on Legendre min-max). \n\nWe have illustrated that several existing discrete time iterative algorithms can be reinterpreted in terms of CCCP (see Yuille and Rangarajan, in preparation, for other examples). Therefore CCCP gives a novel way of thinking about and classifying existing algorithms. Moreover, CCCP can also be used to construct novel algorithms. See, for example, recent work [13] where CCCP was used to construct a double loop algorithm to minimize the Bethe and Kikuchi free energies (which are generalizations of the mean field free energy). \n\nThere are interesting connections between our results and those known to mathematicians. After this work was completed we found that a result similar to Theorem 2 had appeared in an unpublished technical report by D. Geman. 
There are also similarities to the work of Hoang Tuy, who has shown that any arbitrary closed set is the projection of a difference of two convex sets in a space with one more dimension. (See http://www.mai.liu.se/Opt/MPS/News/tuy.html). \n\nAcknowledgements \n\nWe thank James Coughlan and Yair Weiss for helpful conversations. Max Welling gave useful feedback on this manuscript. We thank the National Institute of Health (NEI) for grant number R01-EY 12691-01. \n\nReferences \n\n[1] J.N. Darroch and D. Ratcliff. \"Generalized Iterative Scaling for Log-Linear Models\". The Annals of Mathematical Statistics. Vol. 43, No. 5, pp 1470-1480. 1972. \n\n[2] R. Durbin, R. Szeliski and A.L. Yuille. \"An Analysis of an Elastic Net Approach to the Traveling Salesman Problem\". Neural Computation. 1, pp 348-358. 1989. \n\n[3] I.M. Elfadel. \"Convex potentials and their conjugates in analog mean-field optimization\". Neural Computation. Vol. 7, No. 5, pp 1079-1104. 1995. \n\n[4] R. Hathaway. \"Another Interpretation of the EM Algorithm for Mixture Distributions\". Statistics and Probability Letters. Vol. 4, pp 53-56. 1986. \n\n[5] J. Kosowsky and A.L. Yuille. \"The Invisible Hand Algorithm: Solving the Assignment Problem with Statistical Physics\". Neural Networks. Vol. 7, No. 3, pp 477-490. 1994. \n\n[6] E. Mjolsness and C. Garrett. \"Algebraic Transformations of Objective Functions\". Neural Networks. Vol. 3, pp 651-669. \n\n[7] A. Rangarajan, S. Gold, and E. Mjolsness. \"A Novel Optimizing Network Architecture with Applications\". Neural Computation. 8(5), pp 1041-1060. 1996. \n\n[8] A. Rangarajan, A.L. Yuille, S. Gold, and E. Mjolsness. \"A Convergence Proof for the Softassign Quadratic Assignment Problem\". In Proceedings of NIPS'96. Denver, Colorado. 1996. \n\n[9] R. Sinkhorn. \"A Relationship Between Arbitrary Positive Matrices and Doubly Stochastic Matrices\". Ann. Math. Statist. 35, pp 876-879. 1964. \n\n[10] F.R. Waugh and R.M. Westervelt. \"Analog neural networks with local competition: I. Dynamics and stability\". Physical Review E. 47(6), pp 4524-4536. 1993. \n\n[11] A.L. Yuille. \"Generalized Deformable Models, Statistical Physics and Matching Problems\". Neural Computation. 2, pp 1-24. 1990. \n\n[12] A.L. Yuille and J.J. Kosowsky. \"Statistical Physics Algorithms that Converge\". Neural Computation. 6, pp 341-356. 1994. \n\n[13] A.L. Yuille. \"A Double-Loop Algorithm to Minimize the Bethe and Kikuchi Free Energies\". Neural Computation. In press. 2002. \n", "award": [], "sourceid": 2125, "authors": [{"given_name": "Alan", "family_name": "Yuille", "institution": null}, {"given_name": "Anand", "family_name": "Rangarajan", "institution": null}]}