
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Overview: This paper studies the benefits of augmenting the linear programming relaxation of the maximum a posteriori (MAP) inference problem in graphical models with a quadratic term, thereby achieving strong convexity.
Such augmented formulations are obtained both from the original primal and dual formulations, and in each case the resulting primal-dual relationship is studied.
Prior work has mostly focused on smoothing the LP formulation using a softmax/entropy term, with a few notable exceptions, such as [5], [17] and [18].
Unlike those previous approaches, which employ a quadratic term in the subproblems of either a *proximal* or an *alternating direction* scheme, the present manuscript adds the quadratic smoothing term directly.
This can in some way be seen as a naive approach: In comparison to proximal or alternating direction schemes, convergence to the global optimum of the original problem is no longer guaranteed, and the approximation quality directly depends on the strength of the augmentation term.
The aforementioned approaches adjust the strength in an adaptive manner and thus find the global optimum, at least in the limit.
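The dependence on the augmentation strength can be made concrete with a generic bound (illustrative notation only; $f$, $\lambda$, and the feasible set $\mathcal{X}$ are not taken from the paper): since the quadratic penalty is non-negative and bounded on a compact feasible set,

```latex
\min_{x \in \mathcal{X}} f(x)
\;\le\;
\min_{x \in \mathcal{X}} \left[ f(x) + \frac{\lambda}{2}\,\|x\|_2^2 \right]
\;\le\;
\min_{x \in \mathcal{X}} f(x) + \frac{\lambda}{2}\,\max_{x \in \mathcal{X}} \|x\|_2^2,
```

so the smoothed optimum tracks the original LP optimum up to an additive $O(\lambda)$ error, which vanishes only as the augmentation is weakened.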
However, the formulations derived from this direct augmentation are very insightful and reveal several interesting connections to other formulations.
Moreover, the authors demonstrate in several experiments (both synthetic and real-world) that a particular optimization scheme for the primal form of the augmented dual (based on block-coordinate Frank-Wolfe) compares very favorably to other practical schemes if a commensurate strength of the quadratic penalty is chosen.
The Frank-Wolfe method has recently seen numerous useful applications in machine learning, and this is yet another interesting example.
Positive points:
+ The paper is exceptionally well-written and organized. Related work is cited and the authors do an excellent job at connecting the various existing approaches and putting them into perspective.
+ The suggested optimization scheme (FW) seems practical. It is reasonably efficient and has the advantage that only maximization (as opposed to max-marginalization, or even soft-max-marginalization) is needed, which allows for the use of certain structured potentials.
+ The experimental evaluation is unusually thorough for a paper of this type, involving even inference experiments based on real-world learned structured prediction models.
+ The paper seems to be technically correct as far as I can tell.
+ Some of the derivations are novel to my knowledge, and the established optimization scheme employs the recently suggested block-coordinate Frank-Wolfe scheme for simplex-constrained quadratic optimization in a novel context.
+ Table 1 is extremely helpful in connecting the dots.
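To illustrate the kind of subroutine involved, here is a minimal sketch of Frank-Wolfe on a simplex-constrained convex QP (toy problem data; this is a generic conditional-gradient loop with exact line search, not the paper's block-coordinate algorithm):

```python
def frank_wolfe_simplex(Q, c, steps=50):
    # Minimize 0.5 * mu^T Q mu + c^T mu over the probability simplex
    # via conditional gradient (Frank-Wolfe) with exact line search.
    # Q is a symmetric PSD matrix (nested lists), c a vector (list).
    n = len(c)
    mu = [1.0 / n] * n                                  # feasible start: uniform point
    for _ in range(steps):
        grad = [sum(Q[i][j] * mu[j] for j in range(n)) + c[i] for i in range(n)]
        i_best = min(range(n), key=grad.__getitem__)    # linear oracle: best simplex vertex
        d = [-m for m in mu]
        d[i_best] += 1.0                                # direction d = e_best - mu
        dQd = sum(d[i] * Q[i][j] * d[j] for i in range(n) for j in range(n))
        gd = sum(g * di for g, di in zip(grad, d))
        gamma = 1.0 if dQd <= 1e-12 else min(1.0, max(0.0, -gd / dQd))
        mu = [m + gamma * di for m, di in zip(mu, d)]   # stays in the simplex
    return mu
```

Note that the oracle here is just an argmin of the gradient over vertices, which mirrors the "maximization only" property noted above (in the paper's setting, a max over factor assignments).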
Negative points:
- From a convex optimization point of view, the proposed scheme (FW) is less advanced than either of [5], [17], and [18].
- In fact, as the authors point out, their augmented formulation is also not entirely novel, but very closely related to the subproblem in [17] (or the one in [5], if the primal is directly augmented). As such, some of the derivations and bounds in the manuscript also build on material developed in previous work.
A comment: The quadratic programming relaxation suggested by Kumar [4] is also a simplex-constrained convex quadratic program.
It might be interesting to extend the discussion in the paper to highlight which, if any, similarities exist between the soft-constrained primal formulation suggested in this manuscript and Kumar's convex QP relaxation, in particular from an approximation-quality point of view. (Note that by adding a quadratic augmentation term, as suggested in the present manuscript, the approximation also becomes one that differs from the original LP relaxation.)
Q2: Please summarize your review in 1-2 sentences
This is a very well-written paper that makes strong contributions to the literature on approximate MAP inference in graphical models.
The contributions are twofold: first, the manuscript helps consolidate and connect the vast number of existing approaches; and second, a novel MAP inference/optimization scheme is suggested that seems to have a lot of merit in itself.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s).
This paper introduces three different algorithms for discrete MAP inference based on the LP relaxation. By adding small terms to the primal, the dual or both, smoothness, strong convexity, or both can be obtained. Smoothness avoids getting stuck in local optima. Convergence properties and bounds on the original objective function are derived.
The paper is very well-written and well organized. It introduces and characterizes each of the variations of the LP, describing their advantages and shortcomings in great detail and with proper references to the literature.
I like the summary in Table 1. However, when consulted directly, it is unclear whether the smoothness/strong-convexity properties apply to the primal or the dual. For instance, the table assigns the cell corresponding to 4.2 to the category "convex", but the primal is actually "strongly convex".
The experimental section is reasonably complete and shows the advantages and disadvantages of a number of methods and parameterizations, illustrating where the proposed approaches excel or fail relative to the existing literature. The need for MAP inference is pervasive, so this paper is potentially very significant (depending on how well the proposed algorithms perform on a wider array of problems).
Something not commented on in the paper is the range in which the introduced parameters (lambda, gamma) are actually usable, since I assume that setting very small values for these parameters results in accumulated rounding errors.
Q2: Please summarize your review in 1-2 sentences
This work provides variations of the LP relaxation for MAP inference that favor the convergence of different types of optimizers. Good paper with exhaustive references, clear and complete exposition and adequate experimental validation.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s).
Most relaxation-based methods for inference in graphical models work by relaxing inference to convex constrained non-smooth optimization problems, which are then solved, rounded, and used in downstream applications. Since solving convex non-smooth optimization problems is hard (most existing algorithms have either poor convergence rates or poor computational complexity), this paper proposes smoothing the problems by adding quadratic terms to either the primal or the dual formulation. This leads to (some more, some less) natural applications of existing optimization algorithms that are known to have fast convergence rates at acceptable computational complexity. The paper evaluates these new methods on some models, obtaining promising results in terms of the value of the objective function.
The main issue with this paper is that, while the results seem correct (and indeed are quite straightforward to rederive), the main point of doing inference in graphical models is to use the output of inference as part of some downstream task. Most relaxation-based methods make this quite easy, as rounding is relatively straightforward except in some special cases where the relaxation is not tight (and even then approximate rounding is used).
Adding a quadratic term to either the primal or the dual inference objective will almost always lead to fractional solutions instead of integral solutions, making rounding harder, and almost arbitrary. Of course it's possible to do something similar to interior point methods and slowly anneal down the smoothness terms until you effectively have an integral solution, but this is not what this paper does.
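The continuation scheme alluded to here (explicitly *not* what the paper does) could be sketched as follows; `solve_smoothed` is a hypothetical handle to any solver for the smoothed problem:

```python
def anneal_smoothing(solve_smoothed, lam0=1.0, decay=0.5, rounds=10):
    # Interior-point-style continuation: solve the quadratically smoothed
    # problem repeatedly while shrinking the smoothing weight, warm-starting
    # each round from the previous solution. `solve_smoothed(lam, warm)` is
    # a hypothetical solver; `lam0`, `decay`, `rounds` are illustrative.
    lam, mu = lam0, None
    for _ in range(rounds):
        mu = solve_smoothed(lam, mu)   # re-solve with the current weight
        lam *= decay                   # weaker smoothing each round
    return mu
```

As the weight shrinks, the smoothed optimum approaches the original LP optimum, so the final iterate is effectively (near-)integral when the relaxation is tight.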
Finally, there is already a relatively popular relaxation-based inference method which adds smoothness: reference [17] in the paper, Martins et al.'s AD3. It would be very interesting to see comparisons between AD3 and this paper's method, as it is a closer comparison.
line 313 "Stochastic coordinate ascent" should read "stochastic dual coordinate ascent"
 Revised after author response
After thinking more about this, reading the author responses, and discussing with the other reviewers, I'm revising my score upwards, as I think that using these algorithms as subroutines in other solvers can be useful, and the primal solutions do seem to be quite good.
Q2: Please summarize your review in 1-2 sentences
The paper has an interesting idea for how to speed up convergence of LP relaxations for inference in graphical models at the expense of having a less useful result.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s).
By adding regularisation to either the dual or the primal LP relaxation problem, strong convexity is achieved, which in turn leads to optimisation algorithms that converge faster. An extensive comparison of various optimisation (/inference) schemes is provided.
Q2: Please summarize your review in 1-2 sentences
MAP inference is an important problem and this work is a significant contribution towards faster LP inference schemes. The sectioning is reasonable and the paper well written (although it could be improved, e.g., the experimental section).
Submitted by Assigned_Reviewer_5
Q1: Comments to author(s).
The authors describe a set of optimization methods for the LP relaxation of MAP, based on various quadratic (and thus strongly convex) smoothing mechanisms in either the dual or primal formulations of the problem.
Strong convexity allows them to give fast asymptotic convergence rates for gradient and accelerated gradient optimization, at the cost of optimizing a slightly different objective function, whose difference they bound from the original.
The paper is very well written, with relatively good coverage of the many existing approaches to this problem and a clear description of how the authors' method differs from them.
The authors' approach seems novel to the best of my knowledge.
The experiments show the approach compares favorably with standard coordinate descent and subgradient descent methods, as well as producing better estimates with simple local decoding.
The paper does have a few weaknesses.
The LP bounds on MAP are very well studied, and most of the concepts used here have been applied previously in slightly different ways.
For example (as they discuss briefly), the authors' approach is very closely related to proximal methods and augmented Lagrangian methods, although they study single-loop, smoothed objectives rather than double-loop forms.
One positive is the careful analysis of various convex & strongly convex smoothing options (e.g., Table 1), which I think gives a useful taxonomy of methods.
However I am not entirely convinced that the methods, or their relative convergence rates, have a significant practical impact.
At a subjective level, they behave much the same way other smooth, globally convergent methods do.
Regarding the improvements in local decoding: this is a fairly simple baseline compared to decoding using search or branch-and-cut methods, or even simple greedy decoding or a small amount of local search.
Overall I think the paper is very well written and would like to see it published.
I gave it a slightly lower score because I did not find the results particularly exciting and, despite the good experimental assessment of the paper's claims, I am not sure whether they will turn out to be practically useful.
Q2: Please summarize your review in 1-2 sentences
A very well written paper on several quadratically smoothed variations on the LP relaxation for MAP.
The quality and clarity are excellent, but it is very closely related to many existing approaches, and might not necessarily have a significant impact in practice.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for the positive feedback.

Reviewers 1, 3 and 6: Connection to proximal methods and ADMM: As mentioned in Section 4, a more accurate solution can be obtained by using our method in the inner loop of a proximal method, ADMM, or an annealing procedure. We believe that this fact actually makes our contribution stronger. First, we show how to solve the problems in the inner loop of these algorithms more efficiently and provide new insights into the resulting problems via primal-dual relationships. Second, we show that in some applications a more relaxed objective function is sufficient for obtaining a high-quality solution, but with a significant computational gain by avoiding the outer loop.
Reviewers 4 and 6: Regarding lack of guarantees for a decoded/rounded solution, we would like to point out that due to the inherent hardness of the underlying combinatorial problem, the quality of a rounded solution can only be guaranteed under additional conditions on the problem, for example pairwise factors and certain potentials (see further discussion in [4] and [5]). Although it is interesting to study our approach in these special cases, we choose to keep the presentation as general as possible. We feel that such exploration is beyond the scope of the current paper, and perhaps more appropriate for an extended journal version. We do provide some empirical evidence on the quality of rounded solutions in the paper. In particular, in our experiments (Fig. 2, left and right) we show the value of a decoded primal solution (dashed lines) along with the optimized objective. These results demonstrate that our rounded solutions are very competitive with those obtained by the baseline algorithms.
Reviewer 1: Connection of our approach to the formulation in Kumar et al. [4]: It seems to us that although one of our proposed objective functions (Eq. (8)) is also a QP with local simplex constraints, the formulations are actually quite different. First, the variables in [4] are indicators over singleton assignments, and pairwise marginals are not even represented explicitly, which is somewhat different from our pseudomarginals (mu). Second, the potentials used in the objective function of [4] are modified such that the quadratic term corresponds to pairwise scores, and they also ensure that the matrix is PSD (so the QP is convex). In contrast, in our objective the linear term contains the original potentials (singleton, pairwise, and high-order), and the quadratic term corresponds to violation of the marginalization constraints (which are implicit in [4]). Therefore, it currently seems hard to draw direct connections between these two forms and compare their approximation quality.
Reviewer 2: In Tab. 1, the terminology pertains to the dual objective. We apologize for the confusion and will clarify in a revised version. Regarding the range of useful smoothness/convexity parameters: In our experiments we found that our approach was not very sensitive to the precise values, and any choice in the range 0.1 to 0.0001 gave a reasonable tradeoff between accuracy and runtime. We will add a comment in the revision.
Reviewer 3: Regarding practical impact, we emphasize that among algorithms based on factor maximization (as opposed to more involved oracles discussed, e.g., in [20]), our FW algorithm (Alg. 1) is a significant addition to the limited number of convergent alternatives to subgradient. As some reviewers point out, and as we also demonstrate in the experiments, this may be very important in applications employing certain (high-order) structured potentials, especially since subgradient may be very slow in practice. In particular, we have shown its usefulness in a common vision task (segmentation). Similar potentials are also very common in NLP applications (e.g., parsing).

