
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The approach being taken here is to use copulas to capture dependencies between units which are missed by mean field. This has some strong similarities to higher order mean field techniques, where higher and higher order dependencies between units are captured by additional terms in a mean field expansion. Should include some discussion of this relationship / cite it as another approach for capturing higher order dependencies.
Abstract: make explicit that the problem you are addressing is inference over latent variables.
37: 'corresponding to running only one step of the algorithm' - this doesn't make sense yet here. Expand on it here, or save it until you've given context.
44  I have a strong skepticism of this as a black box method. Copula methods (I believe even the vine technique you use) typically remove dependencies along one axis in state space at a time. This is great if the problem is low dimensional or has a known structure with low order or simple hierarchical interactions. In general though, there will be an exponential number of dependencies between units, and each time state space is rescaled using the CDF along one axis in order to remove one dependency, the rest of the dependencies are distorted and made more complex.
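To make the rescaling in question concrete, here is a minimal sketch (illustrative code, not from the paper under review) of the probability integral transform that copula constructions apply along each axis: the marginals become uniform on [0, 1], while the dependence structure, as measured by Kendall's tau, survives because rank correlations are invariant to monotone per-axis transformations.

```python
import numpy as np
from scipy.stats import norm, kendalltau

rng = np.random.default_rng(0)
rho = 0.8
# Draw correlated bivariate Gaussian samples.
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal([0.0, 0.0], cov, size=10_000)

# Probability integral transform: rescale each axis by its marginal CDF.
# The marginals become uniform on [0, 1]...
u = norm.cdf(z)

# ...but the copula (the dependence) is preserved: Kendall's tau is
# unchanged, since the CDF is strictly increasing on each axis.
tau_z, _ = kendalltau(z[:, 0], z[:, 1])
tau_u, _ = kendalltau(u[:, 0], u[:, 1])
assert abs(tau_z - tau_u) < 1e-9
```

The reviewer's worry is about what happens to the *remaining* dependencies after such a transform is applied one axis at a time in higher dimensions; the sketch above only shows the bivariate base case.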
110: Flip the top row in the figure left-right so that the graph is oriented the same way in each row.
142-145: Confusing. Is rho a 'measured' correlation capturing pairwise dependencies? And in the case of a 2d Gaussian, will this lead to an exact model match?
Section 2.2.1: How is the edge structure chosen for the graph used for the vine? I would imagine this is crucial to performance and application as a black box method.
183: conditioning on D(e). Doesn't the size of the state space you are conditioning on increase exponentially with depth? Is there a function for the dependence? Or a shallow maximum depth?
I think layer T_j is acting on the outputs of layer T_{j+1}, i.e. after the units have already been transformed by T_{j+1}? I found this confusing in the text though.
185: 'rigorizes' -> 'makes rigorous'
238: I think this should be 'minorization-maximization'
251: 'the restriction' - this is not a restriction, just a common choice
270: where did v come from? Is it just a new variable name being used to hold sample output?
Section 3.2: This section could be simplified to the take-home message and gradient formula, with details moved to the appendix.
Figure 3: Use smaller points (and maybe alpha) so that both distributions are visible where they overlap. Or mix the plotting order so the colors don't occur in occluding layers.
Table 1: Wouldn't comparisons to black box variational inference methods be more appropriate here? Or to higher order mean field methods like TAP?
To me all the examples in the experimental section seem very low dimensional. I'm very curious how performance scales with the dimensionality of the state space. Can you say anything about this?
419: That HMC could not be made to sample from this 10-dimensional latent distribution over many hours seems *very* difficult to believe. How did you identify non-convergence? How did you choose the hyperparameters? What implementation did you use?
Q2: Please summarize your review in 1-2 sentences
The paper presents a method to iteratively use copulas to improve mean field (or other structured) approximations to a distribution. The underlying idea seems sensible, and the writing style was good. I was confused or skeptical about the details, and the experiments were not very convincing.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
I like the basic idea of the paper;
using copulas to restore some of the dependencies which are broken in mean field variational Bayes seems like a good idea and the use of vine copulas seems to provide some useful efficiencies in the computations.
However, whether this generalizes well to high dimensions depends very much on whether the large number of parameters describing the dependence structure can be efficiently learnt.
Perhaps some of these details are explained in the supplementary material, but the supplementary material available on CMT is for a completely different paper so I'm unable to say.
The numerical experiments are a bit limited as they stand.
There is a low-dimensional example involving mixtures, and a very brief description of a more interesting example involving a latent space model.
In both cases non-independence copulas are used to describe dependence involving covariance matrix parameters, and I wondered whether the positive definiteness constraint was difficult to handle within the copula framework - is some kind of reparametrization required for that?
If I've understood correctly, in the latent space model a copula is learnt to describe dependencies within each block of latent node attributes, as well as for the mean and covariance matrix of node attributes, so the number of copula parameters learnt here is large.
I'm not sure exactly how the comparisons with LRVB are done;
LRVB after all simply gives an improved estimate of the posterior covariance matrix.
For non-Gaussian variational factors, I wasn't sure precisely how that was used to derive an improved predictive inference that could reasonably be compared with the CVI approach.
So I'm not clear what was done here for the classification of the MNIST data in the two-component mixture example, or how LRVB was used to get a predictive likelihood comparable to CVI for the latent space model.
Q2: Please summarize your review in 1-2 sentences
This paper proposes a way of learning an appropriate dependence structure for variational approximations using copulas.
The methodology seems highly original and quite general but I did find details of some of the numerical experiments a bit unclear.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The typical mean-field approximation made in variational inference results in a well-known range of issues, from poor approximation of the distribution as a whole to poor covariance estimates in particular. This is a pressing problem for variational inference, and the authors aim to address it by introducing copulas into the variational approximating distribution. The authors' experimental results suggest that this may be a promising direction for improving on variational approximations. However, some details of the method are not covered as much as desired, and the experiments seem somewhat limited.
Major points
(1) In order to run their method, the authors need to choose a vine and a copula. While the authors spend a little time discussing that they choose these according to an algorithm, it's not clear how the choice of vine in particular affects the quality of the approximation. Why are some vines better than others? Why are some copulas better than others? Are all possible vines considered? Is that problematic in many-parameter models? How are the possible copulas chosen? Is there any reason to think these choices of copulas are enough?
E.g., on the tree point,
p. 4, line 163-168: Where does the first tree come from? Can it be any tree on the model variables? Does it matter?
p. 5, line 243: What does it mean to learn the tree structure? What is there to learn?
(2) The authors seem to use one model per performance metric: a Gaussian mixture model for variance modeling and a latent state space model for predictive likelihood. (Why not compare both models on both metrics?) This isn't necessarily bad; it could be a starting point for future exploration. But then the authors seem to make somewhat broad conclusions based on one model each. It's also unclear why the authors plot just one particular set of parameters from the Gaussian mixture model in Figure 3. Aren't there other mixture model parameters (like the component means in Eq. 18)? For that matter, what are the hyperparameters in each generative model (the ones used in the experiment)?
p. 8, line 393-396: It seems from Fig. 3 that CVI and LRVB both estimate variances equally well on this problem. LRVB focuses on covariances. Presumably the point of CVI is to get an entire posterior that is better. Table 1 gets at this a little, but it would be nice to see more comparisons on, say, estimates of posterior statistics other than covariances. Or on other models. And why not compare to structured mean-field or other recent methods that go beyond mean-field, for predictive likelihood?
p. 8, line 418: "CVI always outperforms LRVB". This statement seems a bit broad to be warranted by the experiments presented. The authors might reasonably say CVI outperforms LRVB in predictive likelihood (based on Table 1), but does it outperform across other models besides just the one mixture model presented? Or do the authors really mean to say it outperforms LRVB in other ways? It is unclear how the two compare in covariance estimation (besides looking roughly the same in Figure 3).
Minor points
p. 1, line 35-37: This sentence is a little difficult to parse on the first pass. It sounds like the authors are saying it's a generalization but also a special case. Presumably they are just saying that their algorithm iterates between mean field steps and copula steps. Does "structured factorizations" here refer to [19]? If so, it would help the reader to be explicit.
p. 2, line 83: It seems a little unusual of a notational choice to use z for all latent variables; it seems to be usually reserved for indicators
p. 2, line 84: KL is technically not a distance since it is not symmetric
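(The asymmetry is easy to check numerically; the following is an illustrative sketch, not code from the paper.)

```python
import numpy as np
from scipy.stats import entropy

# Two discrete distributions on three outcomes.
p = np.array([0.9, 0.05, 0.05])
q = np.array([1/3, 1/3, 1/3])

# scipy's entropy(p, q) computes KL(p || q) = sum_i p_i log(p_i / q_i).
kl_pq = entropy(p, q)
kl_qp = entropy(q, p)

# The two directions disagree, so KL is not a distance (metric).
assert not np.isclose(kl_pq, kl_qp)
```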
p. 3, line 143: "the bivariate Gaussian distribution with zero mean and Pearson correlation coefficient $\rho$" - Does this distribution have arbitrary strictly positive variance parameters in both directions? (Or are all the variances one, since the authors refer to "the bivariate Gaussian"?)
p. 5, line 247-248: What does this mean; e.g., what are these "future iterations"? Are the authors saying that the choice of tree and copula family is stable across multiple independent runs of the algorithm? If so, what changes from run to run that this choice might be different? If not, what are the iterations? What is the significance of the lack of change described?
p. 7, line 357: "reduce the variance" - The authors might want to use more careful language here to clarify what variance is being reduced, since the word "variance" is being used in a few different ways around this point in the text.
p. 8, Table 1 seems to be missing running time. It's in the text, but it's a little hard to parse the comparison in paragraph form. Is there any reason not to put it in the table?
p. 8, line 389: "consistent estimates" - This seems to be a poor use of the word "consistent" in the statistical sense (there's no limit as the number of data points becomes infinite). Perhaps "accurate" is the desired word.
p. 8, line 412: This paragraph is a little hard to read. It sounds like CVI took hours, but then the authors say it's better in runtime. Perhaps this could be clarified.
--- After Author Feedback ---
The author feedback seemed on point; I look forward to reading more detail about the vine methodology and a broader range of experiments in the final version.
Q2: Please summarize your review in 1-2 sentences
The authors propose a more accurate approximating set of distributions for variational inference than the mean-field assumption allows, by introducing copulas. While their experiments are suggestive of a promising method, explaining the inference of vines and copulas and using a wider variety of models (or at least their existing performance metrics applied across both models they try) would improve the paper.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 5000 characters. Note
however, that reviewers and area chairs are busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for their constructive
feedback. We would like to emphasize that the novelty of the method, which
addresses how to efficiently learn the dependency between latent variables
without explicit knowledge of the model, has been accepted as valid and
legitimate by the reviewers. We are confident this is a useful
contribution for making generic inference viable in practice. Omitted
comments will be fixed in revision if accepted.
1. R1 questions the
applicability of copulas for modelling dependence in a black box
framework.
Vine copulas have been successful in modelling
dependence for unknown high-dimensional distributions, see, e.g.,
Schirmacher and Schirmacher (2008), Chollete et al. (2009), Heinen and
Valdesogo (2009), de Melo Mendes et al. (2010), Czado et al. (2012), and
Nikoloulopoulos et al. (2012).
2. R1 and R2 ask whether the
procedure in selecting the vine structure and copula families is
sufficient.
The procedure we use for selecting them closely follows
the literature (Dissman et al., 2012). Because of the computational
infeasibility of searching through all vine structures, we use a greedy
approach that sequentially selects each tree based on Kendall's tau. The
intuition is that the largest dependencies should appear at the
highest-level trees, because changes in the copula parameters propagate
from top to bottom during inference; less significant dependencies should
appear at the lowest level where they are not as important to accurately
estimate. Copula researchers regard this strategy as the most successful
approach in practice, and it is used in all current vine software: UNICORN
by Ababei et al. (2007), CDVine by Brechmann and Schepsmeier (2013), and
VineCopula by Schepsmeier et al. (2015). We will add more details in the
paper.
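As an illustration of this greedy selection step, the first vine tree can be chosen as a maximum spanning tree over pairwise |Kendall's tau|. The sketch below is hypothetical (the helper `first_vine_tree` and its Prim-style construction are illustrative, not the authors' implementation):

```python
import numpy as np
from scipy.stats import kendalltau

def first_vine_tree(samples):
    """Greedy first tree: maximum spanning tree on |Kendall's tau|.

    samples: (n, d) array of draws; returns a list of (i, j) edges.
    Prim-style growth, so the strongest dependencies enter first.
    """
    n, d = samples.shape
    tau = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            t, _ = kendalltau(samples[:, i], samples[:, j])
            tau[i, j] = tau[j, i] = abs(t)

    in_tree, edges = {0}, []
    while len(in_tree) < d:
        # Add the strongest edge connecting the tree to a new variable.
        best = max(((i, j) for i in in_tree
                    for j in range(d) if j not in in_tree),
                   key=lambda e: tau[e[0], e[1]])
        edges.append(best)
        in_tree.add(best[1])
    return edges

# Example: x2 depends strongly on x0 and only weakly on anything else,
# so the edge (0, 2) should enter the tree.
rng = np.random.default_rng(1)
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)
x2 = x0 + 0.1 * rng.normal(size=500)
edges = first_vine_tree(np.column_stack([x0, x1, x2]))
```

Higher trees are then built on the conditioned pairs in the same greedy fashion.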
3. R1 and R3 ask about the scalability of the method to
higher-dimensional problems not present in the experiments.
The
algorithm scales linearly with the number of pairwise dependencies
modelled by the copula. In the most general case where one incorporates
all dependencies, the algorithm is quadratic in the number of latent
variables, which is unavoidable. In the examples we consider, the
algorithm models dependencies over all local variables, making the
dimension of the problems as large as the number of data points itself.
The experiments demonstrate its feasibility in such a scenario; in fact,
only a subset of the copula parameters is updated for a given subsample of
the data during the stochastic approximations. We will clarify this in the
paper.
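To make the quadratic count concrete (a small check consistent with the rebuttal's claim, not taken from it): a full regular vine on d latent variables contains d(d-1)/2 pair copulas, since tree T_j contributes d - j edges.

```python
def n_pair_copulas(d: int) -> int:
    """Pair copulas in a full regular vine on d variables:
    tree T_j has d - j edges, for j = 1, ..., d - 1."""
    return sum(d - j for j in range(1, d))

# Matches d * (d - 1) / 2, i.e. quadratic growth in d.
assert n_pair_copulas(100) == 100 * 99 // 2
```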
Regarding models with higher-dimensional global variables,
we have studied the mixture of Gaussians where the multivariate normals
have dimension 1000. We note that computation time is not affected because
the main bottleneck is modelling dependence over the local variables. We
will add this experiment to the supplement.
4. R2 asks about the
performance metrics in the experiments.
We have looked at
predictive accuracy for both examples, cf. prediction with the mixture
of Gaussians for the MNIST digits. Other mixture model parameters in the
experiment were omitted from the plot only because of similar results and
space constraints. As R1 suggests, we will move details for calculating
the gradient into the appendix and, as R2 suggests, use the space for
benchmarking other statistics of the posterior, including tails. This will
reinforce and better clarify the superiority of our method to
LRVB.
5. R1, R2, and R5 ask for comparisons to other
algorithms.
R1: Higher order mean-field methods such as TAP are
closely related to LRVB which we compare to. They can be seen as a Taylor
series correction based on the difference between the posterior and its
mean-field approximation (Kappen and Wiegerinck, 2001). As KW (2001)
mention, higher order methods assume the mean-field approximation is
reliable for the Taylor approximation to make sense, which is not true for
all probability distributions and thus is not robust in a black box
framework. We also show this sensitivity in our experiments for LRVB; see
lines 393-400. We will comment more on this in the paper.
R1: Other
black box methods, e.g., Ranganath et al. (2014), require specification of
the variational family. We compare to this in our experiments where the
variational family is the mean-field.
R1: HMC is slow in our
experiments because it evaluates a gradient over all data points per
iteration.
R2: Comparisons to structured approximations will be
dependent on the structure imposed by the choice of factorization for each
model. Moreover the proposed procedure using copulas does not aim to
compete against explicit factorizations, which are not black box; rather
it augments them by modelling any dependencies which do not already exist
in the factorization.
R5: A full-rank Gaussian approximation cannot be compared to, as it does
not support discrete variables such as the membership assignments in both
experiments.
