__ Summary and Contributions__: The paper focuses on learning 3D rotations on SO(3) in the form of probability distributions. As the authors explain, the SO(3) manifold is non-linear and closed (bounded), whereas deep neural networks normally return unbounded activations. Thus, it is not straightforward to map neural network activations onto the SO(3) manifold without resorting to more explicit structure in the neural network definition. To this end, the paper proposes to model SO(3) rotations with the matrix Fisher distribution. Although the Fisher distribution is unimodal, meaning that it cannot disambiguate between symmetries, it can model general rotations in a flexible manner.

__ Strengths__: - The problem that the paper addresses is very relevant and emerging. While most works so far approached the learning of rotations in a straightforward way, ignoring the geometric properties of the underlying rotation manifolds, this paper explains very clearly and eloquently the necessity for better geometrical representations.
- The matrix Fisher distribution is a very good fit to the problem of modelling SO(3) rotations, as its parameters are unconstrained (matching what neural networks can output) while the conditioning variable F, an R^{3x3} matrix, still carries the parameterizing structure of rotations. The resulting negative log-likelihood has interesting properties, like being Lipschitz continuous, although unfortunately the paper does not dwell much on this.
- Results look consistently better compared to recent state-of-the-art baselines [13, 16] and over a different number of classes.
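As a point of reference for the "unconstrained parameters" argument above (my own illustration, not code from the paper): the mode of a matrix Fisher distribution with parameter F, i.e. the rotation maximizing tr(F^T R), is recoverable from the SVD of F with a sign correction, which is exactly how an unconstrained network output can be mapped to a proper rotation.

```python
import numpy as np

def fisher_mode(F):
    """Mode of a matrix Fisher distribution with parameter F (3x3):
    the proper rotation maximizing tr(F^T R). Standard construction:
    take the SVD of F and flip the sign of the smallest singular
    direction so the result has determinant +1."""
    U, _, Vt = np.linalg.svd(F)
    D = np.diag([1.0, 1.0, np.linalg.det(U) * np.linalg.det(Vt)])
    return U @ D @ Vt

# Any unconstrained 3x3 output maps to a valid rotation.
R = fisher_mode(np.random.randn(3, 3))
```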

__ Weaknesses__: - As the authors themselves explain, the proposed solution for the intractable normalizing constant is inelegant and also rather unclear. It is not clear how the function is approximated, given that the exact version cannot be computed. Is some variational approximation used to ensure that the obtained function minimizes some distributional metric with respect to the true normalizing constant? Also, is there a way to estimate the crudeness of the approximation, perhaps in a simpler toy setting where the problem can be solved computationally?
- Importantly, given the approximation of the normalizing constant, is it still fair to claim that the output predictions lie on the SO(3) manifold? Given the high dimensionality (9 dimensions is still quite high geometrically), could the approximation lie quite far from the original SO(3) manifold? In that case, what exactly is the manifold learned and how does it relate to the SO(3) manifold?
- Given the unimodality of the distribution, what is the convergence behavior of the algorithm? Is it observed to get stuck in bad local optima, especially for classes that are either symmetric or whose visual similarity across different axes is not salient enough for the model to capture well, especially in the early stages of training?
- Some parts in the text could be written more clearly. For instance,
-- could the authors explicitly explain what is a proper rotation matrix in line 97?
-- what exactly is meant in l. 105-106 regarding solving the problem of the matrix being non positive semidefinite?
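To make the line-97 question concrete (this is my own illustration of the standard definition, not the authors' code): a proper rotation matrix is an orthogonal matrix with determinant +1, which can be checked in a few lines.

```python
import numpy as np

def is_proper_rotation(R, tol=1e-8):
    """A proper rotation matrix is orthogonal (R^T R = I) with
    det(R) = +1. An orthogonal matrix with det(R) = -1 is a
    reflection (an improper rotation) and does not lie on SO(3)."""
    return (np.allclose(R.T @ R, np.eye(3), atol=tol)
            and np.isclose(np.linalg.det(R), 1.0, atol=tol))
```

For example, the identity passes, while a reflection such as diag(1, 1, -1) is orthogonal but improper.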

__ Correctness__: Yes, the claims and the empirical methodology appear correct.

__ Clarity__: Yes, the paper is generally clearly written.

__ Relation to Prior Work__: Relation to prior work is well covered.

__ Reproducibility__: Yes

__ Additional Feedback__: Generally, I am positive of the submission, and I am lowering my score a bit because of the few weaknesses I listed above. I would be happy to raise my score with a convincing rebuttal.

__ Summary and Contributions__: This is a paper about modeling the distribution over SO(3) and using this distribution to compute the negative log-likelihood as a loss function in the training of a neural network, yielding a parameterized distribution as prediction. The authors provide an experimental study in 3D pose estimation from 2D images, rotated 3D model projections, and 3D poses from 2D head keypoints.

__ Strengths__: This is a careful analysis of applying the matrix Fisher distribution
as a representation for the conditional probability distribution output of a network learning an SO(3) element. A lot of calculations are taken from the excellent treatment in [11] and most of the theoretical grounding is in [11].
- The derivation of the constant of the Fisher distribution in the supplemental material is very helpful in obtaining an intuition about the influence of the singular values.
- The authors show a method following the definition in [4] on how to compute the derivatives of the normalizing constant, a perplexing task given the combinatorial definition of the hypergeometric function.
- Another strength is the convexity proof about the loss as well as the Lipschitz continuity of the loss itself and its derivatives.
- An ablation analysis was run on the experiments with respect to data augmentation, class embedding, and homography preprocessing.
- The method shows superior performance on Pascal3D+ and on ModelNet10-SO(3).

__ Weaknesses__: The following is a list of things I miss or may have misunderstood; the authors should respond.
- Regarding the backpropagation, it is not obvious to the reader whether the PyTorch SVD differentiation was used or some other method devised by the authors (apologies for not looking at the provided code).
- To understand the influence of the distribution, I would appreciate an experiment where only trace(F^T R) is used as a loss, without the normalizing constant. A geodesic distance is used by Mahendran, but for a fair comparison it would be good to run the authors' implementation with just the trace term.
- The authors provide qualitative results for ambiguous cases with an elongated marginalized distribution ([11]-based visualization). However, they do not provide any table relating the 2nd moment of the distributions (formulae in [11]) to the actual accuracy (like Figure 1 in [7]). That would be extremely helpful to see how well the Fisher distribution reflects accuracy.
- The closest approach to this work is [5]. Unfortunately, it was not tried on ModelNet10 or Pascal3D+, so the reader is left with the open question whether any probabilistic approach would perform better.
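The trace-only ablation I am asking for above is trivial to set up; a minimal sketch (my own hypothetical code, with made-up names, not the authors' implementation):

```python
import numpy as np

def trace_only_loss(F, R_gt):
    """Ablation: keep only the data term of the matrix Fisher
    negative log-likelihood, -tr(F^T R_gt), dropping the
    (approximated) normalizing constant entirely. F is the raw,
    unconstrained 3x3 network output; R_gt the ground truth."""
    return -np.trace(F.T @ R_gt)
```

Sanity check on a toy case: with identity ground truth, an output aligned with R_gt scores lower (better) than a reflected one, so the term alone already pulls F toward the correct rotation; the open question is how much the constant adds on top.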

__ Correctness__: The theory is correct and, as a matter of fact, the most interesting piece of the paper is the calculations in the supplement.
However, the empirical methodology lacks in two respects and leaves the reader with open questions:
- Is the addition of the normalizing constant in the loss function the main factor that improved performance (vs. just using the trace)?
- If this is indeed the case, the authors should compare with [5]. That would validate the use of the constant and provide a direct comparison between Fisher and Bingham.

__ Clarity__: The paper is very well written. Parts of the supplement on theory belong in the main paper (certainly the geometric interpretation).

__ Relation to Prior Work__: - One more paper containing useful formulae (preceding [11]):
Sei, T., Shibata, H., Takemura, A., Ohara, K., & Takayama, N. (2013). Properties and applications of Fisher distribution on the rotation group. Journal of Multivariate Analysis, 116, 440-455.
and a paper with a simple implementation:
Fletcher, P. T., Joshi, S., Lu, C., & Pizer, S. M. (2003). Gaussian distributions on Lie groups and their application to statistical shape analysis.
might be worth citing.
- The books by Gregory Chirikjian.
- Experimental comparison to [5].

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The paper proposes using the logarithm of the matrix Fisher distribution as a loss for training general-purpose predictors of object orientation. To allow gradient-based learning, the authors derive an efficient method to compute the gradient of the loss, whose most important component is a hand-crafted approximation of the normalizing constant of the Fisher distribution. The method is empirically evaluated on three computer vision benchmarks and shown to achieve state-of-the-art results.

__ Strengths__: Designing appropriate loss for learning object orientation predictors is a challenging problem with strong practical impact. The proposed model seems to be a valuable contribution which is both technically sound and provides empirical improvements against existing methods.

__ Weaknesses__: I am missing an experiment that would compare the proposed loss with the existing alternatives in a controlled setting (i.e. using the same architecture, training algorithm, and data, changing only the loss) in order to clearly show the differences/benefits. If I understand correctly, the results reported for competing methods were adopted from the corresponding papers, which use different settings.

__ Correctness__: The paper does not contain the explicit form of the used approximation of the normalizing constant. Hence it is difficult to verify the main claim that the proposed loss is convex.

__ Clarity__: yes

__ Relation to Prior Work__: yes

__ Reproducibility__: Yes

__ Additional Feedback__: The learned model provides a posterior distribution over the rotation matrices. Therefore, instead of using the distribution mode as the prediction, one could in principle use the Bayes optimal plug-in rule $\operatorname{argmin}_{\hat{R}} \int p(R|X)\, d(R,\hat{R})\, dR$. The question is how difficult it is to solve this inference problem.
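To illustrate the inference problem raised above: restricted to a discrete candidate set, the plug-in rule is a brute-force minimization of the expected geodesic distance. A sketch (all weights and candidates here are made up for illustration):

```python
import numpy as np

def geodesic(R1, R2):
    """Angular (geodesic) distance on SO(3)."""
    c = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.arccos(np.clip(c, -1.0, 1.0))

def rot_z(t):
    """Rotation by angle t about the z-axis."""
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t),  np.cos(t), 0.0],
                     [0.0,        0.0,       1.0]])

# Hypothetical discretized posterior p(R|X): candidates plus weights.
angles = np.linspace(0.0, np.pi / 2, 16)
cands = [rot_z(t) for t in angles]
w = np.exp(-angles)      # made-up, decaying weights (mode at angle 0)
w = w / w.sum()

# Plug-in rule on the candidate set:
# argmin over R_hat of sum_i w_i * d(R_i, R_hat).
risks = [sum(wi * geodesic(Ri, Rh) for wi, Ri in zip(w, cands))
         for Rh in cands]
R_bayes = cands[int(np.argmin(risks))]
```

Even this crude version shows the cost structure: the expected-distance risk of the Bayes estimate is by construction no worse than that of the mode, but evaluating it requires integrating (here, summing) over the posterior.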
By the authors' own admission, a conceptual problem is the unimodality of the distribution in the presence of rotationally symmetric objects. Could the paper discuss more how to resolve this problem?
The proposed loss has some properties (like convexity, unimodality, etc.) that are claimed to be important. It would be helpful to see which of these properties are present or missing in the competing methods.
I have read the author feedback.