Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
The writing of this paper is clear, and the motivation for each designed loss function and term is explained well. There are some questions I would like the authors to answer:
1. As far as I know, this submission is not the first work to introduce GCNs for taking advantage of prior knowledge about Action Units. Please refer to the work "Semantic Relationships Guided Representation Learning for Facial Action Unit Recognition". I would appreciate a comparison with this work in the rebuttal, which could highlight the contributions.
2. This work achieves a 59.8 average on BP4D, but "Semantic Relationships Guided Representation Learning for Facial Action Unit Recognition" achieves 62.9. The performance gap seems rather large, doesn't it?
3. The submission says that a "dependency matrix" is utilized, which gives the impression that only dependency relations are considered. Are the mutually exclusive relations between Action Units not considered? From Equation 8, however, it seems mutually exclusive relations are also considered. (Please correct me if I am wrong here.)
4. The Lmv term, which orthogonalizes the weights of different classifiers, is not proposed for the first time in this work. As far as I know, the work "Taking A Closer Look at Domain Shift: Category-level Adversaries for Semantics Consistent Domain Adaptation" already proposes this; please check Equation 4 of that work.
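[For readers unfamiliar with the Lmv-style term the reviewer refers to: a common way such a weight-orthogonality penalty between two view-specific classifiers is written, sketched here in NumPy. This is an illustrative sketch of the general technique, not necessarily the submission's exact formulation; the function name and toy weights are hypothetical.]

```python
import numpy as np

def orthogonality_loss(w1, w2):
    """Penalize overlap between two classifiers' weight matrices.

    w1, w2: (feature_dim, num_classes) weight matrices of the two
    view-specific classifiers. The loss is the squared Frobenius norm
    of W1^T W2, which is zero exactly when every column of w1 is
    orthogonal to every column of w2.
    """
    return float(np.sum((w1.T @ w2) ** 2))

# Toy check (hypothetical 2-D weights): orthogonal vectors give zero loss.
w1 = np.array([[1.0], [0.0]])
w2 = np.array([[0.0], [1.0]])
print(orthogonality_loss(w1, w2))  # 0.0
```

Minimizing this term pushes the two classifiers (and hence the features they select) apart, which is the "different views" effect the reviewer is comparing against Equation 4 of the cited domain-adaptation work.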
The paper explores a method for exploiting multi-view training with label co-regularization for facial action unit recognition.
Strengths:
1. A method for exploiting unlabeled data for action unit recognition, a task that is consistently data-poor, so such a method could contribute a lot to the field.
2. A method for better multi-view training through orthogonalization and co-regularization of features.
3. Promising results on standard datasets.
Weaknesses:
1. One major risk of methods that exploit relationships between action units is that the relationships can differ greatly across datasets (e.g. AU6 can occur both in an expression of pain and in happiness, and this co-occurrence will be very different in a positive-salience dataset such as SEMAINE compared to something like the UNBC pain dataset). This difference in correlation can already be seen in Figure 1, with quite different co-occurrences of AU1 and AU12. A good way to test the generalization of such work is cross-dataset experiments, which this paper lacks.
2. The language in the paper is sometimes conversational rather than scientific (use of terms like "massive"), and several opinions and claims are unsubstantiated (e.g. "... facial landmarks, which are helpful for the recognition of AUs defined in small regions"); the paper could benefit from copy-editing.
3. Why are two instances of the same network (ResNet) used as different views? Would using a different architecture instead provide a more differing view? A justification for using two ResNet networks would be welcome.
4. Why is the approach limited to two views? It feels like the system should generalize to more views without much difficulty.
Minor comments:
- What is a PCA-style guarantee?
- What is v in Equation 2?
- Why are different numbers of unlabeled images used in training the BP4D and EmotioNet models?
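[To make the cross-dataset concern in Weakness 1 concrete: the AU relationships a model can exploit are just conditional co-occurrence statistics, and comparing them between datasets shows how unstable they are. A minimal sketch of computing such a matrix from binary AU labels; the function name and toy label array are hypothetical.]

```python
import numpy as np

def au_cooccurrence(labels):
    """Empirical AU co-occurrence matrix from binary labels.

    labels: (num_samples, num_aus) 0/1 array. Entry (i, j) estimates
    P(AU_j active | AU_i active). Computing this separately for, say,
    BP4D and a pain dataset and comparing the matrices quantifies how
    much the AU relationships shift across datasets.
    """
    labels = np.asarray(labels, dtype=float)
    counts = labels.T @ labels              # joint activation counts
    active = labels.sum(axis=0)             # per-AU activation counts
    return counts / np.maximum(active[:, None], 1.0)

# Toy example: 3 AUs over 4 samples (hypothetical labels).
y = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [1, 1, 0]])
print(au_cooccurrence(y))
```

A large elementwise difference between two datasets' matrices would directly support the reviewer's request for cross-dataset experiments.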
Trivia: "massive face images" -> "large datasets"; "donates" -> "denotes" (x2); "adjacent" -> "adjacency".
Good points: 1) Fairly well organized. 2) Solid experiments. 3) Ablation studies.
Questions to the authors:
1) The authors argue in Fig. 1 that two different networks learn different cues for AU recognition, but do not provide any solid evidence for this intuition, i.e., feature visualizations or examples that are correctly classified by one network but not the other because of the different cues learned and utilized.
2) The authors argue that using orthogonal weights for the last layer makes the feature generators conditionally independent. Why?
3) What is the baseline for Tables 1 and 2? If it is a single network, I am not convinced it is a fair comparison; the baseline should have a comparable number of parameters to the proposed method, perhaps two ResNets trained separately.
4) More supervised methods could be added to Tables 1 and 2 alongside JAA-Net, which would give readers a better idea of the effectiveness of semi-supervised methods.
5) There are some typos, e.g., in line 261, "donated" --> "denoted".