Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
This paper proposes a new cross-modal correlation learning method with adversarial samples. However, the paper still suffers from the following problems:

1. The proposed method aims to learn minimum perturbations. A major concern is whether the two perturbations for the two modalities are learned jointly or separately. A key point for cross-modal learning is the cross-modal correlations, yet the proposed scheme seems to learn the optimal perturbations separately.

2. CMLA aims to learn perturbations that do not change the intra-modal similarity relationships. An alternative would be to learn perturbations that enhance robustness and stability in both the cross-modal correlations and the intra-modal similarities. These seem to be two conflicting objectives.

3. I cannot see a remarkable difference between adversarial examples in the cross-modal and single-modal settings. What unique properties of cross-modal data are exploited in your method, and why are they important to cross-modal analysis, especially considering the limited development of single-modal adversarial examples? In other words, the paper may read as a simple combination of cross-modal analysis and adversarial examples, which reduces its novelty.

4. The experiments seem insufficient to verify the effectiveness of the proposed method, especially with only two baselines and two datasets. The performance on other challenging cross-modal tasks should be shown. In fact, I would prefer to see more experiments specifically designed for adversarial examples.

5. The notation is confusing and unclear, especially the o_i in the problem definition. In addition, although the meanings can be inferred from the retrieval background, the full names of T2T, I2T, and I2I should still be specified, the meanings of the different shapes in Figure 1 explained, and NR described carefully.
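To make the tension in point 2 concrete, here is a minimal toy sketch (entirely my own construction, not the paper's formulation; all names and the setup are hypothetical): a perturbed relaxed hash code is optimized with one term that attacks cross-modal agreement and a second term that pulls it back toward its clean code, standing in for intra-modal preservation.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 8                                      # hash code length

h_img  = np.tanh(rng.standard_normal(k))   # clean relaxed image code
b_text = np.sign(rng.standard_normal(k))   # binary code of the matching text

lam = 0.1          # weight on the intra-modal preservation term
h = h_img.copy()
for _ in range(50):
    # L = <h, b_text>  (minimized: cross-modal attack)
    #   + lam * ||h - h_img||^2  (penalizes drift from the clean code)
    grad = b_text + 2 * lam * (h - h_img)
    h = np.clip(h - 0.05 * grad, -1.0, 1.0)

print("cross-modal agreement:", float(h @ b_text))
print("intra-modal drift    :", float(np.linalg.norm(h - h_img)))
```

The two gradient terms point in different directions, which is exactly the "two adverse ends" the point raises: increasing lam weakens the attack, decreasing it sacrifices intra-modal stability.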
The paper tries to find adversarial samples in binary hashing when there are two modalities (image and text). The idea is to find the minimum perturbation to the image (text) that maximizes the Hamming distance between the binary code of the perturbed image (text) and the binary codes of the relevant texts (images). The authors show that adding these adversarial samples to the test set massively decreases the performance of the network.

Issues and questions:

1. My main concern is the novelty, as it is well known that adversarial samples exist in neural network models. How does the current approach differ from all previous approaches to finding adversarial samples?

2. In Eq. (2), the accuracy is defined based on the argmax of H(). But H returns a d-dimensional continuous vector, and its argmax does not mean anything in the hashing literature. This needs more explanation.

3. In Eq. (3), it is clear that for S_ij = 1 (similar pairs), the objective function finds the perturbation that maximizes the distance. But what happens for dissimilar pairs (S_ij = 0)?

4. My other main concern is the experiments. In Table 3, the models are trained by adding adversarial samples to the training set. Comparing the results to Table 1, we can see the accuracy decreases significantly (around 4% in some cases). This is disappointing, since we want to maintain accuracy as we make the network more robust.

------------------

As the authors mentioned in their response, finding adversarial examples in cross-modal learning is new in the literature and could lead to more robust networks. I have changed my score to above the threshold.
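For reference, a minimal numpy sketch of the kind of attack described above (my own toy reconstruction under an assumed linear hashing network, not the paper's CMLA implementation): iterated sign-gradient steps search for a bounded perturbation that pushes the image's relaxed hash code away from the binary code of a relevant text, i.e. increases their Hamming distance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the image hashing network (hypothetical): a linear
# map followed by tanh, the usual continuous relaxation of sign().
d, k = 16, 8                              # feature dim, code length
W = rng.standard_normal((k, d))

def relaxed_code(x):
    return np.tanh(W @ x)

def hamming(a, b):
    return int(np.sum(np.sign(a) != np.sign(b)))

x = rng.standard_normal(d)                # clean image feature
b_text = np.sign(rng.standard_normal(k))  # binary code of a relevant text

# PGD-style search for a box-bounded perturbation r that minimizes the
# agreement <tanh(W(x+r)), b_text>, i.e. maximizes Hamming distance.
eps, step = 0.5, 0.05
r = np.zeros(d)
for _ in range(100):
    h = relaxed_code(x + r)
    grad = W.T @ ((1 - h**2) * b_text)    # d/dx of <h, b_text>
    r = np.clip(r - step * np.sign(grad), -eps, eps)

print("Hamming before:", hamming(relaxed_code(x), b_text))
print("Hamming after :", hamming(relaxed_code(x + r), b_text))
```

Structurally this is the standard single-modal PGD recipe with a cross-modal target code substituted into the loss, which is part of what motivates the novelty question in point 1.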
1. The idea is interesting, and it works according to the experimental results.

2. The objective function in Equation 5 includes some equality constraints; I was wondering how these constraints are maintained during error back-propagation. This part should be carefully studied in the experimental section.

3. The authors state: "In this paper, we propose a novel Cross-Modal correlation Learning with Adversarial samples, namely CMLA, which, for the first time, presents the existence of adversarial samples in cross modalities data." However, there are other works that also explore using an adversarial loss to generate additional samples for cross-modal learning.

4. There are also many works on image and sentence matching that are closely related to this work and should be cited and compared:
   1. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In CVPR, 2018.
   2. Learning semantic concepts and order for image and sentence matching. In CVPR, 2018.
   3. Stacked cross attention for image-text matching. In ECCV, 2018.
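On point 2: a common workaround in deep hashing (an assumption on my part; the paper may handle the constraints in Equation 5 differently) is to relax the binary equality constraint b = sign(h) with tanh and back-propagate through the hard codes with a straight-through-style gradient. A minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 8)) * 0.1     # toy one-layer hashing "network"
x = rng.standard_normal(8)                # a single input feature
target = np.array([1., -1., 1., 1.])      # desired binary code

lr = 0.5
for _ in range(200):
    h = np.tanh(W @ x)        # relaxed code in (-1, 1)
    b = np.sign(h)            # hard code; sign() is non-differentiable
    # Straight-through estimator: treat db/dh as identity, so the loss
    # on the hard code b back-propagates through the relaxed code h.
    grad_b = b - target                   # grad of 0.5 * ||b - target||^2
    grad_W = np.outer(grad_b * (1 - h**2), x)
    W -= lr * grad_W

print(np.sign(np.tanh(W @ x)))            # hard code after training
```

The gradient vanishes once each bit matches the target, so the hard constraint is satisfied at convergence even though only the relaxed code is differentiated. Whether CMLA uses a relaxation like this or a penalty term is exactly what the point asks the authors to clarify.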