NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019 at the Vancouver Convention Center
Paper ID: 4795
Title: This Looks Like That: Deep Learning for Interpretable Image Recognition

Reviewer 1


The prototypical parts network presented in this work is an original and potentially very useful learning framework for domains where process-based interpretability is critical. The method is thoroughly evaluated against alternative approaches and performs comparably to other state-of-the-art interpretable learning algorithms. The paper is well written, well motivated, and is accompanied by empirical results that validate the algorithmic contributions. Overall, I would recommend this paper for acceptance.

One place for improvement is the discussion of this work in the context of alternative interpretable approaches, specifically the methods that show comparable accuracy. The authors briefly mention the advantage of having parts-based interpretability. However, I would appreciate a longer discussion of the difference between this method and, for example, RA-CNN, which performs slightly better. This discussion would provide additional useful context for the prototypical parts network as well as an opportunity to discuss the tradeoffs of the different approaches.

Another place for improvement would be a more thorough exploration of the ProtoPNet architecture and hyperparameter choices. For example, what is the effect of changing the number of prototypes per class? Do the similarity scores (across prototypes) relate in a meaningful way to the confidence of the final class prediction (a sketch of what I have in mind is given below)? I also wonder whether there are some domains where finding prototypical images could be a useful goal in and of itself; I am curious what the authors think about this direction.

========== response to rebuttal ==========

After reading the thorough author response I have increased my score. I think this paper is a clear accept now.
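To make the question about similarity scores and prediction confidence concrete, here is a minimal sketch of how I understand the last layer to work: the class logits are a weighted combination of the prototype similarity scores, so each prototype's contribution to the predicted class should be readable from the learned weights. The shapes and variable names below are mine, not the authors' (I assume 10 prototypes per class on a 200-class problem and a bias-free fully connected layer).

import torch

# Hypothetical shapes: 200 classes with 10 prototypes each (my assumption, not taken from the paper's code).
num_classes, protos_per_class = 200, 10
num_prototypes = num_classes * protos_per_class

# One similarity score per prototype for a single input image.
similarity_scores = torch.rand(1, num_prototypes)

# The logits are a weighted sum of the similarity scores; the weight matrix
# determines how strongly each prototype votes for each class.
last_layer = torch.nn.Linear(num_prototypes, num_classes, bias=False)
logits = last_layer(similarity_scores)
confidence = torch.softmax(logits, dim=1)   # how similarity scores translate into class confidence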

Reviewer 2


The idea of learning prototypes to improve interpretation is interesting, and the comparison with prototypical cases provides a clear explanation of the classification result. The presentation of the paper is easy to follow. However, I still have some concerns/questions regarding the method and the experiments.

1. The choice of the prototypes P seems essential for the performance. It would be better if some more implementation details were provided. For example: 1) How is P initialized? 2) How are H_1 and W_1 chosen? Since the similarity is based on the L2 distance, the size of the prototypes may be important. 3) Is the update of P (between lines 183 and 184) performed every iteration? 4) How is m_k chosen? I believe some of the above questions might be answered by carefully checking the code provided by the authors, yet I think it would be helpful to summarize these choices in the paper, perhaps as an algorithm summary, so that readers get a better overview.

2. The comparison between the training latent patches and the prototypes, as well as the update of the prototypes, may increase the computational cost, since the algorithm needs to go over all possible patches (a rough sketch of this step is included below). What is the time complexity?

3. The interpretation results seem a bit weak to me, as only Figure 4 provides a comparison with related interpretation methods, and only on one example.

===========================

I like the idea of this paper, and the author response has addressed my concerns about complexity and the comparison with other methods. I have changed my score to 6 accordingly.
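To clarify what I am asking about in point 2, here is a rough sketch of the patch-prototype distance computation and of the projection step that pushes each prototype onto its nearest latent training patch. This is my own code, not the authors': I assume 1x1 prototypes (H_1 = W_1 = 1), use made-up function names, and ignore the restriction of each prototype to patches of its own class. The projection requires one distance evaluation per (patch, prototype) pair, which is the cost I would like the authors to quantify.

import torch

def l2_distances(conv_features, prototypes):
    # conv_features: (N, D, H, W) latent feature maps; prototypes: (P, D, 1, 1).
    # Returns the squared L2 distance between every spatial patch and every prototype, shape (N, P, H, W).
    n, d, h, w = conv_features.shape
    p = prototypes.shape[0]
    patches = conv_features.permute(0, 2, 3, 1).reshape(-1, d)   # (N*H*W, D)
    protos = prototypes.reshape(p, d)                            # (P, D)
    dists = torch.cdist(patches, protos) ** 2                    # (N*H*W, P)
    return dists.reshape(n, h, w, p).permute(0, 3, 1, 2)

def push_prototypes(conv_features, prototypes):
    # Projection step: replace each prototype with the nearest latent patch in the data.
    # Cost is one distance per (patch, prototype) pair, i.e. O(N * H * W * P * D).
    n, d, h, w = conv_features.shape
    patches = conv_features.permute(0, 2, 3, 1).reshape(-1, d)   # (N*H*W, D)
    dists = torch.cdist(prototypes.reshape(-1, d), patches)      # (P, N*H*W)
    nearest = dists.argmin(dim=1)                                # index of the closest patch per prototype
    return patches[nearest].reshape(prototypes.shape)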

Reviewer 3


The paper is well written, the problem is well motivated, and the references are sufficient. I don't see any major problems.

Lines 47-54 present important related work. Did the authors try to compare their results with the results of attention models? My understanding is that attention models would provide the regions that the classifier is looking at. Would those methods provide more accurate, similar, or the same regions? I understand that the main point of ProtoPNet is to identify prototypical cases, but perhaps the regions identified by the attention models would be equally good? Also, what is it about the existing attention methods that would not allow one to extract prototypical cases? Is there anything in those architectures that prevents this? It would be useful if the authors could explain that.

What is the impact of m_k (the number of prototypes for each class k)? Did the authors try smaller or larger values of this parameter? How was the current value of 10 determined? Humans probably look at fewer than 10 prototypical regions in their identification tasks.

How important was the L^2 distance in this algorithm? Did the authors try any other metrics (a sketch of one possible alternative is below)? How was L^2 selected?
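To make the metric question concrete, here is a small sketch (my own illustration, not the authors' code) of how an alternative distance could be slotted into the patch-prototype comparison; cosine distance is one candidate an ablation could try, since it discards the magnitude of the latent activations.

import torch
import torch.nn.functional as F

def patch_prototype_distance(patch, prototype, metric="l2"):
    # patch, prototype: 1-D tensors of the same latent dimension D.
    # "l2" reflects the L2-based similarity described in the paper; "cosine" is a hypothetical alternative.
    if metric == "l2":
        return torch.sum((patch - prototype) ** 2)
    if metric == "cosine":
        return 1.0 - F.cosine_similarity(patch.unsqueeze(0), prototype.unsqueeze(0)).squeeze()
    raise ValueError(f"unknown metric: {metric}")

# Example: distance between a random 512-dimensional patch and prototype.
patch, prototype = torch.rand(512), torch.rand(512)
print(patch_prototype_distance(patch, prototype, metric="cosine"))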